We design and evaluate a Bayesian optimization framework for resource-efficient
pre-training of Transformer-based language models (TLMs). TLM pre-training
demands substantial computational resources and introduces many unresolved
design choices, such as the selection of pre-training hyperparameters.
We propose a multi-armed bandit framework for the sequential selection of TLM
pre-training hyperparameters, aimed at optimizing language model performance
in a resource-efficient manner. We design a Thompson sampling algorithm with a
surrogate Gaussian process reward model of the Masked Language Model (MLM)
pre-training objective, which it sequentially minimizes. Instead of MLM
pre-training with fixed masking probabilities, the proposed Gaussian
process-based Thompson sampling (GP-TS) accelerates pre-training by
sequentially selecting masking hyperparameters that improve performance. We
empirically demonstrate that GP-TS pre-trains language models efficiently,
i.e., it achieves lower MLM loss in fewer epochs across a variety of settings.
In addition, TLMs pre-trained with GP-TS attain competitive downstream performance,
while avoiding expensive hyperparameter grid search. GP-TS provides an
interactive framework for efficient and optimized TLM pre-training that, by
circumventing costly hyperparameter selection, enables substantial
computational savings.
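
As a rough illustration of the loop described above, the following is a minimal sketch of GP-based Thompson sampling over the masking probability. The `pretrain_and_eval` routine, the candidate grid, the Matern kernel, and the interaction count are illustrative assumptions, not the authors' implementation or settings; in practice, `pretrain_and_eval` would run one TLM pre-training interaction and report the validation MLM loss.

```python
# Sketch of GP-based Thompson sampling (GP-TS) over the MLM masking probability.
# Assumptions: scikit-learn's GP regressor as the surrogate reward model,
# reward = negative MLM loss, and a toy stand-in for the pre-training step.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)

def pretrain_and_eval(p: float) -> float:
    # Toy placeholder: replace with one real pre-training interaction at
    # masking probability p, returning the observed validation MLM loss.
    return (p - 0.18) ** 2 + 0.01 * rng.standard_normal()

# Candidate arms: a grid of masking probabilities (illustrative range).
candidates = np.linspace(0.05, 0.5, 50).reshape(-1, 1)

X, y = [], []  # observed arms and rewards (negative MLM loss)
for t in range(20):  # sequential bandit interactions
    if not X:
        # No observations yet: pick an arm uniformly at random.
        arm = candidates[rng.integers(len(candidates))]
    else:
        # Fit the GP surrogate to all (arm, reward) pairs seen so far.
        gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
        gp.fit(np.array(X), np.array(y))
        # Thompson sampling: draw one function from the GP posterior
        # and play the arm that maximizes the sampled reward.
        sample = gp.sample_y(candidates, n_samples=1,
                             random_state=int(rng.integers(1 << 31)))
        arm = candidates[int(np.argmax(sample))]
    loss = pretrain_and_eval(float(arm[0]))
    X.append(arm)
    y.append(-loss)  # minimizing MLM loss == maximizing reward
```

The key design choice, per the abstract, is treating the MLM loss as a bandit reward signal: each interaction fits the GP posterior to the losses observed so far, so the masking probability adapts over the course of pre-training rather than staying fixed.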