We present a theory for the previously unexplained divergent behavior observed
during the training of large language models. We argue that the phenomenon is an
artifact of the dominant optimization algorithm used for training, called Adam.
We observe that Adam can enter a state in which the parameter update vector has
a relatively large norm and is essentially uncorrelated with the direction of
descent on the training loss landscape, leading to divergence. This artifact is
more likely to be observed in the training of a deep model with a large batch
size, which is the typical setting of large-scale language model training. To
support the theory, we present observations from training runs of language
models at different scales: 7 billion, 30 billion, 65 billion, and 546
billion parameters.
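
The two quantities the abstract refers to, the norm of the parameter update vector and its correlation with the descent direction, can be illustrated with a minimal sketch. It assumes the standard bias-corrected Adam update rule and uses cosine similarity with the negative gradient as a stand-in for "correlation with the direction of descent". The helper names (`adam_step`, `update_diagnostics`) and the hyperparameter defaults are hypothetical choices for illustration and are not taken from the paper.

```python
import math


def adam_step(m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard bias-corrected Adam step (illustrative, not the paper's code).

    m, v  : running first and second moment estimates (lists of floats)
    grad  : current gradient (list of floats)
    t     : 1-based step counter
    Returns (update, new_m, new_v); the update is what gets added to the parameters.
    """
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, grad)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]  # bias correction
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    update = [-lr * mh / (math.sqrt(vh) + eps) for mh, vh in zip(m_hat, v_hat)]
    return update, m, v


def norm(x):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(xi * xi for xi in x))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na, nb = norm(a), norm(b)
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(ai * bi for ai, bi in zip(a, b)) / (na * nb)


def update_diagnostics(update, grad):
    """Return (update norm, cosine similarity with the descent direction -grad)."""
    descent = [-g for g in grad]
    return norm(update), cosine(update, descent)


# Usage: a single step from zero-initialized moments on a toy gradient.
grad = [3.0, -4.0]
update, m, v = adam_step([0.0, 0.0], [0.0, 0.0], grad, t=1)
update_norm, alignment = update_diagnostics(update, grad)
```

In this toy step the update stays well aligned with the descent direction. The state described above would instead show up as a relatively large `update_norm` together with an `alignment` close to zero; logging both quantities per layer during training is one possible way to watch for it.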