In this short note we consider random fully connected ReLU networks of width
$n$ and depth $L$ equipped with a mean-field weight initialization. Our purpose
is to study the dependence on $n$ and $L$ of the maximal update ($\mu$P)
learning rate, the largest learning rate for which the mean squared change in
pre-activations after one step of gradient descent remains uniformly bounded at
large $n, L$. As in prior work on $\mu$P by Yang et al., we find that this
maximal update learning rate is independent of $n$ for all but the first and
last layer weights. However, we find that it has a non-trivial dependence on
$L$, scaling like $L^{-3/2}$.
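
The claimed scaling can be probed numerically. Below is a minimal sketch, not the authors' experimental setup: it builds a width-$n$, depth-$L$ fully connected ReLU network, assumes a He-style fan-in initialization $W_{ij} \sim \mathcal{N}(0, 2/n)$ as a stand-in for the paper's mean-field initialization, takes one gradient-descent step with learning rate $\eta = c\,L^{-3/2}$ on a toy squared loss, and measures the mean squared change in pre-activations. The constant $c$, the width, the input, and the loss are all arbitrary choices made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


def make_mlp(n, L):
    """Fully connected ReLU network of width n and depth L.

    Assumed init: W_ij ~ N(0, 2/n) per layer (He fan-in scaling),
    used here as a stand-in for the paper's mean-field initialization.
    """
    layers = []
    for i in range(L):
        lin = nn.Linear(n, n, bias=False)
        nn.init.normal_(lin.weight, std=(2.0 / n) ** 0.5)
        layers.append(lin)
        if i < L - 1:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)


def preacts(model, x):
    """Concatenate the pre-activation vectors at every linear layer."""
    zs, h = [], x
    for m in model:
        h = m(h)
        if isinstance(m, nn.Linear):
            zs.append(h)
    return torch.cat([z.flatten() for z in zs])


n, c = 256, 0.5                      # width and an O(1) learning-rate constant
x = torch.randn(1, n) / n ** 0.5     # fixed input with O(1) norm
y = torch.randn(1, n)                # arbitrary target for a toy squared loss

for L in [4, 16, 64]:
    model = make_mlp(n, L)
    z0 = preacts(model, x).detach()

    # One step of gradient descent with eta = c * L^{-3/2}.
    loss = ((model(x) - y) ** 2).mean()
    loss.backward()
    eta = c * L ** -1.5
    with torch.no_grad():
        for p in model.parameters():
            p -= eta * p.grad

    z1 = preacts(model, x).detach()
    print(f"L={L:3d}  mean sq. pre-activation change: {((z1 - z0) ** 2).mean():.3e}")
```

If the $L^{-3/2}$ scaling is right, the printed values should stay of roughly one order of magnitude as $L$ grows, while a depth-independent learning rate would blow up. Note one simplification: the sketch uses square layers and updates all weights with the same rate, ignoring the separate treatment the abstract flags for the first and last layers.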