Implicit biases of large learning rates in machine learning
- Molei Tao (Georgia Institute of Technology)
This talk will discuss some nontrivial but often pleasant effects of large learning rates, which are commonly used in machine learning practice for improved empirical performance but defy traditional theoretical analyses. I will first quantify how large learning rates can help gradient descent escape local minima in multiscale landscapes via chaotic dynamics, which provides an alternative to the commonly known escape mechanism due to noise from stochastic gradients. I will then report how large learning rates provably bias gradient descent toward flatter minimizers. Several related, perplexing phenomena have been empirically observed recently, including Edge of Stability, loss catapulting, and balancing. I will, for the first time, unify them and explain that they are all algorithmic implicit biases of large learning rates. These results are enabled by the first global convergence proof of gradient descent for certain nonconvex functions without Lipschitz gradient. The theory also clarifies when Edge of Stability and other large-learning-rate implicit biases will occur.
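As a minimal sketch of the flat-minimizer bias mentioned above (my own illustration, not code from the talk; the function `gd` and its parameters are assumptions for demonstration): on a quadratic with second derivative `curvature`, gradient descent contracts only when the learning rate is below `2 / curvature`, so a step size that destabilizes a sharp minimum can still converge at a flat one.

```python
# Illustration (assumed, not from the talk): fixed-step gradient descent on
# f(x) = 0.5 * curvature * x**2. Iterates satisfy x <- (1 - lr*curvature) * x,
# which contracts iff lr < 2 / curvature.

def gd(curvature, lr, x0=1.0, steps=50):
    x = x0
    for _ in range(steps):
        x -= lr * curvature * x  # gradient of 0.5*curvature*x^2 is curvature*x
    return x

lr = 0.25                           # one fixed "large" learning rate
sharp = gd(curvature=10.0, lr=lr)   # lr > 2/10: iterates grow, sharp minimum is escaped
flat = gd(curvature=1.0, lr=lr)     # lr < 2/1: iterates contract toward the flat minimum

print(abs(sharp) > 1.0, abs(flat) < 1e-3)  # → True True
```

With the same step size, the sharp basin repels the iterates while the flat basin retains them, which is one way to see why large learning rates select flatter minimizers.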