Talk

Reverse-engineering implicit regularization due to large learning rates in deep learning

  • Stanisław Jastrzębski (Molecule.one and Jagiellonian University)

Abstract

The early phase of training of deep neural networks has a dramatic and counterintuitive effect on the local curvature of the loss function. For instance, we found that using a small learning rate does not guarantee stable optimization, because the optimization trajectory tends to steer towards regions of the loss surface with increasing local curvature. Equally surprisingly, using a small learning rate negatively impacts generalization. I will discuss our journey in understanding these and other phenomena. The focus of the talk will be our mechanistic explanation of how using a large learning rate impacts generalization, which we corroborate by developing an explicit regularizer that reproduces its implicit regularization effects [1,2].
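The local curvature referred to here is commonly quantified by the largest eigenvalue of the loss Hessian, tracked along the optimization trajectory. As a rough illustration (not code from the papers themselves), the following minimal PyTorch sketch estimates that eigenvalue by power iteration on Hessian-vector products; the function name and `n_iters` are illustrative assumptions.

```python
import torch

def top_hessian_eigenvalue(loss, params, n_iters=20):
    """Estimate the largest eigenvalue of the loss Hessian by power
    iteration on Hessian-vector products (a standard way to monitor
    local curvature along a training trajectory)."""
    # First backward pass; keep the graph so we can differentiate again.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # Random starting direction, normalized to unit length.
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((u ** 2).sum() for u in v))
    v = [u / norm for u in v]
    eigenvalue = 0.0
    for _ in range(n_iters):
        # Hessian-vector product via a second differentiation.
        hv = torch.autograd.grad(grads, params, grad_outputs=v,
                                 retain_graph=True)
        # Rayleigh quotient of the current unit-norm direction.
        eigenvalue = sum((h * u).sum() for h, u in zip(hv, v)).item()
        norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
        v = [h / norm for h in hv]
    return eigenvalue
```

Calling this periodically on a mini-batch loss during the early phase of training yields the kind of curvature trace the abstract describes.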

[1] The Break-Even Point on Optimization Trajectories of Deep Neural Networks, Jastrzebski et al., ICLR 2020.

[2] Catastrophic Fisher Explosion: Early Phase Fisher Matrix Impacts Generalization, Jastrzebski et al., ICML 2021.
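The explicit regularizer of [2] penalizes the trace of the Fisher Information Matrix (a "Fisher penalty"). Below is a minimal sketch of that idea for a classification model, assuming PyTorch; the Fisher trace is approximated by the squared gradient norm under labels sampled from the model's own predictions, and `penalty_coef` is an illustrative hyperparameter rather than a value from the paper.

```python
import torch
import torch.nn.functional as F

def fisher_penalized_loss(model, x, y, penalty_coef=0.1):
    """Cross-entropy loss plus an estimate of the Fisher Information
    Matrix trace, in the spirit of the Fisher penalty of [2]. The trace
    is estimated as the squared gradient norm of the loss evaluated on
    labels sampled from the model's own predictive distribution."""
    logits = model(x)
    task_loss = F.cross_entropy(logits, y)

    # Labels drawn from the model's predictions (no gradient needed here).
    with torch.no_grad():
        sampled_y = torch.distributions.Categorical(logits=logits).sample()

    # Gradient of the sampled-label loss w.r.t. the parameters;
    # create_graph=True keeps the penalty itself differentiable.
    sampled_loss = F.cross_entropy(logits, sampled_y)
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(sampled_loss, params, create_graph=True)
    fisher_trace = sum((g ** 2).sum() for g in grads)

    return task_loss + penalty_coef * fisher_trace
```

Training with such a penalized objective at a small learning rate is, per [2], one way to reproduce explicitly the implicit regularization effect of a large learning rate.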

Links

Math Machine Learning seminar MPI MIS + UCLA

MPI for Mathematics in the Sciences Live Stream
