Towards Building a Heavy-Tailed Theory of Stochastic Gradient Descent for Deep Neural Networks

Umut Şimşekli (INRIA - École Normale Supérieure (Paris))

Live Stream

Abstract

In this talk, I will focus on the 'tail behavior' in SGD in deep learning. I will first empirically illustrate that heavy tails arise in the gradient noise (i.e., the difference between the stochastic gradient and the true gradient). Accordingly I will propose to model the gradient noise as a heavy-tailed α-stable random vector, and accordingly propose to analyze SGD as a discretization of a stochastic differential equation (SDE) driven by a stable process. As opposed to classical SDEs that are driven by a Brownian motion, SDEs driven by stable processes can incur ‘jumps’, which force the SDE (and its discretization) transition from 'narrow minima' to 'wider minima', as proven by existing metastability theory and the extensions that we proved recently. These results open up a different perspective and shed more light on the view that SGD 'prefers' wide minima. In the second part of the talk, I will focus on the generalization properties of such heavy-tailed SDEs and show that the generalization error can be controlled by the Hausdorff dimension of the trajectories of the SDE, which is closely linked to the tail behavior of the driving process. Our results imply that heavier-tailed processes should achieve better generalization; hence, the tail-index of the process can be used as a notion of "capacity metric”. Finally, I will talk about the 'originating cause' of such heavy-tailed behavior and present theoretical results which show that heavy-tails can even emerge in very sterile settings such as linear regression with iid Gaussian data.

The talk will be based on the following papers:

U. Şimşekli, L. Sagun, M. Gürbüzbalaban, "A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks", ICML 2019

T. H, Nguyen, U. Şimşekli, M. Gürbüzbalaban, G. Richard, "First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise", NeurIPS 2019

U. Şimşekli, O. Sener, G. Deligiannidis, M. A. Erdogdu, "Hausdorff Dimension, Stochastic Differential Equations, and Generalization in Neural Networks", NeurIPS 2020

M. Gurbuzbalaban, U. Simsekli, L. Zhu, "The Heavy-Tail Phenomenon in SGD", arXiv, 2020

Links

seminar

14.08.25 02.10.25

Math Machine Learning seminar MPI MIS + UCLA Math Machine Learning seminar MPI MIS + UCLA

MPI for Mathematics in the Sciences Live Stream

See Details

Upcoming Events of this Seminar

Thursday, 14.08.25 to be announced with Jonathan Siegel
Thursday, 02.10.25 to be announced with Marcello Carioni