Talk
Where Does Mini-Batch SGD Converge?
- Pierfrancesco Beneventano (MIT)
Abstract
Training neural networks relies on mini-batch gradient methods navigating non-convex objectives that contain multiple manifolds of minima, each leading to different real-world performance. Which minima do these algorithms reach, and how do hyperparameters steer this outcome? First, we show a way in which SGD implicitly regularizes the features learned by neural networks. Next, we show that running SGD without replacement is locally equivalent to taking an additional gradient step on a novel regularizer. Finally, we introduce a method to characterize the convergence points of small linear networks.
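For readers unfamiliar with the distinction the abstract draws, the sketch below contrasts the two mini-batch sampling schemes: drawing each batch i.i.d. with replacement versus shuffling once per epoch and sweeping disjoint batches without replacement. It is a minimal illustration in plain NumPy on a toy least-squares objective; the function names and the toy setup are assumptions for exposition, not the speaker's actual experiments or results.

```python
# Minimal sketch (not from the talk): with- vs. without-replacement mini-batch SGD
# on a toy least-squares problem. All names here are illustrative assumptions.
import numpy as np


def sgd(X, y, w0, lr=0.01, batch_size=8, epochs=50,
        without_replacement=True, seed=0):
    """Mini-batch SGD on 0.5 * ||Xw - y||^2 / batch_size with two sampling schemes."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    w = w0.copy()
    steps_per_epoch = n // batch_size
    for _ in range(epochs):
        if without_replacement:
            # Without replacement: shuffle once, then sweep disjoint batches,
            # so every example is used exactly once per epoch.
            order = rng.permutation(n)
            batches = [order[k * batch_size:(k + 1) * batch_size]
                       for k in range(steps_per_epoch)]
        else:
            # With replacement: each batch is an i.i.d. draw from the dataset,
            # so some examples repeat and others are skipped within an epoch.
            batches = [rng.integers(0, n, size=batch_size)
                       for _ in range(steps_per_epoch)]
        for idx in batches:
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (Xb @ w - yb) / batch_size  # mini-batch gradient
            w -= lr * grad
    return w


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(256, 10))
    w_true = rng.normal(size=10)
    y = X @ w_true + 0.1 * rng.normal(size=256)
    w0 = np.zeros(10)
    w_wor = sgd(X, y, w0, without_replacement=True)
    w_wr = sgd(X, y, w0, without_replacement=False)
    print("without replacement, distance to w*:", np.linalg.norm(w_wor - w_true))
    print("with replacement,    distance to w*:", np.linalg.norm(w_wr - w_true))
```

The talk's claim concerns where these two procedures end up: the abstract states that the without-replacement scheme is locally equivalent to adding a gradient step on a novel regularizer, which the sketch does not attempt to reproduce.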