Feature Learning in Shallow Neural Networks: From Theory to Optimization Algorithms
- Behrad Moniri (University of Pennsylvania)
Abstract
In this talk, we study the problem of feature learning in shallow neural networks. In the first part of the talk, we review the fundamental limitations of two-layer neural networks whose first-layer weights are fixed at initialization -- a model that does not learn features. We demonstrate that, in the high-dimensional proportional limit, this model cannot learn nonlinear functions, a consequence of the phenomenon known as Gaussian Equivalence. We then show that even a single step of gradient descent applied to the first layer drastically alters this picture: through a precise analysis of the spectrum of the resulting feature matrix, we illustrate how this one-step update breaks Gaussian Equivalence.
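To make the setting concrete, the following numpy sketch (our own illustrative code, not the speaker's) sets up a two-layer network f(x) = a^T sigma(Wx), freezes W to obtain the random-features model, and then forms the single full-batch gradient step on W whose effect on the feature-matrix spectrum the first part of the talk analyzes. The dimensions, ReLU activation, squared loss, step size, and target function are all assumptions made for illustration.

# Illustrative sketch: random-features model and one gradient step on the first layer.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 1000, 300, 300          # samples, input dim, hidden width (proportional regime, assumed)
eta = 1.0                          # step size for the one-step update (assumed)

X = rng.standard_normal((n, d))                            # isotropic Gaussian inputs
y = X[:, 0] * X[:, 1] + 0.1 * rng.standard_normal(n)       # a purely nonlinear target (illustrative)

W = rng.standard_normal((k, d)) / np.sqrt(d)               # first-layer weights at initialization
a = rng.standard_normal(k) / np.sqrt(k)                    # second-layer weights

relu = lambda z: np.maximum(z, 0.0)

# Random-features model: W stays fixed; only `a` would be trained.
F0 = relu(X @ W.T)                                         # feature matrix at initialization

# One full-batch gradient step on W for the squared loss (only W is updated).
resid = F0 @ a - y
grad_W = ((resid[:, None] * (X @ W.T > 0) * a[None, :]).T @ X) / n
W1 = W - eta * grad_W

F1 = relu(X @ W1.T)                                        # feature matrix after one step

# The talk compares the spectra of F0 and F1: the one-step update introduces
# spectral outliers that are absent under Gaussian Equivalence.
print(np.linalg.svd(F0, compute_uv=False)[:3])
print(np.linalg.svd(F1, compute_uv=False)[:3])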
In the second part of the talk, we revisit the two-layer network updated by one gradient descent step and also consider linear representation learning, another widely studied model of feature learning. We establish that, in both problems, gradient descent is a suboptimal feature-learning algorithm under general (anisotropic) input distributions, beyond the typical isotropic assumption. Furthermore, we show that layer-wise preconditioning methods emerge as the natural remedy. We thus provide the first learning-theoretic motivation for these popular deep learning optimization algorithms.
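To make the contrast concrete, the sketch below (again our own illustration, not the speaker's code) compares a plain gradient step with a layer-wise preconditioned step in a linear representation learning setup with anisotropic inputs. The specific preconditioner used here (the regularized inverse of the empirical input covariance, applied on the right of the first-layer gradient, in the spirit of K-FAC-style layer-wise methods) is an assumption; the exact preconditioner analyzed in the talk may differ.

# Illustrative sketch: plain vs. layer-wise preconditioned gradient step with anisotropic inputs.
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 1000, 100, 20            # samples, input dim, representation dim (assumed)
eta, lam = 0.5, 1e-3               # step size and ridge regularization (assumed)

# Anisotropic input covariance with eigenvalues spanning three orders of magnitude.
evals = np.logspace(0, -3, d)
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
Sigma_half = Q * np.sqrt(evals)                     # Sigma = Q diag(evals) Q^T
X = rng.standard_normal((n, d)) @ Sigma_half.T

W_star = rng.standard_normal((k, d)) / np.sqrt(d)   # planted linear representation (illustrative)
y = X @ W_star.T @ rng.standard_normal(k)           # linear target built from the representation

W = rng.standard_normal((k, d)) / np.sqrt(d)        # first-layer (representation) weights
a = rng.standard_normal(k) / np.sqrt(k)             # second-layer weights

# Gradient of the squared loss in W for the model f(x) = a^T W x.
resid = X @ W.T @ a - y
grad_W = np.outer(a, resid @ X) / n

# Plain gradient descent step.
W_gd = W - eta * grad_W

# Layer-wise preconditioned step: right-precondition the first-layer gradient
# by the regularized inverse of the empirical input covariance.
Sigma_hat = X.T @ X / n
W_pre = W - eta * grad_W @ np.linalg.inv(Sigma_hat + lam * np.eye(d))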