Asymptotics of Learning in Generalized Linear Models and Recurrent Neural Networks
- Melikasadat Emami (UCLA)
Modern machine learning models, particularly those used in deep learning, are characterized by massive numbers of parameters trained on large data sets. While these large-scale models have had tremendous practical success, developing theoretical methods that can rigorously explain when and why they work remains a major open challenge in the field. The non-convexity of the underlying learning problems makes this task even harder. In this talk, we shed light on the asymptotics of learning for two popular neural network models: Generalized Linear Models (GLMs) and Recurrent Neural Networks (RNNs).
First, we investigate how well single-layer neural networks (i.e., GLMs) generalize to previously unseen data. We provide a general framework that characterizes the asymptotic generalization error of GLMs with arbitrary non-linearities, which makes it applicable to both regression and classification problems. The framework makes it possible to analyze the effect of
(i) over-parameterization and non-linearity during modeling;
(ii) choices of the loss function, initialization, and regularizer during learning; and
(iii) mismatch between training and test distributions.
We also rigorously and analytically explain the double descent phenomenon in generalized linear models.
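As a rough numerical illustration of double descent, the sketch below uses the simplest GLM, linear regression with an identity link, fitted by minimum-norm least squares. It is not the framework from the talk. The function name `min_norm_test_error`, the Gaussian data model, and every constant (40 training samples, 120 total features, noise level 0.5) are assumptions chosen for illustration only.

```python
import numpy as np

def min_norm_test_error(n_train, p_used, d_total=120, noise_std=0.5,
                        trials=20, seed=0):
    """Average test MSE of a least-squares fit that sees only the first p_used features."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(trials):
        # Assumed data model: unit-norm true signal, isotropic Gaussian features.
        beta = rng.standard_normal(d_total)
        beta /= np.linalg.norm(beta)
        X = rng.standard_normal((n_train, d_total))
        y = X @ beta + noise_std * rng.standard_normal(n_train)

        # pinv yields ordinary least squares when p_used < n_train and the
        # minimum-norm interpolating solution when p_used > n_train.
        beta_hat = np.zeros(d_total)
        beta_hat[:p_used] = np.linalg.pinv(X[:, :p_used]) @ y

        # With isotropic Gaussian test features, the expected test MSE is
        # ||beta_hat - beta||^2 + noise variance, so no test set is sampled.
        errors.append(np.sum((beta_hat - beta) ** 2) + noise_std ** 2)
    return float(np.mean(errors))

n_train = 40
feature_counts = [10, 20, 40, 80, 120]
curve = {p: min_norm_test_error(n_train, p) for p in feature_counts}
for p, err in curve.items():
    print(f"p = {p:3d}  test MSE = {err:.3f}")
```

In this toy setting the test error spikes when the number of fitted features equals the number of training samples and falls again as the model becomes more over-parameterized, which is the qualitative shape the double descent result describes.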
Second, we explore the learning dynamics of recurrent neural networks under gradient descent. We focus on a subclass of RNNs with linear activations and give a precise explanation for the widely held view that RNNs perform poorly on tasks requiring long-term memory. Using recently developed kernel regime analysis, our main result shows that linear RNNs learned from random initializations are functionally equivalent to a certain weighted 1D-convolutional network. Importantly, the weightings in the equivalent model induce an implicit bias toward elements with smaller time lags in the convolution, and hence toward shorter memory. We show that the degree of this bias depends on the variance of the transition matrix at initialization. We validate the theory with experiments on both synthetic and real data.
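The link between a linear RNN and a 1D convolution can be checked numerically. The sketch below covers only the elementary unrolling identity, not the kernel regime result from the talk: a linear RNN h_t = W h_{t-1} + F x_t, y_t = C h_t started from a zero state produces the same output as a causal convolution with kernel theta_k = C W^k F. The symbols `W`, `F`, `C`, the variance parameter `nu`, the 1/n scaling of the transition matrix, and the scalar input and output are all assumptions made for illustration.

```python
import numpy as np

def run_linear_rnn(W, F, C, x):
    """Run h_t = W h_{t-1} + F x_t, y_t = C h_t from a zero state on a scalar sequence x."""
    h = np.zeros(W.shape[0])
    y = np.empty(len(x))
    for t, x_t in enumerate(x):
        h = W @ h + F * x_t
        y[t] = C @ h
    return y

def impulse_response(W, F, C, length):
    """Convolution kernel theta_k = C W^k F for lags k = 0, ..., length - 1."""
    theta = np.empty(length)
    v = F.copy()
    for k in range(length):
        theta[k] = C @ v   # v currently holds W^k F
        v = W @ v
    return theta

rng = np.random.default_rng(0)
n_hidden, T = 64, 30
nu = 0.25  # assumed variance scale of the transition matrix at initialization

# Assumed initialization: i.i.d. Gaussian entries with variance nu / n_hidden.
W = rng.standard_normal((n_hidden, n_hidden)) * np.sqrt(nu / n_hidden)
F = rng.standard_normal(n_hidden)
C = rng.standard_normal(n_hidden) / np.sqrt(n_hidden)
x = rng.standard_normal(T)

y_rnn = run_linear_rnn(W, F, C, x)
theta = impulse_response(W, F, C, T)
# Causal convolution: y_t = sum_{k <= t} theta_k * x_{t-k}.
y_conv = np.convolve(x, theta)[:T]

print("max |y_rnn - y_conv| =", np.max(np.abs(y_rnn - y_conv)))
print("|theta| at lags 0, 10, 20:", np.abs(theta[[0, 10, 20]]))
```

With `nu` below 1 the kernel magnitudes at initialization shrink quickly as the lag grows, which gives some intuition for why short lags dominate. How gradient descent weights the lags in the learned model is the subject of the talk's analysis and is not reproduced here.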