Zusammenfassung für den Vortrag am 16.06.2022 (17:00 Uhr)Math Machine Learning seminar MPI MIS + UCLA
Jingfeng Wu (Johns Hopkins University)
A Fine-Grained Characterization for the Implicit Bias of SGD in Least Square Problems
Siehe auch das Video dieses Vortrages.
Siehe auch die Vortragsfolien dieses Vortrages.
There is an increasing realization that algorithmic inductive biases are central in preventing overfitting; empirically, we often see that machine learning models trained by stochastic gradient descent (SGD) generalize well, even in the overparameterized settings and under little to no explicit regularization. This work considers this issue in arguably the most basic setting: averaged or last-iterate SGD for linear regression in the overparameterized regime. Our main result provides a sharp excess risk bound, stated in terms of the full eigenspectrum of the data covariance matrix, that reveals a bias-variance decomposition characterizing when generalization is possible: (i) the variance bound is characterized in terms of an effective dimension (specific for SGD) and (ii) the bias bound is characterized in terms of the optimal model parameter and how it aligns with the data covariance matrix. Moreover, for SGD with iterate averaging, we prove a matching lower bound (up to constant factors); for last iterate SGD (with decaying stepsizes), we provide a matching lower bound (up to constant factors) when the signal-to-noise ratio is bounded.
The above results have the following additional implications:
(1) We compare the implicit regularization afforded by SGD with the explicit regularization of ridge regression. We find that, up to the logarithmic factors, the generalization performance of SGD is always no worse than that of ridge regression in a wide range of overparameterized problems, and, in fact, could be much better for some problem instances.
(2) We study the effects of transfer learning with pretraining and finetuning in the presence of covariate shift. We find that for a large class of linear regression instances, transfer learning with O(N^2) source data (and scarce or no target data) is as effective as supervised learning with N target data.