Implicit bias of gradient descent for mean squared error regression with wide neural networks
Abstract
We investigate gradient descent training of wide neural networks and the corresponding implicit bias in function space. For univariate regression, the solution of training a width-n shallow ReLU network is within n^−1/2 of the function which fits the training data and whose difference from the initial function has the smallest 2-norm of the second derivative weighted by a curvature penalty depending on the probability distribution used to initialize the network parameters. We compute the curvature penalty function explicitly for various common initialization procedures. For multivariate regression we show an analogous result, in which the second derivative is replaced by the Radon transform of a fractional Laplacian. For initialization schemes that yield a constant penalty function, the solutions are polyharmonic splines. Moreover, we obtain results for different activation functions and show that the training trajectories are captured by trajectories of smoothing splines with decreasing regularization strength.
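To make the univariate statement concrete, the following display is a schematic rendering of the variational problem described above. The notation is introduced here for illustration and is not taken verbatim from the paper: f_0 denotes the network function at initialization, (x_i, y_i), i = 1, ..., m, the training data, and ρ the curvature penalty; the precise form of the weighted norm is given in the paper.

\[
  \min_{f \,:\, f(x_i) = y_i,\ i = 1,\dots,m} \;
  \int \rho(x)\,\bigl(f''(x) - f_0''(x)\bigr)^{2}\,dx
\]

The trained width-n network function is then within n^−1/2 of this constrained minimizer.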
This is joint work with Hui Jin; the paper is available at arxiv.org/abs/2006.07356.