Talk
Reimagining Gradient Descent: Large Stepsize, Oscillation, and Acceleration
- Jingfeng Wu (UC Berkeley)
Abstract
Gradient descent (GD) and its variants are pivotal in machine learning. Conventional wisdom suggests smaller stepsizes for stability, yet in practice, larger stepsizes often yield faster convergence, despite initial instability. In this talk, I will explain how large stepsizes provably accelerate GD for logistic regression on separable data in three settings.
- We start with logistic regression, a convex but non-strongly-convex problem. We show that GD with a large stepsize attains an ε-risk in Õ(1/√ε) steps. This matches the accelerated step complexity of Nesterov momentum and improves the classical Θ(1/ε) step complexity of GD with a small stepsize.
- We then consider ℓ₂-regularized logistic regression with regularization strength λ, a strongly convex problem with condition number κ = 1/λ. We show that GD with a large stepsize attains an ε-excess risk in Õ(√κ log(1/ε)) steps. This, again, matches Nesterov momentum and improves the Θ(κ log(1/ε)) step complexity of GD with a small stepsize.
- Finally, we consider the task of finding a linear separator of a linearly separable dataset with margin γ. We show that, with large and adaptive stepsizes, GD solves this task in Õ(1/γ²) steps by minimizing the logistic risk. We further show that this step complexity is minimax optimal among all first-order methods, and cannot be achieved by GD with small stepsizes.
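The phenomenon in the first bullet can be seen on a toy problem. The sketch below (my own illustration, not the speaker's experiments; the dataset, stepsizes, and step count are all assumptions chosen for the example) runs plain GD on the logistic risk over a two-point separable dataset and compares a small and a large constant stepsize after the same number of steps:

```python
# Illustrative sketch only: GD on logistic regression over a tiny
# separable dataset, with a small vs. a large constant stepsize.
# All numbers are assumptions for the example, not from the talk.
import numpy as np

# z_i = y_i * x_i; both points have margin 1 along the first axis.
Z = np.array([[1.0, 0.5],
              [1.0, -0.5]])

def logistic_risk(w):
    # L(w) = mean_i log(1 + exp(-z_i . w))
    return np.mean(np.log1p(np.exp(-Z @ w)))

def gd(stepsize, steps):
    w = np.zeros(2)
    for _ in range(steps):
        margins = Z @ w
        # gradient of the logistic risk:
        # mean_i of -z_i / (1 + exp(z_i . w))
        grad = -(Z * (1.0 / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
        w = w - stepsize * grad
    return logistic_risk(w)

risk_small = gd(stepsize=0.5, steps=10)
risk_large = gd(stepsize=8.0, steps=10)
print(risk_small, risk_large)
```

On this symmetric dataset the large stepsize simply makes faster progress toward the minimizer at infinity, so it reaches a strictly lower risk in the same budget; on less well-conditioned data the risk under a large stepsize can oscillate before it drops, which is the regime the talk analyzes.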