Learning Dynamics of Pretraining and Finetuning for Linear Models
- Ziqing Xu (University of Pennsylvania)
Abstract
In this talk, we study the learning dynamics and convergence rates of gradient-based methods for two-layer linear models. We first analyze the dynamics of gradient descent during pretraining and show that the optimization problem satisfies a local Polyak-Łojasiewicz (PL) condition and a local Descent Lemma, which together yield a linear convergence rate for a suitable choice of step sizes. Compared to prior work, our results require no restrictive assumptions on width, initialization, or step sizes, and achieve faster convergence rates. Next, we examine the finetuning stage through the lens of low-rank adaptation for matrix factorization. We show that gradient flow converges to a neighborhood of the optimal solution and that smaller initializations yield lower final errors. Our analysis reveals how the final error depends on the misalignment between the singular spaces of the pretrained model and the target matrix, and it highlights how reducing the initialization scale can improve alignment and hence performance.
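As a companion to the pretraining part, here is a minimal sketch (in Python/NumPy) of plain gradient descent on a two-layer linear model for matrix factorization, minimizing 0.5 * ||W2 W1 - M||_F^2. The dimensions, initialization scale, and constant step size are illustrative assumptions rather than the settings analyzed in the talk; the sketch only shows the dynamics whose convergence a local PL condition and a local descent lemma would characterize.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: a rank-h target M factored as W2 @ W1.
m, n, h = 20, 15, 10
M = rng.standard_normal((m, h)) @ rng.standard_normal((h, n)) / np.sqrt(h)

# Small random initialization of both factors (scale chosen for the demo).
W1 = 0.3 * rng.standard_normal((h, n))
W2 = 0.3 * rng.standard_normal((m, h))

eta = 0.02  # constant step size for the demo, not the schedule from the talk
losses = []
for t in range(5000):
    E = W2 @ W1 - M              # residual
    gW1 = W2.T @ E               # gradient of 0.5*||E||_F^2 w.r.t. W1
    gW2 = E @ W1.T               # gradient w.r.t. W2
    W1 -= eta * gW1
    W2 -= eta * gW2
    losses.append(0.5 * np.linalg.norm(E) ** 2)

# After an initial plateau near the small initialization, the loss decreases
# geometrically, the behavior a local PL condition plus a local descent
# lemma would predict.
print(f"loss at t=1000: {losses[1000]:.3e}, final loss: {losses[-1]:.3e}")
```

For the finetuning part, a similarly minimal sketch of a LoRA-style update for matrix factorization: a frozen pretrained matrix W_pre is adapted toward a target M_ft through a rank-r product B A, trained by gradient descent with a small step size as a stand-in for gradient flow, starting from a small initialization. The additive parametrization W_pre + B A and all sizes and scales here are assumptions for illustration and may differ from the exact setup studied in the talk.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative setup: the finetuning target differs from the pretrained matrix
# by a rank-r correction, which the LoRA factors B @ A must recover.
m, n, r = 20, 15, 2
W_pre = rng.standard_normal((m, n))
M_ft = W_pre + rng.standard_normal((m, r)) @ rng.standard_normal((r, n)) / np.sqrt(m)

alpha = 1e-3                        # small initialization scale for the adapters
B = alpha * rng.standard_normal((m, r))
A = alpha * rng.standard_normal((r, n))

eta = 0.05                          # small step size approximating gradient flow
for _ in range(4000):
    E = W_pre + B @ A - M_ft        # residual of the adapted model
    gB = E @ A.T                    # gradient of 0.5*||E||_F^2 w.r.t. B
    gA = B.T @ E                    # gradient w.r.t. A
    B -= eta * gB
    A -= eta * gA

# The talk relates the final error to the misalignment between the singular
# spaces of the pretrained model and the target, and argues that smaller
# initialization scales improve alignment; this run only shows the mechanics.
print("final finetuning error:", 0.5 * np.linalg.norm(W_pre + B @ A - M_ft) ** 2)
```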