Theory of Stable Scaling in Deep Learning Systems: Infinite Limits, Hyperparameter Transfer, and Scaling Laws
- Blake Bordelon (Harvard)
Abstract
Deep learning systems currently form the backbone of modern artificial intelligence. In recent years, significant advances have been made by scaling up both the size of deep learning models and the amount of compute and data used during optimization. Despite the successes achieved by scaling up deep learning systems, common scaling strategies often waste significant amounts of compute. Naive scaling protocols require re-tuning hyperparameters (such as learning rate, batch size, and parameter initialization) with grid searches as the model size or the number of training steps varies, which can quickly become computationally prohibitive. This motivates theoretically principled scaling protocols that enable predictable behavior across scales. In this talk, we discuss recent theoretical works that aim to derive stable scaling strategies by identifying stable, feature-learning infinite-width and infinite-depth limits of neural networks. The asymptotic description of randomly initialized networks in this regime will take the form of a dynamical mean field theory (DMFT). We will discuss how adopting scaling strategies that admit such limits yields better hyperparameter transfer, where optimal hyperparameters in small models remain optimal in large models. We will provide examples of these results for multi-layer perceptrons, convolutional networks, self-attention blocks, and mixture-of-experts transformers. Lastly, we will discuss simplified, analytically tractable models that enable analysis of training dynamics far from the infinite limit, providing solvable toy models of neural scaling laws and hyperparameter transfer across training horizons.
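
For orientation, here is a rough sketch (not taken from the talk itself) of the kind of width-and-depth parameterization the abstract alludes to, written under common mean-field / muP conventions; the symbols gamma (readout richness) and eta_0 (base step size) are illustrative choices, not notation from the talk. For a residual network of width n and depth L with nonlinearity phi,

\[
h^{\ell+1}(x) \;=\; h^{\ell}(x) \;+\; \frac{1}{\sqrt{L\,n}}\, W^{\ell}\, \phi\!\left(h^{\ell}(x)\right), \qquad W^{\ell}_{ij} \sim \mathcal{N}(0,1),
\]
\[
f(x) \;=\; \frac{1}{\gamma\, n}\, w^{L} \cdot \phi\!\left(h^{L}(x)\right), \qquad \eta \;=\; \eta_0\, \gamma^{2}\, n \ \ \text{(for SGD)},
\]

so that the readout is downscaled with width, residual branches are downscaled with depth, and the learning rate is scaled up with width; under scalings of this type, feature-kernel updates remain order one as n, L go to infinity, which is what permits a feature-learning DMFT limit and hyperparameter transfer across scales.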