Self-Distillation Amplifies Regularization in Hilbert Space
- Hossein Mobahi (Google Research)
Knowledge distillation, introduced in the deep learning context, is a method for transferring knowledge from one architecture to another. When the two architectures are identical, the procedure is called self-distillation. The idea is to feed the predictions of the trained model back in as new target values for retraining, possibly iterating this loop a few times. It has been empirically observed that the self-distilled model often achieves higher accuracy on held-out data. Why this happens, however, has been a mystery: the self-distillation dynamics do not receive any new information about the task and evolve solely by looping over training. This talk will provide a rigorous theoretical analysis of self-distillation. We focus on fitting a nonlinear function to training data, where the model space is a Hilbert space and fitting is subject to L2 regularization in this function space. We show that self-distillation iterations modify regularization by progressively limiting the number of basis functions that can be used to represent the solution. This implies (as we also verify empirically) that while a few rounds of self-distillation may reduce over-fitting, further rounds may lead to under-fitting and thus worse performance.
This is joint work with Mehrdad Farajtabar (DeepMind) and Peter Bartlett (UC Berkeley).
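As a rough illustration of the loop described in the abstract, the following sketch runs self-distillation with kernel ridge regression on a toy 1-D problem. It is a simplified stand-in, not the formulation analyzed in the talk: the RBF kernel, the fixed penalty `lam`, the synthetic data, and the function names `rbf_kernel` and `self_distill` are all assumptions made for this example. With a fixed penalty, each round multiplies the fit by the same smoother matrix, so components along small-eigenvalue basis functions shrink fastest; the effective number of basis functions (the trace of the repeated smoother) drops with every round.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.2):
    """Gaussian (RBF) kernel matrix between two 1-D point sets (assumed kernel)."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)


def self_distill(K, y, lam, rounds):
    """Repeatedly refit kernel ridge regression to its own training predictions.

    Returns the list of training-set predictions after each round.
    """
    n = len(y)
    # Smoother matrix A = (K + lam*I)^{-1} K maps targets to fitted values.
    A = np.linalg.solve(K + lam * np.eye(n), K)
    targets, history = y, []
    for _ in range(rounds):
        targets = A @ targets  # predictions become the next round's targets
        history.append(targets)
    return history


# Toy data: a noisy sine wave (synthetic, for illustration only).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(30)

K = rbf_kernel(x, x)
lam, rounds = 0.1, 5
history = self_distill(K, y, lam=lam, rounds=rounds)

# Norm of the fitted values after each round.
norms = [float(np.linalg.norm(h)) for h in history]

# Effective number of basis functions after t rounds: trace of A^t,
# computed from the eigenvalues s of K as sum((s / (s + lam))^t).
s = np.linalg.eigvalsh(K)
shrink = s / (s + lam)
dof = [float(np.sum(shrink ** t)) for t in range(1, rounds + 1)]

for t, (nrm, d) in enumerate(zip(norms, dof), start=1):
    print(f"round {t}: ||fit|| = {nrm:.3f}, effective basis functions = {d:.2f}")
```

In this simplified setting the solution only ever gets smoother, which is consistent with the abstract's point that a few rounds may help with over-fitting while many rounds push toward under-fitting.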
Hossein Mobahi is a research scientist at Google Research. His current interests relate to the theory of deep learning. Prior to joining Google in 2016, he was a postdoctoral researcher in CSAIL at MIT. He obtained his PhD in Computer Science from the University of Illinois at Urbana-Champaign (UIUC). He has co-organized the ICML 2018 Workshop “Modern Trends in Nonconvex Optimization for Machine Learning”, ICML 2019 Workshop “Understanding and Improving Generalization in Deep Learning”, and NeurIPS 2020 Competition “Predicting Generalization in Deep Learning”.