Self-Distillation Amplifies Regularization in Hilbert Space

Hossein Mobahi (Google Research)

Live Stream

Abstract

Knowledge distillation, introduced in the deep learning context, is a method for transferring knowledge from one architecture to another. When the two architectures are identical, the procedure is called self-distillation. The idea is to feed the predictions of the trained model back in as new target values for retraining, possibly iterating this loop a few times. It has been observed empirically that the self-distilled model often achieves higher accuracy on held-out data. Why this happens, however, has been a mystery: the self-distillation dynamics receive no new information about the task and evolve solely by looping over training. This talk will provide a rigorous theoretical analysis of self-distillation. We focus on fitting a nonlinear function to training data, where the model space is a Hilbert space and fitting is subject to L2 regularization in this function space. We show that self-distillation iterations modify the regularization by progressively limiting the number of basis functions that can be used to represent the solution. This implies (as we also verify empirically) that while a few rounds of self-distillation may reduce overfitting, further rounds may lead to underfitting and thus worse performance.
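The retraining loop described in the abstract can be illustrated with a small numerical sketch. This is not the speaker's formulation: it assumes plain kernel ridge regression with a fixed ridge coefficient, a Gaussian kernel, and synthetic 1-D data, and all names (`rbf_kernel`, `self_distill`, `ridge`) are hypothetical. Under these assumptions, each round multiplies the component of the targets along the j-th kernel eigenvector by d_j / (d_j + ridge), so components with small eigenvalues fade fastest and fewer basis directions remain active as rounds accumulate.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.2):
    """Gaussian (RBF) kernel matrix between two 1-D point sets."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def self_distill(K, y, ridge, rounds):
    """Return the training targets used at each round, starting with y.

    Each round fits kernel ridge regression to the previous targets and
    takes its training-set predictions as the next targets:
        y_next = K (K + ridge * I)^{-1} y_prev
    """
    n = K.shape[0]
    # K and (K + ridge * I)^{-1} commute, so solving against K gives the smoother.
    smoother = np.linalg.solve(K + ridge * np.eye(n), K)
    targets = [y]
    for _ in range(rounds):
        targets.append(smoother @ targets[-1])
    return targets


# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, size=30))
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(30)

K = rbf_kernel(x, x)
ridge = 0.1
targets = self_distill(K, y, ridge, rounds=5)

# Per-round shrink factor of each eigen-direction of K.
eigvals, eigvecs = np.linalg.eigh(K)
eigvals = np.clip(eigvals, 0.0, None)  # remove tiny negative round-off
shrink = eigvals / (eigvals + ridge)

# Count directions that keep more than half of their original weight.
active = [int(np.sum(shrink ** t > 0.5)) for t in range(1, len(targets))]
for t, count in enumerate(active, start=1):
    print(f"round {t}: {count} eigen-directions retain > 50% of their weight")
```

The count printed per round can only stay the same or decrease, which mirrors the qualitative claim that early rounds strengthen regularization and later rounds can over-regularize. The talk's analysis concerns a more general setting, so this sketch should be read only as intuition.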

This is joint work with Mehrdad Farajtabar (DeepMind) and Peter Bartlett (UC Berkeley).

Bio:

Hossein Mobahi is a research scientist at Google Research. His current interests relate to the theory of deep learning. Prior to joining Google in 2016, he was a postdoctoral researcher in CSAIL at MIT. He obtained his PhD in Computer Science from the University of Illinois at Urbana-Champaign (UIUC). He has co-organized the ICML 2018 Workshop “Modern Trends in Nonconvex Optimization for Machine Learning”, the ICML 2019 Workshop “Understanding and Improving Generalization in Deep Learning”, and the NeurIPS 2020 Competition “Predicting Generalization in Deep Learning”.

Math Machine Learning seminar MPI MIS + UCLA

MPI for Mathematics in the Sciences Live Stream

Upcoming Events of this Seminar

Thursday, 12.06.25 Where Does Mini-Batch SGD Converge? with Pierfrancesco Beneventano
Thursday, 19.06.25 to be announced with Jingfeng Wu
Thursday, 03.07.25 to be announced with Xingyu Zhu
Thursday, 10.07.25 to be announced with Avrajit Ghosh a.o.
Thursday, 14.08.25 to be announced with Jonathan Siegel
Thursday, 02.10.25 to be announced with Marcello Carioni