Abstract for the talk on 04.09.2020 (11:00 h)Deep Learning Theory Group Seminar
Ido Nachum (École Polytechnique Fédérale de Lausanne)
On the Information Complexity of Learning
Machine learning and information theory tasks are in some sense equivalent since both involve identifying patterns and regularities in data. To recognize an elephant, a child (or a neural network) observes the repeating pattern of big ears, a trunk, and grey skin. To compress a book, a compression algorithm searches for highly repeating letters or words. So the high-level question that guided this research is:
When is learning equivalent to compression?
We use the quantity I(input; output), the mutual information between the training data and the output hypothesis of the learning algorithm, to measure the compression of the algorithm. Under this information-theoretic setting, these two notions are indeed equivalent.
a) Compression implies learning. We will show that learning algorithms that retain a small amount of information from their input sample generalize.
b) Learning implies compression. We will show that under an average-case analysis, every hypothesis class of finite VC dimension (a characterization of learnable classes) has empirical risk minimizers (ERM) that do not reveal a lot of information. If time permits, we will discuss a worst-case lower bound we proved by presenting a simple concept class for which every empirical risk minimizer (also randomized) must reveal a lot of information and also observe connections with well-studied notions such as sample compression schemes, Occam’s razor, PAC-Bayes, and differential privacy.