Structure of Learning Tasks and the Information in the Weights of a Deep Network
- Alessandro Achille (University of California, Los Angeles)
What are the fundamental quantities needed to understand the learning process of a deep neural network? Why are some datasets easier to learn than others? What does it mean for two tasks to have a similar structure? We argue that information-theoretic quantities, and in particular the amount of information that SGD stores in the weights, can be used to characterize the training process of a deep network. Specifically, we show that the information in the weights bounds both the generalization error and the invariance of the learned representation. It also allows us to connect the learning dynamics with the "structure function" of the dataset, and to define a notion of distance between tasks that relates to fine-tuning. The non-trivial dynamics of information during training give rise to phenomena, such as critical periods for learning, that closely mimic those observed in humans. These dynamics may suggest that forgetting information about the training data is a necessary part of the learning process.