Recent Developments Towards a New Theory of Generalisation (Seminar)
- Nihat Ay
Abstract
Statistical learning theory (SLT) is a powerful framework for studying the generalisation ability of learning machines (their performance on new data). The corresponding bounds depend on capacity measures that quantify the expressivity of the underlying architecture (how many different functions it can represent).
Recent case studies have shown that the high performance of state-of-the-art learning machines, particularly deep neural networks, cannot be explained within the framework of classical SLT. On the one hand, such systems work extremely well in over-parametrised regimes, where the classical bounds, which depend strongly on the number of parameters, become uninformative. On the other hand, some evidence suggests that convergence and generalisation should be controlled by bounds that depend on some norm of the optimal solution rather than on the capacity of the architecture. The control of this norm is believed to depend strongly on the optimisation algorithm at hand.
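The norm-based perspective sketched above can be illustrated in miniature with over-parametrised least squares (an illustrative toy example, not part of the seminar material): with more parameters than data points, infinitely many solutions interpolate the training data, and gradient descent initialised at zero converges to the one of minimal Euclidean norm, which is precisely the kind of solution norm the newer bounds aim to control.

```python
import numpy as np

rng = np.random.default_rng(0)

# Over-parametrised least squares: n = 10 samples, d = 50 parameters.
n, d = 10, 50
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Among the infinitely many interpolating solutions of X w = y,
# the pseudoinverse selects the one of minimal Euclidean norm.
w_min_norm = np.linalg.pinv(X) @ y

# Gradient descent on the squared loss, started at w = 0, stays in the
# row space of X and hence converges to the same minimum-norm interpolant
# (its "implicit bias"; step size chosen small enough for stability).
w = np.zeros(d)
lr = 0.005
for _ in range(20000):
    w -= lr * X.T @ (X @ w - y)

print(np.linalg.norm(X @ w - y))         # ~0: interpolates the data
print(np.linalg.norm(w - w_min_norm))    # ~0: minimum-norm solution
```

So the number of parameters (50) says nothing here; what distinguishes the solution found by gradient descent is its norm.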
The seminar has been prepared in collaboration with Juan Pablo Vigneaux.
Part I. Introduction: Classical theory of learning and generalisation
A. Statistical learning theory
B. Capacity measures in SLT: VC-dimension, Rademacher complexity, etc.
- For A and B, see Bousquet, O., Boucheron, S. and Lugosi, G., 2003. Introduction to statistical learning theory. In Summer School on Machine Learning (pp. 169-207). Springer, Berlin, Heidelberg.
C. Optimisation: gradient descent and stochastic gradient descent
D. VC-dimension of neural networks
- Bartlett, P.L. and Maass, W., 2003. Vapnik-Chervonenkis dimension of neural nets. The handbook of brain theory and neural networks, pp.1188-1192.
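For item C, the two update rules can be sketched on a simple least-squares objective as follows (a minimal illustration with arbitrarily chosen step sizes, not code from the references):

```python
import numpy as np

rng = np.random.default_rng(1)

# A small least-squares problem: minimise L(w) = (1/2n) ||X w - y||^2.
n, d = 200, 5
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)

# Full-batch gradient descent: one exact gradient per step.
w_gd = np.zeros(d)
for _ in range(500):
    w_gd -= 0.1 * (X.T @ (X @ w_gd - y) / n)

# Stochastic gradient descent: one noisy single-sample gradient per
# step, with a decaying step size so that the noise averages out.
w_sgd = np.zeros(d)
for t in range(1, 20001):
    i = rng.integers(n)
    grad_i = (X[i] @ w_sgd - y[i]) * X[i]
    w_sgd -= (0.1 / t**0.6) * grad_i

# Both iterates end up close to the least-squares optimum.
w_star = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.linalg.norm(w_gd - w_star))   # very small
print(np.linalg.norm(w_sgd - w_star))  # small, noise-limited
```

The single-sample gradient is an unbiased estimate of the full gradient; the decaying step size trades off convergence speed against the residual noise floor.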
Part II. Puzzles and challenges posed by recent case studies
References:
- Zhang, C., Bengio, S., Hardt, M., Recht, B. and Vinyals, O., 2016. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.
- Gunasekar, S., Woodworth, B.E., Bhojanapalli, S., Neyshabur, B. and Srebro, N., 2017. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems (pp. 6151-6159).
- Belkin, M., Hsu, D., Ma, S. and Mandal, S., 2019. Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proceedings of the National Academy of Sciences, 116(32), pp.15849-15854.
Complementary references:
- Zhang, C., Liao, Q., Rakhlin, A., Miranda, B., Golowich, N. and Poggio, T., 2018. Theory of deep learning IIb: Optimization properties of SGD. arXiv preprint arXiv:1801.02254.
- Poggio, T., Kawaguchi, K., Liao, Q., Miranda, B., Rosasco, L., Boix, X., Hidary, J. and Mhaskar, H., 2017. Theory of deep learning III: explaining the non-overfitting puzzle. arXiv preprint arXiv:1801.00173.
And also the talks by:
- Peter Bartlett: Accurate prediction from interpolation
- Nati Srebro: Theoretical Perspectives on Deep Learning
- Mikhail Belkin: Beyond Empirical Risk Minimization: the lessons of deep learning
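The central observation of Zhang et al. above can be reproduced in miniature with random features (an illustrative sketch, not taken from the references): an over-parametrised model can fit even completely random labels, so any uniform capacity bound, such as one based on the VC-dimension, is necessarily uninformative about its generalisation.

```python
import numpy as np

rng = np.random.default_rng(2)

# n random inputs with completely random +/-1 labels.
n, d_in, d_feat = 50, 10, 200
X = rng.standard_normal((n, d_in))
y_random = rng.choice([-1.0, 1.0], size=n)

# Random ReLU features: a toy "network" with a frozen random first
# layer and a trained linear read-out on top.
W = rng.standard_normal((d_in, d_feat))
Phi = np.maximum(X @ W, 0.0)

# With d_feat >> n the feature matrix (almost surely) has full row
# rank, so the minimum-norm linear read-out interpolates pure noise.
a = np.linalg.pinv(Phi) @ y_random
train_error = np.mean(np.sign(Phi @ a) != y_random)
print(train_error)  # 0.0: random labels are fit perfectly
```

Since the same architecture fits both real and random labels with zero training error, whatever controls generalisation on real data cannot be the architecture's capacity alone.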
Part III. Theoretical perspectives and developments
- Bartlett, P.L., 1998. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE transactions on Information Theory, 44(2), pp.525-536.
- Bartlett, P.L., Long, P.M., Lugosi, G. and Tsigler, A., 2019. Benign overfitting in linear regression. arXiv preprint arXiv:1906.11300.
- Gunasekar, S., Lee, J.D., Soudry, D. and Srebro, N., 2018. Implicit bias of gradient descent on linear convolutional networks. In Advances in Neural Information Processing Systems (pp. 9461-9471).
Date and time info
Thursday 15:15 - 16:45
Prerequisites
Linear algebra, elementary probability theory, basic notions from functional analysis