The increasing dimensionality of data in the modern machine learning age presents new challenges and opportunities. The high-dimensional settings allow one to use powerful asymptotic methods from probability theory and statistical physics to obtain precise asymptotic characterizations of the generalization errors and of the benefits of overparametrization. I will present and review some recent works in this direction, and discuss what they teach us in the broader context of generalization, double descent, and over-parameterization in modern machine learning problems.
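As a rough, self-contained illustration of the double descent phenomenon mentioned above (my own sketch, not taken from the reviewed works), the following Python snippet fits ridgeless least squares on random ReLU features of varying dimension; the test error typically peaks near the interpolation threshold and decreases again in the overparameterized regime.

```python
# Illustrative sketch (not from the reviewed works): double descent for
# ridgeless least squares on random ReLU features. The test error typically
# peaks near the interpolation threshold (n_features ~ n_train) and drops
# again as the model becomes heavily overparameterized.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 100, 2000, 20

# Noisy linear teacher generating the data.
w_star = rng.normal(size=d)
X_tr = rng.normal(size=(n_train, d))
X_te = rng.normal(size=(n_test, d))
y_tr = X_tr @ w_star + 0.5 * rng.normal(size=n_train)
y_te = X_te @ w_star

def test_mse(n_features):
    # Random ReLU feature map, shared between train and test.
    W = rng.normal(size=(d, n_features)) / np.sqrt(d)
    F_tr, F_te = np.maximum(X_tr @ W, 0), np.maximum(X_te @ W, 0)
    # Minimum-norm least-squares solution (ridgeless limit).
    coef = np.linalg.pinv(F_tr) @ y_tr
    return np.mean((F_te @ coef - y_te) ** 2)

for p in [10, 50, 90, 100, 110, 200, 1000]:
    print(f"features = {p:5d}   test MSE = {test_mse(p):.3f}")
```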
Dropout regularisation for training neural networks turns out to be very successful in practical applications. The empirical explanation of this success is based on reducing the co-adaptation of features during training. Moreover, practitioners observe that 'training with dropout converges not faster, but to a better local minimum'. However, there is hardly any mathematical understanding of these statements. In this talk I want to give a mathematical interpretation of the last statement, discuss a continuous-time model of training with dropout, and explain why it 'converges to a better local minimum' than conventional training does.
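For readers unfamiliar with the mechanism, the following minimal sketch (standard inverted dropout, not the continuous-time model of the talk) shows dropout as multiplying hidden activations by independent Bernoulli masks during training.

```python
# Minimal sketch of standard (inverted) dropout: hidden activations are
# multiplied by independent Bernoulli masks during training and rescaled
# so that their expectation matches the test-time forward pass.
import numpy as np

rng = np.random.default_rng(0)

def forward(x, W1, W2, p_drop=0.5, train=True):
    h = np.maximum(x @ W1, 0)                       # hidden ReLU layer
    if train:
        mask = rng.binomial(1, 1 - p_drop, size=h.shape)
        h = h * mask / (1 - p_drop)                 # drop and rescale
    return h @ W2

x = rng.normal(size=(4, 8))                         # a batch of 4 inputs
W1 = rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 1))
print(forward(x, W1, W2, train=True).shape)         # (4, 1)
```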
Niklas Breustedt (TU Braunschweig, Germany): Unrolling versus bilevel optimization in the context of learning variational models
Styliani Kampezidou (Georgia Institute of Technology, USA): Online Adaptive Learning in Energy Trading Stackelberg Games with Time-Coupling Constraints
An energy trading mechanism is proposed between a selfish energy broker (aggregator) and her selfish energy customers (prosumers). The proposed design is a Stackelberg game in which the aggregator trades energy bidirectionally between the prosumers and the wholesale electricity market for profit. To satisfy the prosumers' desired energy consumption, time-coupling constraints are introduced. The resulting game does not admit closed-form equilibrium strategies, so an online adaptive learning algorithm is proposed to address this challenge. The algorithm scales with the number of prosumers and does not require explicit knowledge of the prosumers' decision-making mechanisms. Experimental results based on real-world data from the California market demonstrate the performance of the proposed approach.
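A purely hypothetical toy version of such a price-adaptation loop (invented for exposition, not the proposed algorithm; all quantities below are made up) could look as follows: the aggregator repeatedly posts a retail price, observes the prosumers' total demand, and updates the price from observed profit alone, without any model of the prosumers' decision-making.

```python
# Hypothetical toy illustration (invented for exposition, not the proposed
# algorithm): the aggregator repeatedly posts a retail price, observes the
# prosumers' total demand, and adapts the price with a zeroth-order profit
# gradient estimate, using no model of the prosumers' decision making.
import numpy as np

rng = np.random.default_rng(0)
n_prosumers, wholesale_price = 50, 0.10
sensitivity = rng.uniform(20, 40, size=n_prosumers)   # private to prosumers

def total_demand(price):
    # Prosumers' (hidden) best responses: demand decreases with the price.
    return np.maximum(20 - sensitivity * price, 0).sum()

price, step, delta = 0.50, 1e-4, 1e-3
for t in range(2000):
    profit = (price - wholesale_price) * total_demand(price)
    profit_plus = (price + delta - wholesale_price) * total_demand(price + delta)
    grad_estimate = (profit_plus - profit) / delta      # two-point estimate
    price = float(np.clip(price + step * grad_estimate, 0.0, 1.0))

print(f"learned retail price ~ {price:.3f}")
```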
Iosif Lytras (University of Edinburgh, United Kingdom): Taming neural networks with TUSLA: Non-convex learning via adaptive stochastic gradient Langevin algorithms
Joint work with Attila Lovas, Miklós Rásonyi and Sotirios Sabanis. arXiv: https://arxiv.org/pdf/2006.14514.pdf
Johannes Müller (Max Planck Institute for Mathematics in the Sciences, Germany): The geometry of discounted stationary distributions of Markov decision processes
Oren Neumann (Goethe University Frankfurt am Main, Germany): Investment vs. reward in competitive games
Maximilian Steffen (Universität Hamburg, Germany): PAC-Bayesian Estimation for High-Dimensional Multi-Index Regression with Unknown Active Dimension
David Szeghy (Eötvös Loránd University (ELTE), Hungary): Adversarial Perturbation Stability of the Layered Group Basis Pursuit
Hanna Tseran (MPI for Mathematics in the Sciences, Germany): On the Expected Complexity of Maxout Networks
Jesse van Oostrum (Hamburg University of Technology, Germany): Parametrisation Independence of the Natural Gradient in Overparametrised Systems
Csongor-Huba Varady (Max Planck Institute for Mathematics in the Sciences (MiS), Leipzig, Germany): Natural Reweighted Wake-Sleep for Convolutional Networks
Xiaoyu Wang (University of Cambridge, United Kingdom): Lifted Bregman Networks
Chia Zargeh (University of Sao Paulo, Brazil): Applications of associative algebras in machine learning
We provide an Information-Geometric formulation of Classical Mechanics on the Riemannian manifold of probability distributions, which is an affine manifold endowed with a dually-flat connection. In a non-parametric formalism, we consider the full set of positive probability functions on a finite sample space, and we provide a specific expression for the tangent and cotangent spaces over the statistical manifold, in terms of a Hilbert bundle structure that we call the Statistical Bundle. In this setting, we compute velocities and accelerations of a one-dimensional statistical model using the canonical dual pair of parallel transports and define a coherent formalism for Lagrangian and Hamiltonian mechanics on the bundle. Finally, in a series of examples, we show how our formalism provides a consistent framework for accelerated natural gradient dynamics on the probability simplex, paving the way for direct applications in optimization, game theory and neural networks. The work is based on a joint collaboration with Goffredo Chirco and Giovanni Pistone: https://arxiv.org/abs/2009.09431
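As a minimal illustration of natural gradient dynamics on the probability simplex (my own sketch under simplifying assumptions, not the formalism of the paper), the following snippet minimizes a linear objective E_p[f]; with the Fisher metric the flow is of replicator type, discretized here as a multiplicative (exponentiated-gradient) update.

```python
# Small sketch (under simplifying assumptions, not the paper's formalism):
# natural gradient descent of the linear objective E_p[f] on the probability
# simplex, discretized as a multiplicative (exponentiated-gradient) update.
import numpy as np

f = np.array([3.0, 1.0, 2.0, 0.5])     # per-outcome losses
p = np.full_like(f, 1.0 / f.size)      # start from the uniform distribution
eta = 0.1                              # step size

for t in range(200):
    p = p * np.exp(-eta * f)           # multiplicative natural-gradient step
    p = p / p.sum()                    # renormalize onto the simplex

print(np.round(p, 3))                  # mass concentrates on argmin f (last entry)
```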
We investigate gradient descent training of wide neural networks and the corresponding implicit bias in function space. For univariate regression, the solution of training a width-n shallow ReLU network is within n^(-1/2) of the function which fits the training data and whose difference from the initial function has the smallest 2-norm of the second derivative weighted by a curvature penalty that depends on the probability distribution used to initialize the network parameters. We compute the curvature penalty function explicitly for various common initialization procedures. For multivariate regression we show an analogous result, whereby the second derivative is replaced by the Radon transform of a fractional Laplacian. For initialization schemes that yield a constant penalty function, the solutions are polyharmonic splines. Moreover, we obtain results for different activations and show that the training trajectories are captured by trajectories of smoothing splines with decreasing regularization strength. This is joint work with Hui Jin: https://arxiv.org/abs/2006.07356
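The basic setting can be reproduced in a few lines: the sketch below (my own illustration, not the paper's experiments) trains a wide shallow ReLU network on four univariate points with full-batch gradient descent; the resulting function can then be compared with spline-type interpolants of the data.

```python
# Illustrative sketch (my own, not the paper's experiments): full-batch gradient
# descent on a wide shallow ReLU network for univariate regression, with an
# NTK-style 1/sqrt(width) output scaling. The learned function interpolates the
# training data and can be compared against spline-type fits.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.array([-1.0, -0.3, 0.2, 0.8])
y_train = np.array([0.5, -0.2, 0.4, 1.0])
n = len(x_train)

width, lr, steps = 2000, 0.5, 20000
w = rng.normal(size=width)              # input weights
b = rng.normal(size=width)              # biases
a = rng.normal(size=width)              # output weights
scale = 1.0 / np.sqrt(width)            # keeps the output O(1) at initialization

def net(x):
    return scale * (np.maximum(np.outer(x, w) + b, 0) @ a)

for t in range(steps):
    pre = np.outer(x_train, w) + b                   # (n, width) pre-activations
    act = np.maximum(pre, 0)
    resid = scale * (act @ a) - y_train              # prediction error, shape (n,)
    grad_pre = scale * np.outer(resid, a) * (pre > 0) / n
    a -= lr * scale * (act.T @ resid) / n
    w -= lr * (x_train @ grad_pre)
    b -= lr * grad_pre.sum(axis=0)

print(np.round(net(x_train), 2))                     # close to y_train after training
print(np.round(net(np.linspace(-1.5, 1.5, 7)), 2))   # the learned function off the data
```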