Deep Learning Theory Kickoff Meeting

Abstracts for the talks

Michael Arbel
Gatsby Computational Neuroscience Unit, University College London
Kernel Distances for Deep Generative Models


Generative adversarial networks (GANs) achieve state-of-the-art performance for generating high-quality images.
Key to GAN performance is the critic, which learns to discriminate between real and artificially generated images. Various divergence families have been proposed for such critics, including f-divergences (the f-GAN family) and integral probability metrics (the Wasserstein and MMD GANs). In recent GAN training approaches, these critic divergence measures have been learned using gradient regularisation strategies, which have contributed significantly to their success.
In this talk, we will introduce and analyze a data-adaptive gradient penalty as a critic regularizer for the MMD GAN. We propose a method to constrain the gradient analytically and relate it to the weak continuity of a distributional loss functional. We also demonstrate experimentally that such a regularized functional improves on the existing state-of-the-art methods for unsupervised image generation on CelebA and ImageNet.
Based on joint work with Dougal Sutherland, Mikołaj Bińkowski, and Arthur Gretton.
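
For readers unfamiliar with the MMD critic mentioned above, the following is a minimal sketch of the standard unbiased estimator of the squared maximum mean discrepancy between two samples. It is not the regularised critic from the talk: the fixed Gaussian kernel, the `bandwidth` parameter, and the function names are assumptions chosen for illustration, whereas an MMD GAN learns the kernel features adversarially.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    # Pairwise k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 * bandwidth^2)).
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def mmd2_unbiased(x, y, bandwidth=1.0):
    # Unbiased estimate of MMD^2 between samples x (m, d) and y (n, d).
    m, n = len(x), len(y)
    k_xx = gaussian_kernel(x, x, bandwidth)
    k_yy = gaussian_kernel(y, y, bandwidth)
    k_xy = gaussian_kernel(x, y, bandwidth)
    # Within-sample terms exclude the diagonal (i == j) entries.
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    return term_xx + term_yy - 2.0 * k_xy.mean()

rng = np.random.default_rng(0)
# Two samples from the same distribution: the estimate should be near zero.
same = mmd2_unbiased(rng.normal(size=(200, 2)), rng.normal(size=(200, 2)))
# Two samples whose means differ: the estimate should be clearly positive.
shifted = mmd2_unbiased(
    rng.normal(size=(200, 2)), rng.normal(loc=2.0, size=(200, 2))
)
```

Because the estimator is unbiased, `same` can come out slightly negative; only `shifted` is expected to be well separated from zero.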

Nihat Ay
Max Planck Institute for Mathematics in the Sciences
On the Natural Gradient for Deep Learning
The natural gradient method is one of the most prominent information-geometric methods within the field of machine learning.
It was proposed by Amari in 1998 and uses the Fisher-Rao metric as the Riemannian metric for the definition of a gradient within optimisation tasks. Since then, it has proved to be extremely efficient in the context of neural networks, reinforcement learning, and robotics. In recent years, attempts have been made to apply the natural gradient method to the training of deep neural networks. However, due to the huge number of parameters of such networks, the method is currently not directly applicable in this context. In my presentation, I outline ways to simplify the natural gradient for deep learning. The corresponding simplifications are related to the locality of learning associated with the underlying network structure.
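
As a reminder of the object under discussion, the natural gradient update in its standard textbook form can be written as follows. The symbols are assumptions introduced here, not notation from the talk: \theta denotes the parameters, L the loss, \eta the step size, and p_\theta the statistical model.

```latex
\theta_{t+1} = \theta_t - \eta \, F(\theta_t)^{-1} \nabla_\theta L(\theta_t),
\qquad
F(\theta) = \mathbb{E}_{x \sim p_\theta}\!\left[
  \nabla_\theta \log p_\theta(x) \, \nabla_\theta \log p_\theta(x)^{\top}
\right]
```

For a network with d parameters, F(\theta) is a d x d matrix, so forming and inverting it directly becomes impractical when d is large, which is the obstacle the abstract refers to.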

Pradeep Banerjee
Max Planck Institute for Mathematics in the Sciences
The Blackwell Information Bottleneck


I will talk about a new bottleneck method for learning data representations based on channel deficiency, rather than the more traditional information sufficiency. A variational upper bound allows us to implement this method efficiently. The bound itself is bounded above by the variational information bottleneck objective, and the two methods coincide in the regime of single-shot Monte Carlo approximations. The notion of deficiency provides a principled way of approximating complicated channels by relatively simpler ones. Deficiencies have a rich heritage in the theory of comparison of statistical experiments and have an operational interpretation in terms of the optimal risk gap of decision problems. Experiments demonstrate that the deficiency bottleneck can provide advantages in terms of minimal sufficiency as measured by information bottleneck curves, while retaining a good test performance in a classification task. I will also talk about an unsupervised generalization and relation to variational autoencoders. Finally, I discuss the utility of our method in estimating a quantity called the unique information which quantifies a deviation from the Blackwell order.

(Joint work with Guido Montúfar, Departments of Mathematics and Statistics, UCLA)

Eliana Duarte
Max Planck Institute for Mathematics in the Sciences
Discrete Statistical Models with Rational Maximum Likelihood Estimator


A discrete statistical model is a subset of a probability simplex. Its maximum likelihood estimator (MLE) is a retraction from that simplex onto the model. We characterize all models for which this retraction is a rational function. This is a contribution via real algebraic geometry which rests on results due to Huh and Kapranov on Horn uniformization. We present an algorithm for constructing models with rational MLE, and we demonstrate it on a range of instances. Our focus lies on models like Bayesian networks, decomposable graphical models, and staged trees.

Yonatan Dukler
UCLA, Department of Mathematics
Wasserstein of Wasserstein Loss for Learning Generative Models


In this talk we investigate the use of the Wasserstein ground metric in generative models. The Wasserstein distance serves as a loss function for unsupervised learning which depends on the choice of a ground metric on sample space. We propose to use a Wasserstein distance as the ground metric on the sample space of images. This ground metric is known as an effective distance for image retrieval, since it correlates with human perception.

We derive the Wasserstein ground metric on image space and define a Riemannian Wasserstein gradient penalty to be used in the Wasserstein Generative Adversarial Network (WGAN) framework. The new gradient penalty is computed efficiently via convolutions on the L^2 (Euclidean) gradients with negligible additional computational cost. The new formulation is more robust to the natural variability of images and provides for a more continuous discriminator in sample space.

Tim Genewein
DeepMind London
Neural Network Compression - model-capacity and parameter redundancy of neural networks


Modern deep neural networks were recently shown to have surprisingly high capacity for memorization of random labels. On the other hand, it is well known in the field of neural network compression that networks trained on classification tasks with non-random labels often have significant parameter redundancy, which can be effectively "compressed". Understanding this discrepancy from a theoretical viewpoint is an important open question. The aim of this talk is to introduce some modern neural network compression methods, in particular Bayesian approaches to neural network compression. The latter have some interesting theoretical properties which are also observed in practice, for instance effective capacity regularization during training, which removes the potential to fit large sets of randomly labelled data points.

Frederik Künstner
École Polytechnique Fédérale de Lausanne
Limitations of the Empirical Fisher Approximation


Natural gradient descent, which preconditions a gradient descent update with the Fisher
information matrix of the underlying statistical model, has recently received attention
as a way to capture partial second-order information. Several works have advocated an
approximation known as the empirical Fisher, drawing connections between approximate
second-order methods and heuristics like Adam. We caution against this argument by
discussing the limitations of the empirical Fisher, showing that—unlike the Fisher—
it does not generally capture second-order information. We further argue that the
conditions under which the empirical Fisher approaches the Fisher (and the Hessian)
are unlikely to be met in practice, and that the pathologies of the empirical Fisher can
have undesirable effects. This leaves open the question as to why methods based on the
empirical Fisher have been shown to outperform gradient descent in some settings. As a
step towards understanding this effect, we show that methods based on the empirical
Fisher can be interpreted as a way to adapt the descent direction to the variance of the gradients.
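
The distinction drawn in this abstract can be illustrated on logistic regression, where the Fisher coincides with the Hessian of the negative log-likelihood while the empirical Fisher generally does not. The sketch below is an illustration under assumptions, not code from the talk: the model, the synthetic data, and all function names are chosen here for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_gradient(theta, x, y):
    # Gradient of the mean negative log-likelihood of logistic regression.
    return x.T @ (sigmoid(x @ theta) - y) / len(x)

def fisher_matrix(theta, x):
    # Fisher: expectation over labels drawn from the MODEL, which for
    # logistic regression gives the weights p * (1 - p).
    p = sigmoid(x @ theta)
    return (x * (p * (1.0 - p))[:, None]).T @ x / len(x)

def empirical_fisher_matrix(theta, x, y):
    # Empirical Fisher: outer products of per-example gradients evaluated
    # at the OBSERVED labels, giving the weights (p - y)^2.
    per_example_grads = x * (sigmoid(x @ theta) - y)[:, None]
    return per_example_grads.T @ per_example_grads / len(x)

def numerical_hessian(theta, x, y, eps=1e-5):
    # Central finite differences of the gradient, one coordinate at a time.
    columns = []
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        diff = mean_gradient(theta + step, x, y) - mean_gradient(theta - step, x, y)
        columns.append(diff / (2.0 * eps))
    return np.stack(columns, axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 3))
# Labels come from a different parameter than the evaluation point, so the
# model is far from fitting the data at theta.
y = (rng.random(500) < sigmoid(x @ np.array([-1.0, 1.0, 0.0]))).astype(float)
theta = np.array([1.0, -2.0, 0.5])

fisher = fisher_matrix(theta, x)
emp_fisher = empirical_fisher_matrix(theta, x, y)
hessian = numerical_hessian(theta, x, y)
```

The evaluation point is deliberately non-zero: at `theta = 0` every prediction equals 0.5, so `(p - y)**2` and `p * (1 - p)` are both 0.25 and the two matrices coincide by accident.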

Wuchen Li
UCLA, Department of Mathematics
Wasserstein Information Geometry


Optimal transport (the Wasserstein metric) nowadays plays an important role in data science. In this talk, we briefly review its development and applications in machine learning. In particular, we will focus on its induced differential structure. We will introduce the Wasserstein natural gradient in parametric models. The metric tensor in probability density space is pulled back to the one on parameter space. We derive the Wasserstein gradient flows and proximal operator in parameter space. We demonstrate that the Wasserstein natural gradient works efficiently in several statistical machine learning problems, including Boltzmann machines, generative adversarial networks (GANs) and variational Bayesian statistics.

Luigi Malagò
Romanian Institute of Science and Technology - RIST, Cluj-Napoca
On the Information Geometry of Word Embeddings
Word embeddings are a set of techniques commonly used in natural language processing to map the words of a dictionary to a real vector space. Such a mapping is commonly learned from the contexts of the words in a text corpus, by estimating a set of conditional probability distributions (of a context word given the central word) for each word of the dictionary. These conditional probability distributions form a Riemannian statistical manifold, where word analogies can be computed through the comparison between vectors in the tangent bundle of the manifold. In this presentation we introduce a geometric framework for the study of word embeddings in the general setting of Information Geometry, and we show how the choice of the geometry allows one to define different expressions for word similarities and word analogies. The presentation is based on joint work with Riccardo Volpi.

Grégoire Montavon
Machine Learning, Technische Universität Berlin
Explaining the Decisions of Deep Neural Networks


ML models such as deep neural networks (DNNs) are capable of producing complex real-world predictions. In order to gain insight into the workings of the model and verify that the model is not overfitting the data, it is often desirable to explain its predictions. For linear and mildly nonlinear models, simple techniques based on Taylor expansions can be used; for highly nonlinear DNN models, however, the task of explanation becomes more difficult.

In this talk, we first discuss some motivations for explaining predictions, and specific challenges in producing them. We then introduce the LRP technique which explains by reverse-propagating the prediction in the network through a set of engineered propagation rules. The reverse propagation procedure can be interpreted as a ‘deep Taylor decomposition’ where the explanation is the outcome of a sequence of Taylor expansions performed at each layer of the DNN model.
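
The reverse-propagation idea can be sketched for a small ReLU network. This is only an illustration under assumptions: it uses the commonly cited LRP-epsilon rule as one possible propagation rule (the abstract does not say which rules the talk covers), it assumes a bias-free network with a single output, and the function and variable names are hypothetical.

```python
import numpy as np

def lrp_epsilon(weights, x, eps=1e-9):
    # Forward pass through a bias-free network: ReLU on hidden layers,
    # linear output layer. All layer inputs are stored for the backward pass.
    activations = [x]
    for i, w in enumerate(weights):
        z = activations[-1] @ w
        is_last = i == len(weights) - 1
        activations.append(z if is_last else np.maximum(z, 0.0))

    # Start from the network output and redistribute it layer by layer.
    relevance = activations[-1].copy()
    for w, a in zip(reversed(weights), reversed(activations[:-1])):
        z = a @ w
        # Stabiliser: push z away from zero while keeping its sign.
        z = z + eps * np.where(z >= 0, 1.0, -1.0)
        s = relevance / z
        # Input j receives a_j * sum_k w_jk * R_k / z_k.
        relevance = a * (w @ s)
    return activations[-1], relevance

rng = np.random.default_rng(0)
# Hypothetical layer sizes: 4 inputs, hidden layers of 5 and 3 units, 1 output.
weights = [
    rng.normal(size=(4, 5)),
    rng.normal(size=(5, 3)),
    rng.normal(size=(3, 1)),
]
x = rng.normal(size=4)
output, relevance = lrp_epsilon(weights, x)
```

With no biases and a small `eps`, the relevance scores on the inputs sum (up to the stabiliser) to the network output, which is the conservation property usually associated with this family of rules.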

Guido Montúfar
Max Planck Institute for Mathematics in the Sciences
Introduction to Deep Learning Theory


Razvan Pascanu
DeepMind London
Looking at data efficiency in RL


Deep Reinforcement Learning (DRL), while providing some impressive results (e.g. on Atari, Go, etc.), is notoriously data inefficient. This is partially due to the function approximators used (deep networks) but also due to the weak learning signal (based on observing rewards).
This talk will focus on the potential role transfer learning could play in DRL for improving data efficiency. In particular, the core of the talk will be centered around the different uses of the KL-regularized RL formulation explored in recent works.
Time permitting, I will extend the discussion to some work-in-progress observations about the learning dynamics of neural networks (particularly in RL) and how to exploit the piecewise structure of the neural network (particularly the folding of the space) to efficiently learn generative models.

Johannes Rauh
Max Planck Institute for Mathematics in the Sciences
Synergy, redundancy and unique information


New information measures are needed to analyze how information is distributed over a complex system (such as a deep neural network). In 2010, Williams and Beer presented the idea of a general information decomposition framework to organize such measures. So far, the framework is missing a generally accepted realization. The talk discusses the current status of the Williams and Beer program.

Nico Scherf
Max Planck Institute for Human Cognitive and Brain Sciences
On Open Problems for Deep Learning in Biomedical Image Analysis


Deep Learning has thoroughly transformed the field of computer vision within the past years. Many standard problems such as image restoration, segmentation or registration, that were based on quite different modelling and optimisation approaches (e.g. PDEs, Markov Random Fields, Random Forests, ...), can now be solved within the framework of Deep Neural Networks with astonishing accuracy and speed (at prediction time). One important advantage of Deep Learning is its ability to capture the often complex statistical dependencies in image data and leverage this information for improving prediction, regression, or classification, given enough annotated data.

However, in the biomedical domain, one major limitation is the scarcity of suitably annotated data, which rules out many solutions from the computer vision domain. Here, approaches such as manifold learning, generative models, or using deep networks as structural priors are promising directions for weakly supervised or unsupervised learning in biomedical imaging. Another important aspect, in particular for medical image analysis, is the interpretability (or the lack thereof) of the fitted model.

In this talk I am going to present a selection of problems in biomedical image analysis that would greatly benefit from Deep Learning approaches but lack the typically required amount of annotated data. I will focus on examples from high-resolution in-vivo MRI of brain structure, microscopic analysis of the anatomical microstructure of the human cortex, and large-scale live microscopy for stem cell biology and developmental biology.

Ingo Steinwart
Universität Stuttgart
A Sober Look at Neural Network Initializations


Initializing the weights and the biases is a key
part of the training process of a neural network.
Unlike the subsequent optimization phase, however,
the initialization phase has gained only limited
attention in the literature.
In the first part of the talk, I will discuss some
consequences of commonly used initialization strategies
for vanilla DNNs with ReLU activations. Based on
these insights I will then introduce an alternative
initialization strategy, and finally I will present
some large scale experiments assessing the quality
of the new initialization strategy.

Maurice Weiler
Machine Learning Lab, University of Amsterdam
Gauge Equivariant Convolutional Networks


The idea of equivariance to symmetry transformations provides one of the first theoretically grounded principles for neural network architecture design. Equivariant networks have shown excellent performance and data efficiency on vision and medical imaging problems that exhibit symmetries. We extend this principle beyond global symmetries to local gauge transformations, thereby enabling the development of equivariant convolutional networks on general manifolds. We show that gauge equivariant convolutional networks give a unified description of equivariant and geometric deep learning by deriving a wide range of models as special cases of our theory. To illustrate our theory on a simple example and highlight the interplay between local and global symmetries we discuss an implementation for signals defined on the icosahedron, which provides a reasonable approximation of spherical signals. We evaluate the Icosahedral CNN on omnidirectional image segmentation and climate pattern segmentation, and find that it outperforms previous methods.


Date and Location

March 27 - 29, 2019
Max Planck Institute for Mathematics in the Sciences
Inselstr. 22
04103 Leipzig

Scientific Organizers

Guido Montúfar
MPI for Mathematics in the Sciences

Administrative Contact

Valeria Hünniger
MPI für Mathematik in den Naturwissenschaften