# Deep Learning Theory Kickoff Meeting

## Abstracts for the talks

**Michael Arbel** *Gatsby Computational Neuroscience Unit, University College London* **Kernel Distances for Deep Generative Models**

Generative adversarial networks (GANs) achieve state-of-the-art performance in generating high-quality images.

Key to GAN performance is the critic, which learns to discriminate between real and artificially generated images. Various divergence families have been proposed for such critics, including f-divergences (the f-GAN family) and integral probability metrics (the Wasserstein and MMD GANs). In recent GAN training approaches, these critic divergence measures have been learned using gradient regularisation strategies, which have contributed significantly to their success.

In this talk, we will introduce and analyze a data-adaptive gradient penalty as a critic regularizer for the MMD GAN. We propose a method to constrain the gradient analytically and relate it to the weak continuity of a distributional loss functional. We also demonstrate experimentally that such a regularized functional improves on the existing state-of-the-art methods for unsupervised image generation on CelebA and ImageNet.

Based on joint work with Dougal Sutherland, Mikołaj Bińkowski, and Arthur Gretton.
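As a rough illustration of the kernel distance behind MMD GANs, the sketch below computes a biased estimate of the squared maximum mean discrepancy (MMD) between two samples. It assumes a fixed Gaussian kernel with a hand-picked bandwidth and one-dimensional toy data; the learned critic features and the gradient penalty discussed in the talk are not modelled, and all function names are hypothetical.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two points given as sequences of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd2_biased(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Toy data (an assumption for illustration): "real" samples and two "generators".
rng = random.Random(0)
real = [(rng.gauss(0.0, 1.0),) for _ in range(100)]
fake_close = [(rng.gauss(0.0, 1.0),) for _ in range(100)]  # same distribution
fake_far = [(rng.gauss(3.0, 1.0),) for _ in range(100)]    # shifted distribution

mmd_close = mmd2_biased(real, fake_close)
mmd_far = mmd2_biased(real, fake_far)
print(f"MMD^2 (same distribution):    {mmd_close:.4f}")
print(f"MMD^2 (shifted distribution): {mmd_far:.4f}")
```

The estimate is close to zero when both samples come from the same distribution and grows as the generated samples drift away, which is the signal a generator would be trained to reduce.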

**Nihat Ay** *Max Planck Institute for Mathematics in the Sciences* **On the Natural Gradient for Deep Learning**

The natural gradient method is one of the most prominent information-geometric methods within the field of machine learning.

It was proposed by Amari in 1998 and uses the Fisher-Rao metric as the Riemannian metric for defining a gradient in optimisation tasks. Since then, it has proved to be extremely efficient in the context of neural networks, reinforcement learning, and robotics. In recent years, attempts have been made to apply the natural gradient method to the training of deep neural networks. However, due to the huge number of parameters of such networks, the method is currently not directly applicable in this context. In my presentation, I outline ways to simplify the natural gradient for deep learning. The corresponding simplifications are related to the locality of learning associated with the underlying network structure.

**Pradeep Banerjee** *Max Planck Institute for Mathematics in the Sciences* **The Blackwell Information Bottleneck**

I will talk about a new bottleneck method for learning data representations based on channel deficiency, rather than the more traditional information sufficiency. A variational upper bound allows us to implement this method efficiently. The bound itself is bounded above by the variational information bottleneck objective, and the two methods coincide in the regime of single-shot Monte Carlo approximations. The notion of deficiency provides a principled way of approximating complicated channels by relatively simpler ones. Deficiencies have a rich heritage in the theory of comparison of statistical experiments and have an operational interpretation in terms of the optimal risk gap of decision problems. Experiments demonstrate that the deficiency bottleneck can provide advantages in terms of minimal sufficiency as measured by information bottleneck curves, while retaining a good test performance in a classification task. I will also talk about an unsupervised generalization and relation to variational autoencoders. Finally, I discuss the utility of our method in estimating a quantity called the unique information which quantifies a deviation from the Blackwell order.

(Joint work with Guido Montúfar, Departments of Mathematics and Statistics, UCLA)

**Eliana Duarte** *Max Planck Institute for Mathematics in the Sciences* **Discrete Statistical Models with Rational Maximum Likelihood Estimator**

A discrete statistical model is a subset of a probability simplex. Its maximum likelihood estimator (MLE) is a retraction from that simplex onto the model. We characterize all models for which this retraction is a rational function. This is a contribution via real algebraic geometry which rests on results due to Huh and Kapranov on Horn uniformization. We present an algorithm for constructing models with rational MLE, and we demonstrate it on a range of instances. Our focus lies on models like Bayesian networks, decomposable graphical models, and staged trees.
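A standard example of a model with rational MLE is the independence model for a two-way contingency table, a simple decomposable graphical model. The sketch below is not the construction algorithm from the talk; it only shows that for this model the MLE is a rational function of the counts (the product of the row and column totals divided by the squared sample size) and that it acts as a retraction onto the model. The function name is hypothetical.

```python
from fractions import Fraction


def independence_mle(counts):
    """MLE of the joint distribution under the independence model.

    `counts` is a contingency table (list of rows of non-negative integers).
    The estimate for cell (i, j) is row_total_i * col_total_j / n^2.
    """
    n = sum(sum(row) for row in counts)
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    return [[Fraction(r * c, n * n) for c in col_totals] for r in row_totals]


# Generic data: the MLE differs from the empirical distribution.
counts = [[10, 20], [30, 40]]
p_hat = independence_mle(counts)
print(p_hat)

# Data already on the model (rank-one table): the MLE returns the empirical
# distribution unchanged, as expected from a retraction onto the model.
on_model = [[1, 2], [2, 4]]
p_on_model = independence_mle(on_model)
empirical = [[Fraction(c, 9) for c in row] for row in on_model]
print(p_on_model == empirical)
```

Exact rational arithmetic via `fractions.Fraction` makes the rationality of the estimator visible: every entry is a ratio of polynomials in the counts.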

**Yonatan Dukler** *UCLA, Department of Mathematics* **Wasserstein of Wasserstein Loss for Learning Generative Models**

In this talk we investigate the use of the Wasserstein ground metric in generative models. The Wasserstein distance serves as a loss function for unsupervised learning which depends on the choice of a ground metric on sample space. We propose to use a Wasserstein distance as the ground metric on the sample space of images. This ground metric is known as an effective distance for image retrieval, since it correlates with human perception.

We derive the Wasserstein ground metric on image space and define a Riemannian Wasserstein gradient penalty to be used in the Wasserstein Generative Adversarial Network (WGAN) framework. The new gradient penalty is computed efficiently via convolutions on the L^2 (Euclidean) gradients with negligible additional computational cost. The new formulation is more robust to the natural variability of images and provides for a more continuous discriminator in sample space.
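To see why a Wasserstein ground metric can be more robust to natural image variability than the Euclidean one, consider the one-dimensional toy sketch below. It is an assumption-laden simplification: "images" are normalized histograms on a unit-spaced 1D grid, and the Wasserstein-1 distance is computed from cumulative sums. The talk's construction on 2D image space and its gradient penalty are not reproduced here, and the helper names are hypothetical.

```python
from itertools import accumulate


def wasserstein1_1d(p, q):
    """W1 between two normalized histograms on the same unit-spaced 1D grid.

    In one dimension, W1 equals the sum of absolute differences of the CDFs.
    """
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


def euclidean(p, q):
    """Plain L^2 (Euclidean) distance between the two histograms."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def bump(position, length=16):
    """A toy 1D 'image' with all of its mass on a single pixel."""
    return [1.0 if i == position else 0.0 for i in range(length)]


base, near, far = bump(4), bump(5), bump(12)

print("W1 near:", wasserstein1_1d(base, near), " W1 far:", wasserstein1_1d(base, far))
print("L2 near:", euclidean(base, near), " L2 far:", euclidean(base, far))
```

The Euclidean distance cannot tell a one-pixel shift from an eight-pixel shift, whereas the Wasserstein distance grows with the size of the shift.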

**Tim Genewein** *DeepMind London* **Neural Network Compression - model-capacity and parameter redundancy of neural networks**

Modern deep neural networks were recently shown to have surprisingly high capacity for memorization of random labels. On the other hand it is well known in the field of neural network compression that networks trained on classification tasks with non-random labels often have significant parameter redundancy which can be effectively "compressed". Understanding this discrepancy from a theoretical viewpoint is an important open question. The aim of this talk is to introduce some modern neural network compression methods, in particular Bayesian approaches to neural network compression. The latter have some interesting theoretical properties which are also observed in practice - for instance effective capacity regularization during training, thus effectively removing the potential to fit large sets of randomly labelled data points.

**Frederik Künstner** *École Polytechnique Fédérale de Lausanne* **Limitations of the Empirical Fisher Approximation**

Natural gradient descent, which preconditions a gradient descent update with the Fisher information matrix of the underlying statistical model, has recently received attention as a way to capture partial second-order information. Several works have advocated an approximation known as the empirical Fisher, drawing connections between approximate second-order methods and heuristics like Adam. We caution against this argument by discussing the limitations of the empirical Fisher, showing that, unlike the Fisher, it does not generally capture second-order information. We further argue that the conditions under which the empirical Fisher approaches the Fisher (and the Hessian) are unlikely to be met in practice, and that the pathologies of the empirical Fisher can have undesirable effects. This leaves open the question as to why methods based on the empirical Fisher have been shown to outperform gradient descent in some settings. As a step towards understanding this effect, we show that methods based on the empirical Fisher can be interpreted as a way to adapt the descent direction to the variance of the gradients.
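The difference between the two preconditioners can be sketched on a toy problem. The example below assumes a one-parameter linear regression model with unit-variance Gaussian noise, so that the Fisher and the empirical Fisher are scalars; the data, starting point, and function names are illustrative choices and not taken from the talk.

```python
import random

# Toy data (assumption): y = 2x + small noise, model p(y | x, theta) = N(theta * x, 1).
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(500)]
ys = [2.0 * x + rng.gauss(0.0, 0.1) for x in xs]
n = len(xs)


def gradient(theta):
    """Gradient of the average negative log-likelihood (half squared error)."""
    return sum((theta * x - y) * x for x, y in zip(xs, ys)) / n


def fisher():
    """Fisher information: expectation over the model's own y of the squared
    per-example gradient. For this model it equals mean(x^2), independent of
    theta, and coincides with the Hessian of the loss."""
    return sum(x * x for x in xs) / n


def empirical_fisher(theta):
    """Empirical Fisher: mean of squared per-example gradients at the observed y."""
    return sum(((theta * x - y) * x) ** 2 for x, y in zip(xs, ys)) / n


theta0 = 0.0
theta_ls = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# One preconditioned step from theta0 with each matrix (step size 1).
theta_fisher = theta0 - gradient(theta0) / fisher()
theta_ef = theta0 - gradient(theta0) / empirical_fisher(theta0)

print(f"least-squares solution: {theta_ls:.4f}")
print(f"after Fisher step:      {theta_fisher:.4f}")
print(f"after emp. Fisher step: {theta_ef:.4f}")
```

In this setting the Fisher step lands on the least-squares solution in a single update, because the Fisher equals the curvature of the loss. The empirical Fisher instead scales the step by the size of the residuals, so the update is far from the minimizer when the starting point fits the data poorly.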

**Wuchen Li** *UCLA, Department of Mathematics* **Wasserstein Information Geometry**

Optimal transport (the Wasserstein metric) nowadays plays an important role in data science. In this talk, we briefly review its development and applications in machine learning. In particular, we will focus on its induced differential structure. We will introduce the Wasserstein natural gradient in parametric models. The metric tensor in probability density space is pulled back to the one on parameter space. We derive the Wasserstein gradient flows and proximal operator in parameter space. We demonstrate that the Wasserstein natural gradient works efficiently in several statistical machine learning problems, including Boltzmann machines, generative adversarial networks (GANs), and variational Bayesian statistics.

**Luigi Malagò** *Romanian Institute of Science and Technology - RIST, Cluj-Napoca* **On the Information Geometry of Word Embeddings**

Word embeddings are a set of techniques commonly used in natural language processing to map the words of a dictionary to a real vector space. Such a mapping is commonly learned from the contexts of the words in a text corpus, by estimating a set of conditional probability distributions (of a context word given the central word) for each word of the dictionary. These conditional probability distributions form a Riemannian statistical manifold, where word analogies can be computed through the comparison between vectors in the tangent bundle of the manifold. In this presentation we introduce a geometric framework for the study of word embeddings in the general setting of Information Geometry, and we show how the choice of the geometry allows one to define different expressions for word similarities and word analogies. The presentation is based on joint work with Riccardo Volpi.

**Grégoire Montavon** *Machine Learning, Technische Universität Berlin* **Explaining the Decisions of Deep Neural Networks**

ML models such as deep neural networks (DNNs) are capable of producing complex real-world predictions. In order to get insight into the workings of the model and verify that the model is not overfitting the data, it is often desirable to explain its predictions. For linear and mildly nonlinear models, simple techniques based on Taylor expansions can be used. For highly nonlinear DNN models, however, the task of explanation becomes more difficult.

In this talk, we first discuss some motivations for explaining predictions, and specific challenges in producing them. We then introduce the layer-wise relevance propagation (LRP) technique, which explains a prediction by propagating it backwards through the network using a set of engineered propagation rules. The reverse propagation procedure can be interpreted as a ‘deep Taylor decomposition’ where the explanation is the outcome of a sequence of Taylor expansions performed at each layer of the DNN model.
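The reverse propagation can be sketched on a tiny network. The example below assumes a two-layer ReLU network without biases and uses the basic epsilon-stabilized LRP rule, which is one commonly used propagation rule and not necessarily the set of rules presented in the talk. The weights, input, and function names are made up for illustration.

```python
def linear(weights, inputs):
    """Apply a bias-free linear layer; weights[k][j] connects input j to output k."""
    return [sum(w * a for w, a in zip(row, inputs)) for row in weights]


def relu(values):
    return [max(0.0, v) for v in values]


def lrp_linear(weights, inputs, relevance_out, eps=1e-9):
    """Redistribute relevance from a layer's outputs to its inputs.

    Epsilon rule: R_j = sum_k a_j * w_kj / (z_k + eps * sign(z_k)) * R_k,
    where z_k is the pre-activation of output k.
    """
    z = linear(weights, inputs)
    relevance_in = [0.0] * len(inputs)
    for k, row in enumerate(weights):
        denom = z[k] + (eps if z[k] >= 0 else -eps)
        for j, w in enumerate(row):
            relevance_in[j] += inputs[j] * w / denom * relevance_out[k]
    return relevance_in


# Hypothetical toy network: 3 inputs -> 2 hidden ReLU units -> 1 output.
w1 = [[1.0, 0.5, -0.5], [-1.0, 1.0, 2.0]]
w2 = [[2.0, 1.0]]
x = [1.0, 2.0, 1.0]

hidden = relu(linear(w1, x))
output = linear(w2, hidden)

# Start with the prediction itself as relevance and propagate it backwards.
relevance_hidden = lrp_linear(w2, hidden, output)
relevance_input = lrp_linear(w1, x, relevance_hidden)

print("prediction:     ", output[0])
print("input relevance:", relevance_input)
```

With zero biases the rule is conservative up to the stabilizer: the input relevances sum to the prediction, so each input's score can be read as its share of the output.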

**Guido Montúfar** *Max Planck Institute for Mathematics in the Sciences* **Introduction Deep Learning Theory**

**Razvan Pascanu** *DeepMind London* **Looking at data efficiency in RL**

Deep Reinforcement Learning (DRL), while providing some impressive results (e.g. on Atari, Go, etc.), is notoriously data inefficient. This is partially due to the function approximators used (deep networks) but also due to the weak learning signal (based on observing rewards).

This talk will focus on the potential role transfer learning could play in DRL for improving data efficiency. In particular the core of the talk will be centered around the different uses of the KL-regularized RL formulation explored in recent works (e.g. arxiv.org/abs/1707.04175, arxiv.org/abs/1806.01780, openreview.net/forum.

Time permitting, I will extend the discussion to some work-in-progress observations about the learning dynamics of neural networks (particularly in RL) and how to exploit the piecewise structure of the neural network (particularly the folding of the space) to efficiently learn generative models.

**Johannes Rauh** *Max Planck Institute for Mathematics in the Sciences* **Synergy, redundancy and unique information**

New information measures are needed to analyze how information is distributed over a complex system (such as a deep neural network). In 2010, Williams and Beer presented the idea of a general information decomposition framework to organize such measures. So far, the framework is missing a generally accepted realization. The talk discusses the current status of the Williams and Beer program.

**Nico Scherf** *Max Planck Institute for Human Cognitive and Brain Sciences* **On Open Problems for Deep Learning in Biomedical Image Analysis**

Deep Learning has thoroughly transformed the field of computer vision within the past years. Many standard problems such as image restoration, segmentation or registration, that were based on quite different modelling and optimisation approaches (e.g. PDEs, Markov Random Fields, Random Forests, ...), can now be solved within the framework of Deep Neural Networks with astonishing accuracy and speed (at prediction time). One important advantage of Deep Learning is its ability to capture the often complex statistical dependencies in image data and leverage this information for improving prediction, regression, or classification, given enough annotated data.

However, in the biomedical domain, one major limitation is the scarceness of suitably annotated data that rules out a lot of solutions from the computer vision domain. Here, approaches such as manifold learning, generative models, or using deep networks as structural priors are promising directions for weakly supervised or unsupervised learning in biomedical imaging. Another important aspect, in particular for medical image analysis is the interpretability (or the lack thereof) of the fitted model.

In this talk I am going to present a selection of problems in biomedical image analysis that would greatly benefit from Deep Learning approaches but lack the typically required amount of annotated data. I will focus on examples from high-resolution in-vivo MRI of brain structure, microscopic analysis of anatomical microstructure of the human cortex, and large-scale live microscopy for stem cell biology and developmental biology.

**Ingo Steinwart** *Universität Stuttgart* **A Sober Look at Neural Network Initializations**

Initializing the weights and the biases is a key part of the training process of a neural network. Unlike the subsequent optimization phase, however, the initialization phase has gained only limited attention in the literature.

In the first part of the talk, I will discuss some consequences of commonly used initialization strategies for vanilla DNNs with ReLU activations. Based on these insights I will then introduce an alternative initialization strategy, and finally I will present some large scale experiments assessing the quality of the new initialization strategy.

**Maurice Weiler** *Machine Learning Lab, University of Amsterdam* **Gauge Equivariant Convolutional Networks**

The idea of equivariance to symmetry transformations provides one of the first theoretically grounded principles for neural network architecture design. Equivariant networks have shown excellent performance and data efficiency on vision and medical imaging problems that exhibit symmetries. We extend this principle beyond global symmetries to local gauge transformations, thereby enabling the development of equivariant convolutional networks on general manifolds. We show that gauge equivariant convolutional networks give a unified description of equivariant and geometric deep learning by deriving a wide range of models as special cases of our theory. To illustrate our theory on a simple example and highlight the interplay between local and global symmetries we discuss an implementation for signals defined on the icosahedron, which provides a reasonable approximation of spherical signals. We evaluate the Icosahedral CNN on omnidirectional image segmentation and climate pattern segmentation, and find that it outperforms previous methods.

## Date and Location

**March 27 - 29, 2019**

Max Planck Institute for Mathematics in the Sciences

Inselstr. 22

04103 Leipzig

## Scientific Organizers

**Guido Montúfar**

MPI for Mathematics in the Sciences

## Administrative Contact

**Valeria Hünniger**

MPI für Mathematik in den Naturwissenschaften