This meeting aims to discuss mathematical topics in machine learning and deep learning and to kick off the ERC project Deep Learning Theory at MPI MIS.

Deep Learning Theory: Geometric Analysis of Capacity, Optimization, and Generalization for Improving Learning in Deep Neural Networks.

Deep Learning is one of the most vibrant areas of contemporary machine learning and one of the most promising approaches to Artificial Intelligence. Deep Learning drives the latest systems for image, text, and audio processing, as well as an increasing number of new technologies. The goal of this project is to advance on key open problems in Deep Learning, specifically regarding the capacity, optimization, and regularization of these algorithms. The idea is to consolidate a theoretical basis that allows us to pin down the inner workings of the present success of Deep Learning and make it more widely applicable, in particular in situations with limited data and challenging problems in reinforcement learning. The approach is based on the geometry of neural networks and exploits innovative mathematics, drawing on information geometry and algebraic statistics. This is a quite timely and unique proposal which holds promise to vastly streamline the progress of Deep Learning into new frontiers.

The natural gradient method is one of the most prominent information-geometric methods within the field of machine learning.
It was proposed by Amari in 1998 and uses the Fisher-Rao metric as the Riemannian metric for defining a gradient in optimisation tasks. Since then it has proved to be extremely efficient in the context of neural networks, reinforcement learning, and robotics. In recent years, attempts have been made to apply the natural gradient method to training deep neural networks. However, due to the huge number of parameters of such networks, the method is currently not directly applicable in this context. In my presentation, I outline ways to simplify the natural gradient for deep learning. The corresponding simplifications are related to the locality of learning associated with the underlying network structure.
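As a rough illustration of the basic method (not of the simplifications discussed in the talk), the following sketch applies a damped natural gradient step to a three-class softmax model. The toy model, the target distribution `q`, the learning rate, and the damping constant are all assumptions chosen for illustration.

```python
import numpy as np

def softmax(theta):
    # Numerically stable softmax over a vector of logits.
    z = theta - theta.max()
    e = np.exp(z)
    return e / e.sum()

def fisher_softmax(theta):
    # Fisher information of a categorical model in logit coordinates:
    # F = diag(p) - p p^T (singular along the all-ones direction).
    p = softmax(theta)
    return np.diag(p) - np.outer(p, p)

def cross_entropy(theta, q):
    return -np.sum(q * np.log(softmax(theta)))

def natural_gradient_step(theta, q, lr=0.5, damping=1e-3):
    # Euclidean gradient of the cross-entropy with respect to the logits.
    grad = softmax(theta) - q
    # Damping (an assumption of this sketch) makes the singular Fisher invertible.
    F = fisher_softmax(theta) + damping * np.eye(theta.size)
    return theta - lr * np.linalg.solve(F, grad)

q = np.array([0.7, 0.2, 0.1])   # hypothetical target distribution
theta = np.zeros(3)             # start from the uniform distribution
losses = [cross_entropy(theta, q)]
for _ in range(20):
    theta = natural_gradient_step(theta, q)
    losses.append(cross_entropy(theta, q))

print(softmax(theta), losses[0], losses[-1])
```

Forming and solving with the full Fisher matrix costs on the order of the cube of the parameter count, which is why this direct approach does not scale to deep networks.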

Modern deep neural networks were recently shown to have surprisingly high capacity for memorization of random labels. On the other hand it is well known in the field of neural network compression that networks trained on classification tasks with non-random labels often have significant parameter redundancy which can be effectively "compressed". Understanding this discrepancy from a theoretical viewpoint is an important open question. The aim of this talk is to introduce some modern neural network compression methods, in particular Bayesian approaches to neural network compression. The latter have some interesting theoretical properties which are also observed in practice - for instance effective capacity regularization during training, thus effectively removing the potential to fit large sets of randomly labelled data points.

Initializing the weights and the biases is a key part of the training process of a neural network. Unlike the subsequent optimization phase, however, the initialization phase has gained only limited attention in the literature.
In the first part of the talk, I will discuss some consequences of commonly used initialization strategies for vanilla DNNs with ReLU activations. Based on these insights I will then introduce an alternative initialization strategy, and finally I will present some large scale experiments assessing the quality of the new initialization strategy.

ML models such as deep neural networks (DNNs) are capable of producing complex real-world predictions. In order to gain insight into the workings of the model and verify that the model is not overfitting the data, it is often desirable to explain its predictions. For linear and mildly nonlinear models, simple techniques based on Taylor expansions can be used; for highly nonlinear DNN models, however, the task of explanation becomes more difficult.
In this talk, we first discuss some motivations for explaining predictions, and specific challenges in producing them. We then introduce the LRP technique which explains by reverse-propagating the prediction in the network through a set of engineered propagation rules. The reverse propagation procedure can be interpreted as a ‘deep Taylor decomposition’ where the explanation is the outcome of a sequence of Taylor expansions performed at each layer of the DNN model.
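For the linear case mentioned above, a first-order Taylor expansion already gives an exact explanation. The sketch below is only that simple baseline, not the LRP procedure; the weights, the input, and the choice of the origin as root point are hypothetical.

```python
import numpy as np

def taylor_relevance(w, x, x_root):
    # First-order Taylor term of f(x) = w.x + b around x_root:
    # feature i receives relevance w_i * (x_i - x_root_i).
    return w * (x - x_root)

w = np.array([2.0, -1.0, 0.5])   # hypothetical model weights
b = 0.3                          # hypothetical bias
x = np.array([1.0, 3.0, 4.0])    # input to be explained
x_root = np.zeros(3)             # assumed root point

def f(v):
    return w @ v + b

R = taylor_relevance(w, x, x_root)
# For a linear model the relevances sum exactly to f(x) - f(x_root).
print(R, R.sum(), f(x) - f(x_root))
```

For a nonlinear model the same decomposition holds only approximately and depends strongly on the root point, which is one reason deeper models call for a layer-by-layer treatment.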

Natural gradient descent, which preconditions a gradient descent update with the Fisher information matrix of the underlying statistical model, has recently received attention as a way to capture partial second-order information. Several works have advocated an approximation known as the empirical Fisher, drawing connections between approximate second-order methods and heuristics like Adam. We caution against this argument by discussing the limitations of the empirical Fisher, showing that, unlike the Fisher, it does not generally capture second-order information. We further argue that the conditions under which the empirical Fisher approaches the Fisher (and the Hessian) are unlikely to be met in practice, and that the pathologies of the empirical Fisher can have undesirable effects. This leaves open the question as to why methods based on the empirical Fisher have been shown to outperform gradient descent in some settings. As a step towards understanding this effect, we show that methods based on the empirical Fisher can be interpreted as a way to adapt the descent direction to the variance of the gradients.
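A small numerical sketch can make the gap between the two matrices concrete. It assumes a linear regression model with a unit-variance Gaussian likelihood and synthetic data; these choices, and all names below, are illustrative assumptions rather than the setting analysed in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 2000, 3
X = rng.normal(size=(N, d))
theta_true = np.array([1.0, -2.0, 0.5])
noise_std = 0.1                      # true noise level, smaller than the model's
y = X @ theta_true + noise_std * rng.normal(size=N)

def fisher():
    # Model p(y | x, theta) = N(y; x.theta, 1): the Fisher is mean(x x^T).
    # It does not depend on the labels and equals the Hessian of the loss here.
    return X.T @ X / N

def empirical_fisher(theta):
    # Mean outer product of per-example gradients of 0.5 * (x.theta - y)^2.
    residual = X @ theta - y
    G = residual[:, None] * X
    return G.T @ G / N

theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
F = fisher()
EF = empirical_fisher(theta_hat)
ratio = np.trace(EF) / np.trace(F)
print(ratio)  # roughly noise_std**2, far from 1
```

In this toy setting the empirical Fisher at the optimum is scaled by the squared residuals, so its magnitude tracks the noise in the gradients rather than the curvature of the loss.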

3264 Conics in a Second
Paul Breiding
Max Planck Institute for Mathematics in the Sciences
Joint work: Paul Breiding, Bernd Sturmfels and Sascha Timme.
In 1848 Jakob Steiner asked "How many conics are tangent to five conics?"
In 2019 we ask "Which conics are tangent to your five conics?"
The answer is at juliahomotopycontinuation.org/do-it-yourself/

Reconstructions by Variational AutoEncoders as a Defense Strategy against Adversarial Examples
Petru Hlihor
Romanian Institute of Science and Technology
Adversarial examples for classification tasks are inputs provided to a machine learning model that are specifically designed to produce a wrong classification. They are usually obtained by malicious perturbations of a sample in a dataset, which are difficult to recognize even for a human. In this poster we study the use of Variational AutoEncoders to preprocess images before classification, as a strategy to defend against adversarial examples. In our preliminary experiments we show that by reconstructing images with a Variational AutoEncoder, the accuracy of the classifier improves significantly, even against some of the most powerful attacks in the literature. As opposed to regular autoencoders, previously proposed in the literature as a defense mechanism, the presence of a stochastic layer plays a key role in the defense, which is not trivial for an attacker to circumvent.

Multiagent Deep Reinforcement Learning for Market Making
Pankaj Kumar
Copenhagen Business School
Market making is a high-frequency trading strategy in which an agent provides liquidity by simultaneously quoting a bid (buy) price and an ask (sell) price on an asset. Market makers reap profits in the form of the spread between the quoted buy and sell prices. Due to the complexity of inventory risk, trade counterparties, and information asymmetry, the understanding of market-making algorithms remains relatively unexplored by academics. A sizeable body of literature, in particular on single-agent deep reinforcement learning (DRL), has studied optimal execution and prediction markets.
The success of such single-agent DRL methods can be credited to the use of experience replay memories, which allow Deep Q-Networks (DQNs) to be trained efficiently by sampling stored state transitions. However, utmost care is required in multi-agent deep reinforcement learning (MA-DRL), as stored transitions can become obsolete when agents update their policies in parallel. Motivated by this, in this talk I will introduce a novel reformulation of the MA-DRL simulation framework for market making, which reliably supports interactions among many agents. Using a simple image-like reformulation of the multi-agent state, innovative multi-agent training, and agent ambiguity, a convolutional neural network approximating the Q-value function is used to learn distributed multi-agent policies. This approach alleviates the convergence, non-stationary training, and scalability issues encountered in the literature on multi-agent systems. Moreover, in each simulation the market maker agents successfully reproduce stylized facts found in historical trade data.

Interventional Markov Equivalence for Mixed Graph Models
Liam Solus
KTH Royal Institute of Technology
We will discuss the problem of characterizing Markov equivalence of graphical models under general interventions.
Recently, Yang et al. (2018) gave a graphical characterization of interventional Markov equivalence for DAG models that relates to the global Markov properties of DAGs. Based on this, we extend the notion of interventional Markov equivalence using global Markov properties of loopless mixed graphs and generalize their graphical characterization to ancestral graphs. On the other hand, we also extend the notion of interventional Markov equivalence via modifications of factors of distributions Markov to acyclic directed mixed graphs. We prove that these two generalizations coincide at their intersection, i.e., for directed ancestral graphs. This yields a graphical characterization of interventional Markov equivalence for causal models that incorporate latent confounders and selection variables under assumptions on the intervention targets that are reasonable for biological applications.

Learning Latent Representations for Audio Signals through Variational Autoencoders
Csongor-Huba Varady
Romanian Institute of Science and Technology
In this short paper we are interested in exploring generative models for audio signals, with a particular focus on signal reconstruction and learning explainable latent representations. Similarly to the work of Engel et al., we consider generative models with a WaveNet decoder, which output an autoregressive model conditioned on the past signal as well as on the latent representation. The main contribution of our work is the proposal of an architecture based on Variational AutoEncoders, which allows us to define an approximate posterior able to explicitly capture the time dependence of the latent encoding. Moreover, the possibility of introducing variational bounds for training the model could lead to disentangled representations for audio signals, and thus to latent encodings that are easier to interpret.

Deep Reinforcement Learning (DRL), while providing some impressive results (e.g. on Atari, Go, etc.), is notoriously data inefficient. This is partially due to the function approximators used (deep networks) but also due to the weak learning signal (based on observing rewards).
This talk will focus on the potential role transfer learning could play in DRL for improving data efficiency. In particular the core of the talk will be centered around the different uses of the KL-regularized RL formulation explored in recent works (e.g. https://arxiv.org/abs/1707.04175, https://arxiv.org/abs/1806.01780, https://openreview.net/forum?id=S1lqMn05Ym).
Time permitting, I will extend the discussion to some work-in-progress observations about the learning dynamics of neural networks (particularly in RL) and how to exploit the piecewise structure of the neural network (particularly the folding of the space) to efficiently learn generative models.

The idea of equivariance to symmetry transformations provides one of the first theoretically grounded principles for neural network architecture design. Equivariant networks have shown excellent performance and data efficiency on vision and medical imaging problems that exhibit symmetries. We extend this principle beyond global symmetries to local gauge transformations, thereby enabling the development of equivariant convolutional networks on general manifolds. We show that gauge equivariant convolutional networks give a unified description of equivariant and geometric deep learning by deriving a wide range of models as special cases of our theory. To illustrate our theory on a simple example and highlight the interplay between local and global symmetries we discuss an implementation for signals defined on the icosahedron, which provides a reasonable approximation of spherical signals. We evaluate the Icosahedral CNN on omnidirectional image segmentation and climate pattern segmentation, and find that it outperforms previous methods.

New information measures are needed to analyze how information is distributed over a complex system (such as a deep neural network). In 2010, Williams and Beer presented the idea of a general information decomposition framework to organize such measures. So far, the framework is missing a generally accepted realization. The talk discusses the current status of the Williams and Beer program.

Generative adversarial networks (GANs) achieve state-of-the-art performance for generating high quality images.
Key to GAN performance is the critic, which learns to discriminate between real and artificially generated images. Various divergence families have been proposed for such critics, including f-divergences (the f-GAN family) and integral probability metrics (the Wasserstein and MMD GANs). In recent GAN training approaches, these critic divergence measures have been learned using gradient regularisation strategies, which have contributed significantly to their success.
In this talk, we will introduce and analyze a data-adaptive gradient penalty as a critic regularizer for the MMD GAN. We propose a method to constrain the gradient analytically and relate it to the weak continuity of a distributional loss functional. We also demonstrate experimentally that such a regularized functional improves on the existing state-of-the-art methods for unsupervised image generation on CelebA and ImageNet.
Based on joint work with Dougal Sutherland, Mikołaj Bińkowski, and Arthur Gretton.

In this talk we investigate the use of the Wasserstein ground metric in generative models. The Wasserstein distance serves as a loss function for unsupervised learning which depends on the choice of a ground metric on sample space. We propose to use a Wasserstein distance as the ground metric on the sample space of images. This ground metric is known as an effective distance for image retrieval, since it correlates with human perception.
We derive the Wasserstein ground metric on image space and define a Riemannian Wasserstein gradient penalty to be used in the Wasserstein Generative Adversarial Network (WGAN) framework. The new gradient penalty is computed efficiently via convolutions on the L^2 (Euclidean) gradients with negligible additional computational cost. The new formulation is more robust to the natural variability of images and provides for a more continuous discriminator in sample space.

Deep Learning has thoroughly transformed the field of computer vision within the past years. Many standard problems such as image restoration, segmentation or registration, that were based on quite different modelling and optimisation approaches (e.g. PDEs, Markov Random Fields, Random Forests, ...), can now be solved within the framework of Deep Neural Networks with astonishing accuracy and speed (at prediction time). One important advantage of Deep Learning is its ability to capture the often complex statistical dependencies in image data and leverage this information for improving prediction, regression, or classification, given enough annotated data.
However, in the biomedical domain, one major limitation is the scarcity of suitably annotated data, which rules out many solutions from the computer vision domain. Here, approaches such as manifold learning, generative models, or using deep networks as structural priors are promising directions for weakly supervised or unsupervised learning in biomedical imaging. Another important aspect, in particular for medical image analysis, is the interpretability (or the lack thereof) of the fitted model.
In this talk I am going to present a selection of problems in biomedical image analysis, that would greatly benefit from Deep Learning approaches, but lack the typically required amount of annotated data. I will focus on examples from high-resolution in-vivo MRI imaging of brain structure, microscopic analysis of anatomical microstructure of the human cortex and large-scale live microscopy for stem cell biology and developmental biology.

Optimal transport (the Wasserstein metric) nowadays plays an important role in data science. In this talk, we briefly review its development and applications in machine learning. In particular, we will focus on its induced differential structure. We will introduce the Wasserstein natural gradient in parametric models. The metric tensor on probability density space is pulled back to one on parameter space. We derive the Wasserstein gradient flows and proximal operator in parameter space. We demonstrate that the Wasserstein natural gradient works efficiently in several statistical machine learning problems, including Boltzmann machines, generative adversarial networks (GANs), and variational Bayesian statistics.

A discrete statistical model is a subset of a probability simplex. Its maximum likelihood estimator (MLE) is a retraction from that simplex onto the model. We characterize all models for which this retraction is a rational function. This is a contribution via real algebraic geometry which rests on results due to Huh and Kapranov on Horn uniformization. We present an algorithm for constructing models with rational MLE, and we demonstrate it on a range of instances. Our focus lies on models like Bayesian networks, decomposable graphical models, and staged trees.
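A familiar instance of a rational MLE is the independence model for a two-way contingency table, where the estimate is the product of the row and column totals divided by the squared sample size. The sketch below is only this textbook example with hypothetical counts, not the construction algorithm from the talk.

```python
from fractions import Fraction

def independence_mle(counts):
    # MLE for the independence model on a two-way table:
    # p_ij = (row total i) * (column total j) / n^2, a rational function of the data.
    n = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    return [[Fraction(r * c, n * n) for c in col_sums] for r in row_sums]

counts = [[4, 2], [1, 3]]          # hypothetical observed counts
p_hat = independence_mle(counts)

# The estimate is a probability distribution of rank one, so it lies on the model.
total = sum(sum(row) for row in p_hat)
det = p_hat[0][0] * p_hat[1][1] - p_hat[0][1] * p_hat[1][0]

# Retraction property: data already proportional to a point on the model map to that point.
on_model = independence_mle([[3, 3], [2, 2]])
print(p_hat, total, det, on_model == p_hat)
```

Exact fractions are used so that the rationality of the estimator is visible in the output rather than hidden by floating-point rounding.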

I will talk about a new bottleneck method for learning data representations based on channel deficiency, rather than the more traditional information sufficiency. A variational upper bound allows us to implement this method efficiently. The bound itself is bounded above by the variational information bottleneck objective, and the two methods coincide in the regime of single-shot Monte Carlo approximations. The notion of deficiency provides a principled way of approximating complicated channels by relatively simpler ones. Deficiencies have a rich heritage in the theory of comparison of statistical experiments and have an operational interpretation in terms of the optimal risk gap of decision problems. Experiments demonstrate that the deficiency bottleneck can provide advantages in terms of minimal sufficiency as measured by information bottleneck curves, while retaining a good test performance in a classification task. I will also talk about an unsupervised generalization and relation to variational autoencoders. Finally, I discuss the utility of our method in estimating a quantity called the unique information which quantifies a deviation from the Blackwell order.
(Joint work with Guido Montufar, Departments of Mathematics and Statistics, UCLA)

Word embeddings are a set of techniques commonly used in natural language processing to map the words of a dictionary to a real vector space. Such a mapping is commonly learned from the contexts of the words in a text corpus, by estimating, for each word of the dictionary, a conditional probability distribution of a context word given the central word. These conditional probability distributions form a Riemannian statistical manifold, where word analogies can be computed by comparing vectors in the tangent bundle of the manifold. In this presentation we introduce a geometric framework for the study of word embeddings in the general setting of Information Geometry, and we show how the choice of the geometry allows us to define different expressions for word similarities and word analogies. The presentation is based on joint work with Riccardo Volpi.