Machine learning models, while effective in controlled environments, can fail catastrophically when exposed to unexpected conditions upon deployment. This lack of robustness, well-documented even in state-of-the-art models, can lead to severe harm in high-stakes, safety-critical application domains such as healthcare. This shortcoming raises a central question: How can we develop machine learning models we can trust?
In this talk, I will approach this question from a probabilistic perspective and address deficiencies in the trustworthiness of neural network models using Bayesian principles. Specifically, I will show how to improve the reliability and fairness of neural networks with data-driven domain-informed prior distributions over model parameters. To do so, I will first demonstrate how to perform training in neural networks with such priors using a simple variational optimization objective with a regularizer that reflects the constraints implicitly encoded in the prior. This regularizer is mathematically simple, easy to implement, and can be used as a drop-in replacement for existing regularizers when performing supervised learning in neural networks of any size. I will then show how to construct and use domain-informed data-driven priors to improve uncertainty quantification and group robustness in neural network models for selected application domains.

Sharpness of minima is a promising quantity that can correlate with generalization in deep networks and, when optimized during training, can improve generalization. However, standard sharpness is not invariant under reparametrizations of neural networks, and, to fix this, reparametrization-invariant sharpness definitions have been proposed, most prominently adaptive sharpness (Kwon et al., 2021). But does it really capture generalization in modern practical settings? We comprehensively explore this question in a detailed study of various definitions of adaptive sharpness in settings ranging from training from scratch on ImageNet and CIFAR-10 to fine-tuning CLIP on ImageNet and BERT on MNLI. We focus mostly on transformers for which little is known in terms of sharpness despite their widespread usage. Overall, we observe that sharpness does not correlate well with generalization but rather with some training parameters like the learning rate that can be positively or negatively correlated with generalization depending on the setup. Interestingly, in multiple cases, we observe a consistent negative correlation of sharpness with out-of-distribution error implying that sharper minima can generalize better. Finally, we illustrate on a simple model that the right sharpness measure is highly data-dependent, and that we do not understand well this aspect for realistic data distributions.

Statistical learning theory provides bounds on the necessary number of training samples needed to reach a prescribed accuracy in a learning problem formulated over a given target class. This accuracy is typically measured in terms of a generalization error, that is, an expected value of a given loss function. However, for several applications -- for example in a security-critical context or for problems in the computational sciences -- accuracy in this sense is not sufficient. In such cases, one would like to have guarantees for high accuracy on every input value, that is, with respect to the uniform norm. In this paper we precisely quantify the number of training samples needed for any conceivable training algorithm to guarantee a given uniform accuracy on any learning problem formulated over target classes containing (or consisting of) ReLU neural networks of a prescribed architecture. We prove that, under very general assumptions, the minimal number of training samples for this task scales exponentially both in the depth and the input dimension of the network architecture. As a corollary we conclude that the training of ReLU neural networks to high uniform accuracy is intractable. In a security-critical context this points to the fact that deep learning based systems are prone to being fooled by a possible adversary. We corroborate our theoretical findings by numerical results.
ArXiv / ICLR'23: https://arxiv.org/abs/2205.13531
https://openreview.net/forum?id=nchvKfvNeX0
Github: https://github.com/juliusberner/theory2practice
https://jberner.info/

I will revisit the problem of maximizing information divergence from a new perspective using logarithmic Voronoi polytopes. We will see that for linear models, the maximum is always achieved at the boundary of the probability simplex. For toric models, I will describe an algorithm that combines the combinatorics of the chamber complex with numerical algebraic geometry. I will pay special attention to reducible models and models of maximum likelihood degree one, with many colorful examples. This talk is based on joint work with Serkan Hoşten.

We develop an analytical framework to characterize the set of optimal ReLU neural networks by reformulating the non-convex training problem as a convex program. We show that the global optima of the convex parameterization are given by a polyhedral set and then extend this characterization to the optimal set of the non-convex training objective. Since all stationary points of the ReLU training problem can be represented as optima of sub-sampled convex programs, our work provide a general expression for all critical points of the non-convex objective. We then leverage our results to provide an optimal pruning algorithm for computing minimal networks, establish conditions for the regularization path of ReLU networks to be continuous, and develop sensitivity results for minimal ReLU networks.

Model reparametrization, which follows the change-of-variable rule of calculus (not to be confused with weight-space symmetry), is a popular way to improve the training of neural nets, e.g. in WeightNorm. But it can also be problematic since it can induce inconsistencies in, e.g., Hessian-based flatness measures, optimization trajectories, and modes of probability densities. This complicates downstream analyses: e.g. one cannot definitively relate flatness with generalization since arbitrary reparametrization changes their relationship. In this talk, I will present a study of the invariance of neural nets under reparametrization from the perspective of Riemannian geometry. From this point of view, invariance is an inherent property of any neural net if one explicitly represents the metric and uses the correct associated transformation rules. This is important since although the metric is always present, it is often implicitly assumed as identity, and thus dropped from the notation, then lost under reparametrization. I will discuss implications for measuring the flatness of minima, optimization, and for probability-density maximization. As a bonus, I will also give a teaser of our other recent work in exploiting the geometry of preconditioning matrices to develop an inverse-free, structured KFAC-like second-order optimization method for very large, modern neural nets like transformers. The resulting method is thus numerically stable in low precision and also memory efficient.

This talk will discuss some nontrivial but often pleasant effects of large learning rates, which are commonly used in machine learning practice for improved empirical performances, but defy traditional theoretical analyses. I will first quantify how large learning rates can help gradient descent in multiscale landscape escape local minima, via chaotic dynamics, which provides an alternative to the commonly known escape mechanism due to noises from stochastic gradients. I will then report how large learning rates provably bias toward flatter minimizers. Several related, perplexing phenomena have been empirically observed recently, including Edge of Stability, loss catapulting, and balancing. I will, for the first time, unify them and explain that they are all algorithmic implicit biases of large learning rates. These results are enabled by a first global convergence proof of gradient descent for certain nonconvex functions without Lipschitz gradient. This theory will also provide understanding of when there will be Edge of Stability and other large learning rate implicit biases.

Designing effective positional encodings for graphs is key to building powerful graph transformers and enhancing message-passing graph neural networks’ expressive power. However, since there lacks a canonical order of nodes in the graph-structure data, the choice of positional encodings for graphs is often tricky. For example, Laplacian eigenmap is used as positional encodings in many works. However, it faces two fundamental challenges: (1) Non-uniqueness: there are many different eigendecompositions of the same Laplacian, and (2) Instability: small perturbations to the Laplacian could result in completely different eigenvectors, leading to unpredictable changes in positional encoding. This is governed by the Davis-Kahan theorem, which further negatively impacts the model generalization. In this talk, we are to introduce some ideas on building stable positional encoding and show their benefits in model out-of-distribution generalization. The idea can be extended to some other types of node positional encodings. Finally, we evaluate the effectiveness of our method on molecular property prediction, link prediction, and out-of-distribution generalization tasks, finding improved generalization compared to existing positional encoding methods.
I will mainly talk about these two papers:
1. Equivariant and Stable Positional Encoding for More Powerful Graph Neural Networks Haorui Wang, Haoteng Yin, Muhan Zhang, Pan Li
2. On the Stability of Expressive Positional Encodings for Graphs Yinan Huang, William Lu, Joshua Robinson, Yu Yang, Muhan Zhang, Stefanie Jegelka, Pan Li

Neural networks with random weights appear in a variety of machine learning applications, most prominently as the initialization of many deep learning algorithms and as a computationally cheap alternative to fully learned neural networks. The first goal of this talk is to enhance the theoretical understanding of random neural networks by addressing the following data separation problem: under which conditions can a random neural network make two classes (with positive distance) linearly separable? I will show that a sufficiently large two-layer ReLU-network with Gaussian weights and uniformly distributed biases can solve this problem with high probability. Building on the insights behind this result, I will next present a simple randomized algorithm to produce a small interpolating neural net for a given dataset with two classes. In both results, the size of the network is explicitly linked to geometric properties of the two classes and their mutual arrangement. This instance-specific viewpoint allows to overcome the curse of dimensionality and to prove memorization results beyond worst case lower bounds.

In this talk, I will present a framework, inspired by random matrix theory, for analyzing the dynamics of stochastic optimization algorithms (e.g., stochastic gradient descent (SGD) and momentum (SGD + M)) when both the number of samples and dimensions are large. Using this new framework, we show that the dynamics of optimization algorithms on generalized linear models and multi-index problems with random data become deterministic in the large sample and dimensional limit. In particular, the limiting dynamics for stochastic algorithms are governed by an ODE. In the least square setting, from this model, we identify a stability measurement, the implicit conditioning ratio (ICR), which regulates the ability of SGD+M to accelerate the algorithm. When the batch size exceeds this ICR, SGD+M converges linearly at a rate of O(1/ κ), matching optimal full-batch momentum (in particular performing as well as a full-batch but with a fraction of the size). For batch sizes smaller than the ICR, in contrast, SGD+M has rates that scale like a multiple of the single batch SGD rate. We give explicit choices for the learning rate and momentum parameter in terms of the Hessian spectra that achieve this performance. Finally we show this model matches performances on real data sets.

In many problems arising in data science and scientific engineering, one must reconstruct a signal from few corrupted measurements, a class of problems known more broadly as inverse problems. Due to the ill-posedness of such problems, one requires a prior: a mathematical model of what makes some signals natural. Recently, there have been great empirical strides in deep generative models, a class of neural network-based models that learn to sample signals from a distribution of interest. Compared to classical priors like sparsity, generative priors lack the same understanding of when and why they work well, especially in relation to notions such as statistical complexity and computational efficiency. Here, I will discuss some theory and applications of using such priors in inverse problems. I will first present a rigorous recovery guarantee for generative priors in the nonlinear inverse problem of phase retrieval. I will show that generative models enable an efficient algorithm from generic measurements with optimal sample complexity. In the second part of the talk, I will address how one can exploit ideas from this approach to tackle problems where ground-truth data may be unavailable in practice, including problems arising from black hole imaging.

I will introduce the idea of phases and phase transitions of the Bayesian posterior in the setting of singular learning theory, and discuss how a simple auto-encoder model introduced by Anthropic in their research on neural network interpretability displays a rich set of phase transitions in both the posterior and over the course of training. I’ll explain a research program we term “developmental” interpretability that is aiming to use phase transitions as the basic primitive for understanding the internal structure of computation in neural networks.

Despite the widespread proliferation of neural networks, the mechanisms through which they operate so successfully are not well understood. In this talk, we will first explore empirical and theoretical investigations into neural network training and generalization and what they can tell us about why deep learning works. Then, we will examine a recent line of work on algorithm learning. While typical neural networks are designed for pattern matching tasks, we consider whether neural networks can learn algorithms that scale to problem instances orders of magnitude larger than those seen during training.

In this talk I’ll discuss recent work studying benign overfitting in two-layer ReLU networks trained using gradient descent and hinge loss on noisy data for binary classification. In particular, we consider linearly separable data for which a relatively small proportion of labels are corrupted and identify conditions on the margin of the clean data which give rise to three distinct training outcomes: benign overfitting, in which zero loss is achieved and with high probability test data is classified correctly; overfitting, in which zero loss is achieved but test data is misclassified with probability lower bounded by a constant; and non-overfitting, in which clean points, but not corrupt points, achieve zero loss and again with high probability test data is classified correctly. Our analysis provides a fine-grained description of the dynamics of neurons throughout training and reveals two distinct phases: in the first phase clean points achieve close to zero loss, in the second phase clean points oscillate on the boundary of zero loss while corrupt points either converge towards zero loss or are eventually zeroed by the network. We prove these results using a combinatorial approach that involves bounding the number of clean versus corrupt updates across these phases of training.

The convergence of scientific computing with statistics and machine learning is an exciting development in scientific computing. In this talk, I will present two lines of work that blur the line between statistical inference and numerical computation. The first part of the talk presents state-of-the-art algorithms for solving elliptic PDEs by interpreting them as Gaussian processes and exploiting their conditional independence properties. This approach allows the efficient learning of elliptic solution operators from solution pairs, with promising empirical results on fractional order PDEs and empirical closure models of turbulent flows.
The second part of the talk discusses how to mitigate the formation of shock singularities in the barotropic Euler equations using an inviscid regularization. This work combines the seminal work of Vladimir Arnold on geometric hydrodynamics with ideas from interior point methods for positive definite programming and the information geometry of Amari and Chentsov.

Recent years have seen substantial interest in a first-principles theoretical understanding of the behavior of overparameterized models that interpolate noisy training data, based on their surprising empirical success. In this talk, I compare classification and regression tasks in the overparameterized linear model. On the one hand, we show that with sufficient overparameterization, solutions obtained by training on the squared loss ( minimum-norm interpolation) typically used for regression, are identical to those produced by training on exponential and polynomially-tailed losses (e.g. the max-margin support-vector-machine), typically used for classification. On the other hand, we show that there exist regimes where these solutions are consistent when evaluated by the 0-1 test loss function, but inconsistent if evaluated by the mean-squared-error test loss function. Our results demonstrate that: a) different loss functions at the training (optimization) phase can yield similar solutions, and b) a significantly higher level of effective overparameterization admits good generalization in classification tasks as compared to regression tasks.

Deep learning has been wildly successful in practice and most state-of-the-art artificial intelligence systems are based on neural networks. Lacking, however, is a rigorous mathematical theory that adequately explains the amazing performance of deep neural networks. In this talk, I present a new mathematical framework that provides the beginning of a deeper understanding of deep learning. This framework precisely characterizes the functional properties of trained neural networks. The key mathematical tools which support this framework include transform-domain sparse regularization, the Radon transform of computed tomography, and approximation theory. This framework explains the effect of weight decay regularization in neural network training, the importance of skip connections and low-rank weight matrices in network architectures, the role of sparsity in neural networks, and explains why neural networks can perform well in high-dimensional problems.

Training deep neural networks for classification often includes minimizing the training loss beyond the zero training error point. In this phase of training, a "neural collapse" (NC) behavior has been empirically observed: the variability of features (outputs of the penultimate layer) of within-class samples decreases and the mean features of different classes approach a certain tight frame structure. Recent papers have shown that minimizers with this structure emerge when optimizing a simplified "unconstrained features model" (UFM) with a regularized cross-entropy loss. In this talk, I will start with a review on the empirical findings on the NC phenomenon. Then, I will present some of our theoretical results (from our publications at ICML22 and ICML23). Namely, we show that the minimizers of a UFM exhibit (a somewhat more delicate) NC structure also with regularized MSE loss, and we analyze the gradient flow in this case, identifying the different dynamics of the features' within- and between-class covariance matrices. Then, we extend the UFM in two different ways to provide mathematical reasoning for the depthwise empirical NC behavior and the effect of regularization hyperparameters on the closeness to collapse. We support our theory with experiments in practical deep learning settings.

In this talk, we will propose a new model and algorithms that go beyond worst-case rates in sequential prediction. In particular, we will focus on sequential prediction over a stochastic sequence with an adversary that is allowed to inject clean-label adversarial (or out-of-distribution) examples as and when they desire. Traditional algorithms designed to handle purely stochastic data tend to fail in the presence of such adversarial examples, often leading to erroneous predictions, whereas, assuming fully adversarial data leads to very pessimistic bounds that are often vacuous in practice. To mitigate this, we will introduce a new model of sequential prediction that sits between the purely stochastic and fully adversarial settings by allowing the learner to abstain from making a prediction at no cost on adversarial examples. Assuming access to the marginal distribution on the non-adversarial examples, we will present a learner whose error scales with the VC dimension (mirroring the stochastic setting) of the hypothesis class, as opposed to the Littlestone dimension which characterizes the fully adversarial setting. Furthermore, we will design learners for certain special hypothesis classes including VC dimension 1 classes, which work even in the absence of access to the marginal distribution. We will conclude with several open questions and plausible connections of our framework with existing models. This is based on joint work with Steve Hanneke, Shay Moran, and Abhishek Shetty.

Inspired by the remarkable success of deep neural networks, there has been significant interest in understanding the generalization performance of overparameterized models. Substantial efforts have been invested in characterizing how optimization algorithms impact generalization through their “preferred” solutions, a phenomenon commonly referred to as implicit regularization. For instance, it has been argued that gradient descent (GD) induces an implicit $\ell_2$-norm regularization in regression and classification problems. However, in prior literature, the implicit regularization of different algorithms are confined to either a specific geometry or a particular class of learning problems. To address this gap, we present a unified approach using mirror descent (MD), a notable generalization of GD, to control implicit regularization in both regression and classification settings. More specifically, we show that MD with the general class of homogeneous potential functions converges in direction to a generalized maximum-margin solution for linear classification problems, thereby answering a long-standing question in the classification setting. Furthermore, MD can be implemented efficiently and we demonstrate through experiments that MD is a versatile method to produce learned models with different regularizers and different generalization performances.

How do we ensure that the machine learning algorithms in high-stakes applications, such as hiring, lending, admissions, etc., are fair, explainable, and lawful? Towards addressing this urgent question, this talk will provide some strategies that are deep-rooted in information theory, causality, and statistics. I will discuss a question that bridges the fields of fairness, explainability, and law: how do we check if the disparity in a model is purely due to critical occupational necessities or not? We propose a systematic measure of the legally non-exempt disparity, that brings together a body of work in information theory called Partial Information Decomposition and connects it with causality. I will also briefly talk about some of our other research interests in related topics, such as quantifying accuracy-fairness tradeoffs using Chernoff Information, and also robust counterfactual explanations with theoretical guarantees.

I will describe recent progress on providing rigorous convergence guarantees for score-based generative models (SGMs) such as denoising diffusion probabilistic models (DDPMs), which constitute the backbone of large-scale real-world generative models such as DALL⋅E 2. In the first part of the talk, I will show that such SGMs can efficiently sample from essentially any realistic data distribution, even ones which are highly non-log-concave. In the second part of the talk, I will show how to extend these guarantees to deterministic samplers based on discretizing the so-called probability flow ODE, which ultimately leads to better dependence on the dimension. All of these results assume access to an oracle for score estimation; time permitting, at the end I will briefly touch upon how to provably implement this oracle for interesting classes of distributions like Gaussian mixtures.
Based on the following works: https://arxiv.org/abs/2209.11215, https://arxiv.org/abs/2303.03384, https://arxiv.org/abs/2305.11798, https://arxiv.org/abs/2307.01178

In practice, encoding group invariances into models improves sample complexity. In this talk, we study this phenomenon from a theoretical perspective. In the first part, we provide minimax optimal rates for kernel ridge regression on compact manifolds, with a target function that is invariant to a Lie group action on the manifold. For a finite group, the gain effectively multiplies the number of samples by the group size. For groups of positive dimension, the gain is observed by a reduction in the manifold’s dimension, in addition to a factor proportional to the volume of the quotient space. Our proof takes the viewpoint of differential geometry, in contrast to the more common strategy of using invariant polynomials.
In the second part, we apply our techniques to group-invariant probability distributions that appear in many data-generative models in machine learning, such as graphs, point clouds, and images, and we prove tight sample complexity bounds for the Wasserstein distance estimation under invariances, as well as the density estimation problem.
This talk is based on joint work with Stefanie Jegelka.

Recent theoretical work has shown that under certain conditions, massively overparameterized neural networks are equivalent to kernel regressors with a family of kernels called Neural Tangent Kernels (NTKs). My work in this subject aims to better understand the properties of NTK for various network architectures and relate them to the inductive bias of real neural networks. In particular, I will argue that for input data distributed uniformly on the sphere NTK favors low-frequency predictions over high-frequency ones, potentially explaining why overparameterized networks can generalize even when they perfectly fit their training data. I will further discuss the behavior of NTK when data is distributed nonuniformly and show that NTK (with ReLU activation) is tightly related to the classical Laplace kernel, which has a simple closed-form. Finally, I will discuss our analysis of NTK for convolutional networks, which indicates that these networks are biased toward learning low frequency target functions with any higher frequencies concentrated in local regions. Overall, our results suggest that much insight about neural networks can be obtained from the analysis of NTK.

Algorithmic reasoning requires capabilities which are most naturally understood through recurrent models of computation, like the Turing machine. However, Transformer models, while lacking recurrence, are able to perform such reasoning using far fewer layers than the number of reasoning steps. This raises the question: what solutions are these shallow and non-recurrent models finding? In this talk, we will formalize reasoning in the setting of automata, and show that the computation of an automaton on an input sequence of length T can be replicated exactly by Transformers with o(T) layers, which we call "shortcuts". We provide two constructions, with O(log T) layers for all automata and O(1) layers for solvable automata. Empirically, our results from synthetic experiments show that shallow solutions can also be found in practice.

Understanding the training dynamics of neural networks is quite difficult in general due to the highly nonlinear nature of the parameterization. A breakthrough in the theory of deep learning was the finding that in the infinite-width limit the gradient descent dynamics are characterized by a fixed kernel, coined the “Neural Tangent Kernel” (NTK). In this limiting regime the network is biased to learn the eigenvectors/eigenfunctions of the NTK at rates corresponding to their eigenvalues, a phenomenon known as “spectral bias”. Considerable work has been done comparing the training dynamics of finite-width networks to the idealized infinite-width dynamics. These works typically compare the dynamics of a finite-width network to the dynamics of an infinite-width network where both networks are optimized via the empirical risk. In this work we compare a finite-width network trained on the empirical risk to an infinite-width network trained on the population risk. Consequentially, we are able to demonstrate that the finite-width network is biased towards learning the top eigenfunctions of the NTK over the entire input domain, as opposed to describing the dynamics merely on the training set. Furthermore we can demonstrate that this holds in a regime where the network width is on the same order as the number of training samples, in contrast with prior works that require the unrealistic assumption that the network width is polynomially large in the number of samples. In a separate line of analysis, we characterize the spectrum of the NTK by expressing the NTK as a power series. We demonstrate that the NTK has a small number of large outlier eigenvalues and that the number of such eigenvalues is largely inherited from the structure of the input data. As a result we shed further insight into why the network places a preference on learning a small number of components quicker. In total, our results help classify the properties networks are biased towards in a variety of settings, which we hope will lead to more interpretable artificial intelligence in the long term.

High dimensional probability is a powerful tool to understand phenomena that typically occur in deep learning models. In particular, in this talk, we discuss different concentration bounds that entail consequences about (i) the gradient descent optimization of a deep neural network, (ii) the adversarial robustness of its solution, and (iii) its privacy guarantees. The first main result implies that minimally overparameterized deep neural networks can be successfully (0 training loss) optimized with gradient descent, and leverages tight lower bounds on the smallest eigenvalue of the neural tangent kernel (NTK). The second builds on the recently proposed universal law of robustness, provides sharper bounds on the random features and NTK model, and consequently addresses the conjecture opened by Bubeck, Li and Nagaraj. Finally, we draw a connection between the generalization performance of a model and its privacy guarantees, measured in terms of its safety against a family of information recovery black-box attacks.

Studying the generalization abilities of linear models with real data is a central question in statistical learning. While there exist a limited number of prior works that do validate theoretical work with real data, these works have limitations due to technical assumptions. These assumptions include having a well-conditioned covariance matrix and having independent and identically distributed data. Additionally, prior works that do address distributional shifts usually make technical assumptions on the joint distribution of the train and test data, and do not test on real data. Previous work has also shown that double descent can occur in the over-parameterized regime, and believe that the standard bias-variance trade-off holds in the under-parameterized regime.
In an attempt to address these issues and better model real data, we look at data that is not I.I.D. but has a low-rank structure. Further, we address distributional shift by decoupling assumptions on the training and test distribution. We provide analytical formulas for the generalization error of the denoising problem that are asymptotically exact. These are used to derive theoretical results for linear regression, data augmentation, principal component regression, and transfer learning. We validate all of our theoretical results on real data and have a low relative mean squared error of around 1% between the empirical risk and our estimated risk. Further, we present a simple examplein this paradigm that provably exhibits double descent in the under-parameterized regime.

It was observed that large language models exhibit a power-law decay of cross entropy with respect to the number of parameters and training tokens. When extrapolated literally, this decay implies that the entropy rate of natural language is zero. To understand this phenomenon - or an artifact - better, we construct a simple stationary stochastic process and its memory-based predictor that exhibit a power-law decay of cross entropy with the vanishing entropy rate. Our example is based on previously discussed Santa Fe processes, which decompose a random text into a process of narration and time-independent knowledge. Previous discussions assumed that narration is a memoryless source with Zipf's distribution. In this talk, we propose a model of narration that has the vanishing entropy rate and applies a randomly chosen deterministic sequence called a multiperiodic sequence. Under a suitable parameterization, multiperiodic sequences exhibit asymptotic relative frequencies given by Zipf's law. Remaining agnostic about the value of the entropy rate of natural language, we discuss relevance of similar constructions for language modeling.
Based on: https://arxiv.org/abs/2302.09049

Pruning neural networks at initialization (PaI) has received an upsurge of interest due to its end-to-end saving potential. PaI is able to find sparse subnetworks at initialization that can achieve comparable performance to the full networks. These methods can surpass the trivial baseline of random pruning but suffer from a significant performance gap compared to post-training pruning. Previous approaches firmly rely on weights, gradients, and sanity checks as primary signals when conducting PaI analysis. To better understand the underlying mechanism of PaI, we propose to interpret it through the lens of the Ramanujan Graph - a class of expander graphs that are sparse while being highly connected. It is often believed there should be a strong correlation between the Ramanujan graph and PaI since both are about finding sparse and well-connected neural networks. However, the finer-grained link relating highly sparse and connected networks to their relative performance (i.e., ranking of difference sparse structures at the same specific global sparsity) is still missing. We observe that not only the Ramanujan property for sparse networks shows no significant relationship to PaI’s relative performance, but maximizing it can also lead to the formation of pseudo-random graphs with no structural meanings. We reveal the underlying cause to be Ramanujan Graph’s strong assumption on the upper bound of the largest nontrivial eigenvalue (\mu) of layers belonging to highly sparse networks. We hence propose Iterative Mean Difference of Bound (IMDB) as a mean to relax the \mu upper bound. Likewise, we also show there exists a lower bound for \mu, which we call the Normalized Random Coefficient (NaRC), that gives us an accurate assessment for when sparse but highly connected structure degenerates into naive randomness. Finally, we systematically analyze the behavior of various PaI methods and demonstrate the utility of our proposed metrics in characterizing PaI performance. We show that subnetworks preserving better the IMDB property correlate higher in performance, while NaRC provides us with a possible mean to locate the region where highly connected, highly sparse, and non-trivial Ramanujan expanders exist.

Flatness of the loss curve is conjectured to be connected to the generalization ability of machine learning models, in particular neural networks. While empirical evidence consistently shows a strong correlation between flatness measures and generalization, the theoretical explanation for this connection remains an open problem. Our recent paper, "Relative Flatness and Generalization", proposes an explanation for this relationship by linking it to a model's robustness under feature perturbations and a measure of how representative the training data is of its underlying distribution. We establish a necessary condition for the connection between flatness and generalization to hold.
In this talk, I will provide an overview of our paper's results, focusing on the motivations, insights, and limitations of our approach.

In this talk I will present various insights on the double descent curve obtained by considering a solvable model for deep learning : the random feature model. First, I will present a fine-grained bias-variance decomposition and show how the double descent curve can be reconciled with the traditional bias-variance tradeoff. Then, I will show that two different kinds of overfitting, which are often conflated, can give rise to a “double descent” curve, and can actually occur simultaneously, leading to a triple descent curve. Finally, I will extend some of these findings to classification tasks on structured data, showing the impact of the loss function and the role of low-dimensional structures.

The ability to make sequential decisions under uncertainty is a key component of intelligence. Despite impressive breakthroughs in deep learning in the last decade, we find that scalable and generalizable decision making has so far been elusive for current artificial intelligence (AI) systems. In this talk, I will present a new framework for sequential decision making that is derived from modern generative models for language and perception. We will instantiate our framework in 3 different paradigms for sequential decision making: offline reinforcement learning (RL), online RL, and black-box optimization, and highlight the simplicity and effectiveness of this unifying framework on a range of challenging high-dimensional benchmarks for sequential decision making.

I will present, in this talk, the implicit bias of gradient flow on linear equivariant steerable networks in group-invariant binary classification. Our findings reveal that the parameterized predictor converges in direction to the unique group-invariant classifier with a maximum margin defined by the input group action. Under a unitary assumption on the input representation, we establish the equivalence between steerable networks and data augmentation. Furthermore, we demonstrate the improved margin and generalization bound of steerable networks over their non-invariant counterparts. This talk is based on a joint work with Ziyu Chen at UMass Amherst.

In this talk, I will discuss our research on understanding the infinite-width limit of neural networks. In this limit, neural networks correspond to Neural Network Gaussian Processes (NNGPs) and Neural Tangent Kernels (NTKs). I will first describe our empirical study exploring the relationship between wide neural networks and neural kernel methods. Our study resolves a variety of open questions related to infinitely wide neural networks and opens up new interesting questions. In the second half of the talk, I will discuss our recent work on scaling up infinite-width neural kernel methods to millions of data points. There are unique challenges in scaling up neural kernel methods, and I will talk about our attempts to overcome them. If there is time, I will discuss some applications of the infinite-width limit of neural networks, such as dataset distillation, neural architecture search, uncertainty quantification, and neural scaling laws.

Score-based generative models and diffusion probabilistic models have exhibited exceptional performance in various applications. A natural question that arises is whether the distribution generated by the model is closely aligned with the given data distribution. In this talk, we will explore an upper bound of the Wasserstein distance between these two distributions. Based on the theory of optimal transport, we guarantee that the framework can approximate data distributions in the space of probability measures equipped with the Wasserstein distance.
This talk is based on joint work with Ying Fan and Kangwook Lee (UW-Madison).

Stochastic finite-sum optimization problems are ubiquitous in many areas such as machine learning, and stochastic optimization algorithms to solve these problems are actively studied in the literature. However, there is a major gap between practice and theory: practical algorithms shuffle and iterate through component indices, while most theoretical analyses of these algorithms assume uniformly sampling the indices in an i.i.d. manner. In this talk, we talk about recent research efforts to close this theory-practice gap. We will discuss recent developments in the convergence analysis of shuffling-based optimization methods. We will first consider minimization algorithms, mainly focusing on stochastic gradient descent (SGD) with shuffling; we will then briefly talk about some new progress on minimax optimization methods.

Transport maps characterize probability distributions by coupling random variables using a deterministic transformation. While bijective transport maps enable sampling and density evaluations of the transformed random variable, learning the parameters of such transformations in high dimensions is challenging given few samples from an unknown target distribution. Moreover, structural choices for these transformations can have a significant impact on their performance and the optimization procedure for finding these maps. In this talk, we will present a framework for representing and learning monotone triangular maps via invertible transformations of smooth functions, and will first demonstrate that the associated optimization problem has no spurious local minima, i.e., all local minima are global minima. Second, we propose a sample-efficient adaptive algorithm that construct a sparse approximation for the map. We demonstrate how this framework can be applied for joint and conditional density estimation, likelihood-free inference, and structure learning of directed graphical models with stable generalization performance across a range of sample sizes.

First-order optimization algorithms are widely used today. Two standard building blocks in these algorithms are proximal operators (proximals) and gradients. Although gradients can be computed for a wide array of functions, explicit proximal formulas are only known for limited classes of functions. We provide an algorithm, HJ-Prox, for accurately approximating such proximals. This is derived from a collection of relations between proximals, Moreau envelopes, Hamilton-Jacobi (HJ) equations, heat equations, and importance sampling. In particular, HJ-Prox smoothly approximates the Moreau envelope and its gradient. The smoothness can be adjusted to act as a denoiser. Our approach applies even when functions are only accessible by (possibly noisy) blackbox samples. Our approach can also be embedded into a zero-order algorithm with guaranteed convergence to global minima, assuming continuity of the objective function; this is done by leveraging connections between the gradient of the Moreau envelope and the proximal operator. We show HJ-Prox is effective numerically via several examples.

Policy gradient algorithms are amongst the most widely used reinforcement learning methods and have driven a lot of recent empirical success in applications of reinforcement learning, especially with deep neural network based policies. These methods apply to complex, poorly understood control problems by performing stochastic gradient descent over a parameterized class of policies. However, even for simple control problems solvable by classical dynamic programming techniques, policy gradient algorithms face non-convex optimization problems and are widely understood to converge only to a stationary point. We make fundamental progress in developing a theory for policy gradient methods - providing insights into when and why policy gradients are guaranteed to converge to globally optimal policies. In particular, we identify structural properties of the underlying markov decision process (MDP) which ensure that despite non-convexity, the policy gradient objective function has no suboptimal stationary points. We also show that the policy gradient objective satisfies a Polyak-lojasiewicz (gradient dominance) condition when some of these conditions are strengthened, yielding convergence rates. When some of these conditions are relaxed, we provide bounds on the optimality gap of any stationary point. Our results leverage on a key connection to policy iteration, a classic and well-studied dynamic programming algorithm, via a policy gradient theorem.

The parameter space for any fixed architecture of neural networks serves as a proxy during training for the associated class of functions - but how faithful is this representation? For any fixed feedforward ReLU network architecture with at least one hidden layer, it is well-known that many different parameter settings can determine the same function. It is less well-known that the degree of this redundancy is inhomogeneous across parameter space. This inhomogeneity should impact the dynamics of training via gradient descent, especially when compared with recent work suggesting that SGD favors flat minima of the loss landscape.
In this talk, I will carefully define the notion of the local functional dimension of a feedforward ReLU network function, discuss the relationship between local functional dimension of a parameter and the geometry of the underlying decomposition of the domain into linear regions, and present some preliminary experimental results suggesting that the probability distribution of the functional dimension at initialization is both interesting and architecture-dependent. Some of this work is joint with Kathryn Lindsey, Rob Meyerhoff, and Chenxi Wu, and some is joint with Kathryn Lindsey and David Rolnick.

We revisit the challenging problem of training Gaussian-Bernoulli restricted Boltzmann machines (GRBMs), introducing two innovations. We propose a novel Gibbs-Langevin sampling algorithm that outperforms existing methods like Gibbs sampling. We modified the contrastive divergence (CD) algorithm in two ways: 1) adding variance-dependent initial step sizes for negative sampling; 2) drawing initial negative samples from Gaussian noise. We show this modified CD along with gradient clipping is enough to robustly train GRBMs with large learning rates, thus removing the need for various tricks in the literature. Moreover, it enables GRBMs to generate samples starting from noise, thus allowing direct comparisons with deep generative models and improving evaluation protocols in the RBM literature. Experiments on Gaussian Mixtures, MNIST, FashionMNIST, and CelebA show GRBMs can generate good samples, despite their single-hidden-layer architecture. Our code is released: https://github.com/lrjconan/GRBM.

The recipe behind the success of deep learning has been the combination of neural networks and gradient-based optimization. Understanding the behavior of gradient descent however, and particularly its instability, has lagged behind its empirical success. To add to the theoretical tools available to study gradient descent we propose the principal flow (PF), a continuous time flow that approximates gradient descent dynamics. To our knowledge, the PF is the only continuous flow that captures the divergent and oscillatory behaviors of gradient descent, including escaping local minima and saddle points. Through its dependence on the eigendecomposition of the Hessian the PF sheds light on the recently observed edge of stability phenomena in deep learning. Using our new understanding of instability we propose a learning rate adaptation method which enables us to control the trade-off between training stability and test set evaluation performance.

Conventionally, the stability of stochastic gradient descent (SGD) is understood through a linear stability analysis, where the mean and variance of the parameter or the gradients are examined to determine the stability of SGD close to a stationary point. In this seminar, we discuss the limitations of linear stability theories and motivate a new notion of stability, which we call the probabilistic stability. We first explain why this notion of stability is especially suitable for understanding SGD at a large learning rate and a small batch size in toy problems. Then, with this new notion of stability, we study the implicit bias of SGD and show that SGD at a large learning rate converges to low-rank saddles in matrix factorization problems.
The talk is mainly based on the following two works:
[1] Liu Ziyin, Botao Li, James B. Simon, Masahito Ueda. SGD with a Constant Large Learning Rate Can Converge to Local Maxima. ICLR 2022.
[2] The Probabilistic Stability of SGD. (tentative title, in preparation)

Ordinary differential equation (ODE) is widely used in modeling biological and physical processes in science. In this talk, I will discuss a new reproducing kernel-based approach for estimation and inference of ODE given noisy observations. We do not assume the functional forms in ODE to be known, or restrict them to be linear or additive, and we allow pairwise interactions. We construct confidence intervals for the estimated signal trajectories, and establish the estimation optimality and selection consistency of kernel ODE under both the low-dimensional and high-dimensional settings, where the number of unknown functionals can be smaller or larger than the sample size. Our proposal builds upon the smoothing spline analysis of variance (SS-ANOVA) framework, but tackles several important problems that are not yet fully addressed, and thus extends the scope of existing SS-ANOVA too. Our proposal is also advantageous, in terms of statistical inference with noisy observations, compared to the existing physics-informed neural networks and sparsity-based methods.

Existing analyses of neural network training often operate under the unrealistic assumption of an extremely small learning rate. This lies in stark contrast to practical wisdom and empirical studies, such as the work of J. Cohen et al. (ICLR 2021), which exhibit startling new phenomena (the "edge of stability" or "unstable convergence") and potential benefits for generalization in the large learning rate regime. Despite a flurry of recent works on this topic, however, the latter effect is still poorly understood. In this paper, we take a step towards understanding genuinely non-convex training dynamics with large learning rates by performing a detailed analysis of gradient descent for simplified models of two-layer neural networks. For these models, we provably establish the edge of stability phenomenon and discover a sharp phase transition for the step size below which the neural network fails to learn "threshold-like" neurons (i.e., neurons with a non-zero first-layer bias). This elucidates one possible mechanism by which the edge of stability can in fact lead to better generalization, as threshold neurons are basic building blocks with useful inductive bias for many tasks.

The MBO scheme is a highly performant scheme used for data clustering. Given some data, one constructs the similarity graph associated to the data points. The goal is to split the data into meaningful clusters. The algorithm produces the clusters by alternating between diffusion on the graph and pointwise thresholding. In this talk I will present the first theoretical studies of the scheme in the large data limit. We will see how the final state of the algorithm is asymptotically related to minimal surfaces in the data manifold and how the dynamics of the scheme is asymptotically related to the trajectory of steepest descent for surfaces, which is mean curvature flow. The tools employed are variational methods and viscosity solutions techniques. Based on joint work with Tim Laux (U Bonn).

Modern machine learning methods, in particular deep learning approaches, have enjoyed unparalleled success in a variety of challenging application fields like image recognition, medical image reconstruction, and natural language processing. While a vast majority of previous research in machine learning mainly focused on constructing and understanding models with high predictive power, consensus has emerged that other properties like stability and robustness of models are of equal importance and in many applications essential. This has motivated researchers to investigate the problem of adversarial training (or how to make models robust to adversarial attacks), but despite the development of several computational strategies for adversarial training and some theoretical development in the broader distributionally robust optimization literature, there are still several theoretical questions about it that remain relatively unexplored. In this talk, I will take an analytical perspective on the adversarial robustness problem and explore three questions: 1)What is the connection between adversarial robustness and inverse problems?, 2) Can we use analytical tools to find lower bounds for adversarial robustness problems?, 3) How do we use modern tools from analysis and geometry to solve adversarial robustness problems? At its heart, this talk is an invitation to view adversarial machine learning through the lens of mathematical analysis, showcasing a variety of rich connections with perimeter minimization problems, optimal transport, mean field PDEs of interacting particle systems, and min-max games in spaces of measures.

Empirical observation of high dimensional phenomena, such as the double descent behavior, has attracted a lot of interest in understanding classical techniques such as kernel methods, and their implications to explain generalization properties of neural networks that operate close to kernel regime. Many recent works analyze such models in a certain high-dimensional regime where the covariates are generated by independent sub-Gaussian random variables transformed through a covariance matrix and the number of samples and the number of covariates grow at a fixed ratio (i.e. proportional asymptotics). In this work we show that for a large class of kernels, including the neural tangent kernel of fully connected networks, kernel methods can only perform as well as linear models in this regime.
More surprisingly, when the data is generated by a Gaussian process model where the relationship between input and the response could be very nonlinear, we show that linear models are in fact optimal, i.e. linear models achieve the minimum risk among all models, linear or nonlinear.
These results suggest that more complex models for the data other than independent features are needed for high-dimensional analysis.

Large-scale optimization is at the forefront of modern data science, scientific computing, and applied mathematics with areas of interest, including high-dimensional PDE, inverse problems, machine learning, etc. First-order methods are workhorses for large-scale optimization due to modest computational cost and simplicity of implementation. Nevertheless, these methods are often agnostic to the structural properties of the problem under consideration and suffer from slow convergence, being trapped in bad local minima, etc. Natural gradient descent is an acceleration technique in optimization that takes advantage of the problem's geometric structure and preconditions the objective function's gradient by a suitable "natural" metric. Hence parameter update directions correspond to the steepest descent on a corresponding "natural" manifold instead of the Euclidean parameter space rendering a parametrization invariant descent direction on that manifold. Despite its success in statistical inference and machine learning, the natural gradient descent method is far from a mainstream computational technique due to the computational complexity of calculating and inverting the preconditioning matrix. This work aims at a unified computational framework and streamlining the computation of a general natural gradient flow via the systematic application of efficient tools from numerical linear algebra. We obtain efficient and robust numerical methods for natural gradient flows without directly calculating, storing, or inverting the dense preconditioning matrix. We treat Euclidean, Wasserstein, Sobolev, and Fisher–Rao natural gradients in a single framework for a general loss function.

Neural networks are powerful feature extractors - but which features do they extract from their data? And how does the structure of the training data shape the representations they learn? We discuss two recent works in this direction. We first develop a simple model for images and show that a neural network trained on these images can learn a convolution from scratch [1]. This pattern-formation process is driven by a combination of translation-invariance of the "images" and the non-Gaussian, higher-order statistics of the inputs. Second, we conjecture a "distributional simplicity bias" whereby neural networks learn increasingly complex distributions of their inputs during training. We present analytical and experimental evidence for this conjecture, going from a simple perceptron up to deep ResNets [2].References:[1] Ingrosso & Goldt, PNAS 119 (40), e2201854119, arXiv:2202.00565[2] Refinetti, Ingrosso & Goldt, in preparation.

We study the optimization landscape of deep linear neural networks with the square loss. It is known that, under weak assumptions, there are no spurious local minima and no local maxima. However, the existence and diversity of non-strict saddle points, which can play a role in first-order algorithms' dynamics, have only been lightly studied. We go a step further with a full analysis of the optimization landscape at order 2. We characterize, among all critical points, which are global minimizers, strict saddle points, and non-strict saddle points. We enumerate all the associated critical values. The characterization involves conditions on the ranks of partial matrix products, and sheds some light on global convergence or implicit regularization that have been proved or observed when optimizing linear neural networks. In passing, we provide an explicit parameterization of the set of all global minimizers and exhibit large sets of strict and non-strict saddle points.

We will study the dynamics of Gradient Flow when training a depth 2 ReLU network on univariate data in a binary classification setting. Assuming our data is labelled by a teacher network with width , we will show a convergence guarantee that holds if the learned student network is sufficiently over-parameterized. Moreover, we will show that the implicit bias in our setting is such that the student network converges to a network with at most linear regions, which results in an intuitive generalization bound. Overall, this provides an end-to-end learnability result in this setting, demonstrating the interplay between optimization and generalization: Over-parameterization facilitates optimization, while the implicit bias prevents overfitting.

Understanding the fundamental principles behind the massive success of neural networks is one of the most important open questions in deep learning. However, due to the highly complex nature of the problem, progress has been relatively slow. In this note, through the lens of infinite-width networks, a.k.a. neural kernels, we present one such principle resulting from hierarchical localities. It is well-known that the eigenstructure of infinite-width multilayer perceptrons (MLPs) depends solely on the concept frequency, which measures the order of interactions. We show that the topologies from deep convolutional networks (CNNs) restructure the associated eigenspaces into finer subspaces. In addition to frequency, the new structure also depends on the concept space, which measures the spatial distance among nonlinear interaction terms. The resulting fine-grained eigenstructure dramatically improves the network's learnability, empowering them to simultaneously model a much richer class of interactions, including Long-Range-Low-Frequency interactions, Short-Range-High-Frequency interactions, and various interpolations and extrapolations in-between. Additionally, model scaling can improve the resolutions of interpolations and extrapolations and, therefore, the network's learnability. Finally, we prove a sharp characterization of the generalization error for infinite-width CNNs of any depth in the high-dimensional setting. Two corollaries follow: (1) infinite-width deep CNNs can break the curse of dimensionality without losing their expressivity, and (2) scaling improves performance in both the finite and infinite data regimes.

Given one optimization problem, consider pairing it with another by smoothly parametrizing the domain. This is done either for practical purposes (e.g., to use smooth optimization algorithms with good guarantees) or for theoretical purposes (e.g., to reveal that the landscape satisfies a strict saddle property). In both cases, the central question is: how do the landscapes of the two problems relate? Surprisingly, the relation is often determined by the parametrization itself; it is almost entirely independent of the cost function. In this talk, I will present a new geometric framework for studying parametrizations according to their effect on landscapes. Applications include: optimization over low-rank matrices and tensors by optimizing over a factorization; the Burer-Monteiro approach to semidefinite programs; training neural networks by optimizing over their weights and biases; and quotienting out symmetries. Joint with Eitan Levin (Caltech) and Nicolas Boumal (EPFL).

Modern neural networks are usually over-parameterized—the number of parameters exceeds the number of training data. In this case the loss functions tend to have many (or even infinite) global minima, which imposes an additional challenge of minima selection on optimization algorithms besides the convergence. Specifically, when training a neural network, the algorithm not only has to find a global minimum, but also needs to select minima with good generalization among many other bad ones. In this talk, I will share a series of works studying the mechanisms that facilitate global minima selection of optimization algorithms. First, with a linear stability theory, we show that stochastic gradient descent (SGD) favors flat and uniform global minima. Then, we build a theoretical connection of flatness and generalization performance based on a special structure of neural networks. Next, we study the global minima selection dynamics—the process that an optimizer leaves bad minima for good ones—in two settings. For a manifold of minima around which the loss function grows quadratically, we derive effective exploration dynamics on the manifold for SGD and Adam, using a quasistatic approach. For a manifold of minima around which the loss function grows subquadratically, we study the behavior and effective dynamics for GD, which also explains the edge of stability phenomenon.

Dynamic processes have been modeled successfully for hundreds of years, often using ordinary or partial differential equations. Using data-driven methods, these processes can now also be inferred directly from measurements. In this talk, I will provide an overview of our work in this direction. I will discuss learning differential equations on reduced spaces, how to utilize numerical integration schemes to train neural networks for stochastic dynamics, and close with an alternative view on system identification with the Koopman operator framework.

Deep learning comes with excessive demands for data. In this talk, I will present my recent work on showing that not all data is necessary for training an accurate predictor. In particular, one can drop “easy-to-learn” examples, and do just as well as learning on all of the data. Given this disparate “importance” of training data on generalization, I will present empirical analysis of the loss landscape derived from different subsets of the training examples. I will then look into how the training dynamics are influenced by easy versus hard data.

Wasserstein GANs (WGANs) are based on the idea of minimising the Wasserstein distance between a real and a generated distribution. In this talk, we provide an in-depth mathematical analysis of differences between the theoretical setup and the reality of training WGANs. We gather both theoretical and empirical evidence that the WGAN loss is not a meaningful approximation of the Wasserstein distance. Moreover, we argue that the Wasserstein distance is not even a desirable loss function for deep generative models, and conclude that the success of WGANs can be attributed to a failure to approximate the Wasserstein distance.

This talk will discuss two aspects of optimization geared towards improving generalization in Deep Learning, involving: i) regularization effect of SGD, and ii) model averaging. The first part of the talk will describe a mechanism that captures the implicit *Regularization Effect of SGD*. We show that it depends on the Fisher Information Matrix and replicate SGD’s implicit regularization effect with an explicit form. We will then delve into OOD generalization where we demonstrate that the difference in performance of deep models on in-domain validation sets and distribution-shifted test sets varies in a chaotic manner during training using ERM. This leads to poor model selection when using in-domain validation data for early stopping, and makes the trained model unreliable. We discuss this problem and propose *Model Averaging* as an approach of mitigation, leading to state-of-the-art performance on the DomainBed benchmark. I will conclude with a preview of my future research directions in the broad area of OOD generalization. Bio:I am currently a senior research scientist at Salesforce AI Research. Prior to this, I was a postdoc at Mila with Prof. Yoshua Bengio. I received my PhD and M.S. degrees in Computer Science and Engineering from University at Buffalo, and B.S. degree in Electrical Engineering from IIT-BHU, India.I am interested in representation learning analysis and algorithms for supervised/self-supervised learning, generative modeling, and more recently out of distribution generalization. My interest also lies at the intersection of optimization and generalization in deep learning, specifically, aspects of deep learning optimization that improve generalization. I have recently also worked on time series analytics in which I designed deep learning algorithms for forecasting and anomaly detection. I am currently developing a toolbox for causal discovery for tabular and time series data, as well as conducting research in this area.

Many applications involve non-Euclidean data, such as graphs, strings, or matrices. Exploiting such geometric structure in optimization can deliver algorithms that are computationally superior to standard nonlinear programming approaches. This observation has resulted in an increasing interest in Riemannian methods in the optimization and machine learning community. In this talk, we consider the problem of optimizing a function on a (Riemannian) manifold subject to convex constraints. Several classical problems that arise in Machine Learning applications can be phrased as constrained optimization on manifolds. This includes matrix-valued tasks, such as barycenter problems, as well as the computation of Brascamp-Lieb constants. The latter is of central importance in many areas of mathematics and computer science through connections to maximum likelihood estimators in Gaussian models, Tyler’s M-estimator of scatter matrices and Operator Scaling. We introduce Riemannian Frank-Wolfe methods, a class of first-order methods for solving constrained optimization problems on manifolds and present a global, non-asymptotic convergence analysis. We further discuss a class of CCCP-style algorithms for Riemannian “difference of convex” functions and explore connections to constrained optimization. For both methods we discuss applications to the two problems described above. Based on joint work with Suvrit Sra.

We introduce and explore the problem of Selectively Forgetting a particular subset of the data used for training a deep neural network. We propose an initial method for "scrubbing'" the weights clean by minimizing the Forgetting Lagrangian, such that any probing function of the weights is indistinguishable from the same function applied to the weights of a network trained without the data to be forgotten. This method is improved upon by using a deterministic update of the linearized version of the model inspired by NTKs, which enables us to bound the extracted information in a black-box setting. This inspired us to propose Linear Quadratic Fine-tuning (LQF), which is the first method for linearizing a pre-trained model which achieves comparable performance to non-linear fine-tuning. LQF allows us to exploit the strength of deep neural networks while enjoying the theoretical properties of convex optimization. Using LQF, we introduce a novel notion of forgetting in the Mixed-Privacy setting which enables us to perform forgetting for large scale vision tasks, while providing theoretical guarantees. We further extend this mixed privacy setting to Differential Privacy (DP) and introduce AdaMix, an adaptive DP algorithm which exploits few-shot public data to improve the privacy trade-off for practical vision tasks.

In this talk I am going to talk about two problems that Message Passing Neural Networks (MPNNs) have been shown to be struggling from. The first one -- known as over-squashing -- is unavoidable in the MPNN class and concerns the input graph topology. This relates to how information propagates in a graph. We show that discrete curvature quantities (old and new) could help us understanding where messages are being lost and we can provably characterize the over-squashing phenomenon in terms of curvature. The second problem consists in analysing GNNs as multi-particle dynamics using the lens of gradient flows of an energy. We investigate what happens when instead of learning the MPNN equations we learn an energy and then let the equations follow the gradient flow of such energy. This allows us to understand further the role of the channel-mixing matrix that is ubiquitous in standard graph convolutional models as a bilinear potential inducing both attraction and repulsion along edges via its positive and negative eigenvalues respectively.

Understanding how deep neural networks generalize remains notoriously challenging in theory. This talk will motivate and examine a simpler question, that is, the generalization of high-dimensional linear regression models, using several empirical high-performing real-world neural-network-induced settings as a testbed (e.g. the empirical NTK of a pretrained ResNet applied to CIFAR-100). We find that, perhaps surprisingly, even in these linear settings, most existing theoretical analyses for linear/kernel regression fail to qualitatively capture the empirical generalization phenomena. On the other hand, a random matrix theory hypothesis gives rise to an estimator that accurately predicts generalization. Based on https://arxiv.org/abs/2203.06176 (ICML 2022). Bio: Wei Hu is a postdoc at UC Berkeley and an incoming assistant professor at the University of Michigan. He obtained his Ph.D. from Princeton University advised by Sanjeev Arora, and his bachelor's degree from Tsinghua University. He is interested in the theoretical and scientific foundations of modern machine learning, with a particular focus on deep learning.

Kernel ridge regression (KRR) has recently attracted renewed interest due to its potential for explaining the transient effects, such as double descent, that emerge during neural network training. In this work, we study how the alignment between the target function and the kernel affects the performance of the KRR. We focus on the truncated KRR (TKRR) which utilizes an additional parameter that controls the spectral truncation of the kernel matrix. We show that for polynomial alignment, there is an over-aligned regime, in which TKRR can achieve a faster rate than what is achievable by full KRR. The rate of TKRR can improve all the way to the parametric rate, while that of full KRR is capped at a sub-optimal value. This shows that target alignemnt can be better leveraged by utilizing spectral truncation in kernel methods. We also consider the bandlimited alignment setting and show that the regularization surface of TKRR can exhibit transient effects including multiple descent and non-monotonic behavior. Our results show that there is a strong and quantifable relation between the shape of the alignment spectrum and the generalization performance of kernel methods, both in terms of rates and in finite samples.

Over-parameterized machine learning models lead to excellent performance in a variety of applications. In this talk we consider the effect of over-parameterization on stochastic optimization algorithms. We discuss how over-parameterization allows us to use a constant step size within stochastic gradient methods, and that this leads to a faster convergence rate. We also present algorithms with provably-faster convergence rates in the over-parameterized setting. Finally, we discuss how over-parameterization allows us to update the learning rate during the training procedure which leads to improved performance over a variety of previous approaches.

While over-parameterization is widely believed to be crucial for the success of optimization for the neural networks, most existing theories on over-parameterization do not fully explain the reason -- they either work in the Neural Tangent Kernel regime where neurons don't move much, or require an enormous number of neurons. In this talk I will describe our recent works towards understanding training dynamics that go beyond kernel regimes with only polynomially many neurons (mildly overparametrized). In particular, we first give a local convergence result for mildly overparametrized two-layer networks. We then analyze the global training dynamics for a related overparametrized tensor model. For both works, we rely on a key intuition that neurons in overparametrized models work in groups and it's important to understand the behavior of an average neuron in the group. Based on two works: https://arxiv.org/abs/2102.02410 and https://arxiv.org/abs/2106.06573

In this talk we discuss almost sure convergence of Stochastic Gradient Descent in discrete and continuous time for a given twice continuously-differentiable target function F. In a first step we give assumptions on the step-sizes and perturbation size to ensure convergence of the target value F and gradient of F assuming that grad F is locally Hölder-continuous. This result entails convergence of the iterates itself in the case where F does not possess a continuum of critical points. In a general non-convex setting with F possibly containing a rich set of critical points, convergence of the process itself is sometimes taken for granted, but actually is a non-trivial issue as there are solutions to the gradient flow ODE for smooth target functions that stay in a compact set but do not converge. Using the Lojasiewicz-inequality we give sharp bounds on the step-sizes and the size of the perturbation in order to guarantee convergence of the SGD scheme for analytic target functions. Also, we derive the convergence rate of the function value under the assumptions that F satisfies a particular Lojasiewicz-inequality with exponent in [1/2,1). Finally, we compare the discrete and continuous time results and discuss optimality of the assumptions. This is joint work with Steffen Dereich (WWU Münster).

Wasserstein GANs (WGANs) are based on the idea of minimising the Wasserstein distance between a real and a generated distribution. In this talk, we provide an in-depth mathematical analysis of differences between the theoretical setup and the reality of training WGANs. We gather both theoretical and empirical evidence that the WGAN loss is not a meaningful approximation of the Wasserstein distance. Moreover, we argue that the Wasserstein distance is not even a desirable loss function for deep generative models, and conclude that the success of WGANs can be attributed to a failure to approximate the Wasserstein distance.

There is an increasing realization that algorithmic inductive biases are central in preventing overfitting; empirically, we often see that machine learning models trained by stochastic gradient descent (SGD) generalize well, even in the overparameterized settings and under little to no explicit regularization. This work considers this issue in arguably the most basic setting: averaged or last-iterate SGD for linear regression in the overparameterized regime. Our main result provides a sharp excess risk bound, stated in terms of the full eigenspectrum of the data covariance matrix, that reveals a bias-variance decomposition characterizing when generalization is possible: (i) the variance bound is characterized in terms of an effective dimension (specific for SGD) and (ii) the bias bound is characterized in terms of the optimal model parameter and how it aligns with the data covariance matrix. Moreover, for SGD with iterate averaging, we prove a matching lower bound (up to constant factors); for last iterate SGD (with decaying stepsizes), we provide a matching lower bound (up to constant factors) when the signal-to-noise ratio is bounded. The above results have the following additional implications:(1) We compare the implicit regularization afforded by SGD with the explicit regularization of ridge regression. We find that, up to the logarithmic factors, the generalization performance of SGD is always no worse than that of ridge regression in a wide range of overparameterized problems, and, in fact, could be much better for some problem instances. (2) We study the effects of transfer learning with pretraining and finetuning in the presence of covariate shift. We find that for a large class of linear regression instances, transfer learning with O(N^2) source data (and scarce or no target data) is as effective as supervised learning with N target data. References:https://arxiv.org/abs/2103.12692https://arxiv.org/abs/2108.04552https://arxiv.org/abs/2110.06198

Models built with deep neural network (DNN) can handle complicated real-world data extremely well, without suffering from the curse of dimensionality or the non-convex optimization. To contribute to the theoretical understanding of deep learning, we will investigate the nonparametric aspects of DNNs by addressing the following questions: (i) what kind of data can be best learned by deep neural networks？ (ii) can deep neural networks achieve the statistical optimality? (iii) is there any algorithmic guarantee for obtaining such optimal neural networks? Our theoretical analysis applies to two most fundamental setup in practice: regression and classification.

Designing deep learning models for practical applications, including those in machine learning, is an art that often involves an expensive search over candidate architectures. We develop novel frameworks to facilitate the process of designing deep learning models via three principled approaches: optimization, numerical analysis, and statistical modeling. First, using optimization as a tool, we develop new families of momentum-based deep learning models that take advantage of momentum methods to improve the convergence speed and ability to capture long-range dependencies in models, including deep neural networks and neural ordinary differential equations. Second, we employ numerical methods such as diffusion with a source term to develop new graph neural networks with better accuracy and efficiency. Finally, we develop new generative models that help explain the attention mechanism in transformers and suggest new directions to improve the performance of these models.

Modern machine learning models, particularly those used in deep networks, are characterized by massive numbers of parameters trained on large data sets. While these large-scale models have had tremendous practical successes, developing theoretical methods that can rigorously explain when and why these models work, has been a major issue in the field. This task is even made harder by the non-convexity of the underlying learning problems. In this talk, we shed light on the theoretical understanding of the asymptotics of learning for two popular neural network models, namely, Generalized Linear Models (GLMs) and Recurrent Neural Networks (RNNs). First, we investigate the generalizability of single-layer neural networks (i.e., GLMs) over previously unseen data. We provide a general framework to characterize the asymptotic generalization error for GLMs with arbitrary non-linearities, making it applicable to regression as well as classification problems. This framework enables analyzing the effect of (i) over-parameterization and non-linearity during modeling; (ii) choices of the loss function, initialization, and regularizer during learning; and (iii) mismatch between training and test distributions. We also rigorously and analytically explain the double descent phenomenon in generalized linear models. Secondly, we explore the learning dynamics of recurrent neural networks under gradient descent. We focus on a subclass of RNNs with linear activations and provide precise reasoning for the common knowledge suggesting that RNNs do not perform well on tasks requiring long-term memory. Using recently-developed kernel regime analysis, our main result shows that linear RNNs learned from random initializations are functionally equivalent to a certain weighted 1D-convolutional network. Importantly, the weightings in the equivalent model cause an implicit bias to elements with smaller time lags in the convolution and hence shorter memory. We show that the degree of this bias depends on the variance of the transition matrix at initialization. Our theories are validated with both synthetic and real data experiments.

We examine two data science questions motivated by two biological problems: (1) modeling time varying microbial communities and (2) modeling shapes and surfaces for evolutionary applications. Question 1: Under what conditions are Bayesian methods for inference in deterministic dynamical systems with observation noise consistent. We will highlight the relation between Bayesian inference and a classical idea in dynamical systems called the thermodynamic formalism. We will provide a partial answer to Question 1 based on a variational characterization of a partition function. Question 2: Given a set of shapes the synchronization problem is to consistently register or aligning the shapes. We develop a geometric framework that characterizes the synchronization problem. We use classic tools from the theory of fibre bundles in the geometric characterization. We then generalize the graph theoretic notion of a Laplacian to quantify obstructions to synchronization, specifically we develop a twisted Hodge theory. We then state an algorithm that learns group actions and demonstrate it's efficacy by clustering the molars of primates based on their shape. We will see the clusters correspond to eating habits.

Modern neural networks often have great expressive power and can be trained to overfit the training data, while still achieving a good test performance. This phenomenon is referred to as “benign overfitting”. Recently, there emerged a line of works studying “benign overfitting” from the theoretical perspective. However, they are limited to linear models or kernel/random feature models, and there is still a lack of theoretical understanding about when and how benign overfitting occurs in neural networks. In this talk, I will give an answer to the above question. We precisely characterize the conditions under which benign overfitting can occur in training two-layer convolutional neural networks. We show that when the signal-to-noise ratio satisfies a certain condition, a two-layer CNN trained by gradient descent can achieve arbitrarily small training and test loss. On the other hand, when this condition does not hold, overfitting becomes harmful and the obtained CNN can only achieve a constant level of test loss. These together demonstrate a sharp phase transition between benign overfitting and harmful overfitting, driven by the signal-to-noise ratio. This talk is based on joint work with Yuan Cao, Mikhail Belkin, and Quanquan Gu.

The mysterious ability of neural networks to generalize is believed to stem from an implicit regularization — a tendency of gradient-based optimization to fit training data with predictors of low “complexity.” Despite vast efforts, a satisfying formalization of this intuition is lacking. In this talk I will present a series of works theoretically analyzing the implicit regularization in matrix and tensor factorizations, known to be equivalent to certain linear and non-linear neural networks, respectively. Through dynamical characterizations I will establish an implicit regularization towards low rank (for corresponding notions of rank), different from any type of norm minimization, in contrast to prior beliefs. I will then discuss implications of this finding to both theory (possible explanation for generalization over natural data) and practice (compression of neural network layers, novel regularization schemes). Overall, our results highlight the potential of ranks to explain and improve generalization in deep learning. Works covered in this talk were done in collaboration with Asaf Maman and Nadav Cohen.

In this talk, I will present some implications of the permutation-symmetry in neural networks on the shape of the loss landscape. I will first introduce a type of saddle point, so-called symmetry-induced saddles, emerging from a particular arrangement of neurons in deep neural networks. Then we will describe the precise geometry of the global minima manifold of overparameterized networks in a teacher-student setting. Counting the possible arrangements of neuron groups inside a neural network, we will give the numbers of symmetry-induced saddle manifolds and the components of the global minima manifold in terms of the student and the teacher widths. Our analysis shows that overparameterization gradually smoothens the landscape due to a faster scaling of the global minima manifold components than the symmetry-induced saddle manifolds. Yet the landscape exhibits roughness in mildly overparameterized networks; we empirically show that gradient-based training finds a zero-loss solution only for a fraction of initializations in this regime.

In the era of big data, the training data for machine learning models is commonly collected from multiple sources. Some of these might not be unreliable (noisy, corrupted, or even manipulated). Can learning algorithms overcome this an still learn classifiers of optimal accuracy and ideally fairness? In my talk, I highlight recent results from our group that establish situations in which this is possible or impossible.

In this talk, I will discuss a remarkable phenomenon of wide neural networks: transition to linearity. The phenomenon can be described as follow: as network width increases to infinity, the neural network becomes a linear model with respect to its parameters, in any O(1)-neighborhoods of the random initialization. This phenomenon underlines the widely known constancy of neural tangent kernel of certain infinitely wide neural networks. Aiming to give a more intuitive understand, I will describe in a geometric point of view, how this somehow unexpected phenomenon, as well as the constancy of NTK, arises as a natural consequence of assembling a large number of submodels.

Reinforcement learning (RL) is notoriously sample inefficient, prompting a large body of prior work to study how unsupervised pretraining can improve sample efficiency when solving downstream RL tasks. One approach is unsupervised skill discovery, a class of algorithms that learn a set of policies without access to a reward function. While prior work has shown that these methods learn skills that can accelerate downstream RL tasks, it remains unclear whether these methods always learn useful skills. What does it even mean for a skill (or set of skills) to be probably useful? In this talk, I'll share some recent work that provides some of the first answers to this question: existing methods are optimal one one sense but very suboptimal in a different sense. These results have implications for how users might want to use skill learning algorithms, provide some (surprisingly simple!) tools for analysing skill learning methods, and suggest exciting opportunities for designing new, provably-useful skill learning algorithms.

Most practical algorithms for supervised machine learning boil down to optimizing the average performance over a training dataset. However, it is increasingly recognized that although the optimization objective is the same, the manner in which it is optimized plays a decisive role in the properties of the resulting predictor. For example, when training large neural networks, there are generally many weight combinations that will perfectly fit the training data. However, gradient-based training methods somehow tend to reach those which, on the one hand, do not overfit, and on the other hand, are very brittle to adversarially crafted examples. Why the dynamics of these methods lead to such "implicit biases" is still far from being fully understood. In this talk, I'll describe several recent theoretical results related to this question, in the context of benign overfitting and adversarial examples. Based on joint work with Gal Vardi and Gilad Yehudai.

We study the convergence properties of gradient descent for training deep linear neural networks, i.e., deep matrix factorizations, by extending a previous analysis for the related gradient flow. We show that under suitable conditions on the step sizes gradient descent converges to a critical point of the loss function, i.e., the square loss in this work. Furthermore, we demonstrate that for almost all initializations gradient descent converges to a global minimum in the case of two layers. In the case of three or more layers we show that gradient descent converges to a global minimum on the manifold matrices of some fixed rank, where the rank cannot be determined a priori.

Since the proposal of the graph neural network (GNN) by Gori et al. (2005) and Scarselli et al. (2008), one of the major problems in training GNNs was their struggle to propagate information between distant nodes in the graph. We propose a new explanation for this problem: GNNs are susceptible to a bottleneck when aggregating messages across a long path. This bottleneck causes the over-squashing of exponentially growing information into fixed-size vectors. As a result, GNNs fail to propagate messages originating from distant nodes and perform poorly when the prediction task depends on long-range interaction. Initially, we hypothesized that thanks to their attention mechanism, Graph Attention Networks (GAT) are more resistant to over-squashing than GNNs that absorb incoming edges equally, such as GCN and GIN. However, we found that GAT computes a very limited kind of attention: the ranking of the attention scores is unconditioned on the query node. We formally define this restricted kind of attention as static attention and distinguish it from a strictly more expressive dynamic attention. Because GATs use a static attention mechanism, there are simple graph problems that GAT cannot express. To remove this limitation, we introduce a simple fix by modifying the order of operations and propose GATv2, which is now available as part of the PyTorch Geometric library, the Deep Graph Library, and the TensorFlow GNN library. The talk will focus on the following papers:On the Bottleneck of Graph Neural Networks and its Practical Implications (ICLR'2021): https://arxiv.org/pdf/2006.05205How Attentive are Graph Attention Networks? (ICLR'2022): https://arxiv.org/pdf/2105.14491

Since the celebrated works of Russo and Zou (2016, 2019) and Xu and Raginsky (2017), it has been well known that the generalization error of supervised learning algorithms can be bounded in terms of the mutual information between their input and the output, given that the loss of any fixed hypothesis has a subgaussian tail. In this work, we generalize this result beyond the standard choice of Shannon's mutual information to measure the dependence between the input and the output. Our main result shows that it is indeed possible to replace the mutual information by any strongly convex function of the joint input-output distribution, with the subgaussianity condition on the losses replaced by a bound on an appropriately chosen norm capturing the geometry of the dependence measure. This allows us to derive a range of generalization bounds that are either entirely new or strengthen previously known ones. Examples include bounds stated in terms of p-norm divergences and the Wasserstein-2 distance, which are respectively applicable for heavy-tailed loss distributions and highly smooth loss functions. Our analysis is entirely based on elementary tools from convex analysis by tracking the growth of a potential function associated with the dependence measure and the loss function.

Feedforward neural networks explicitly describe a series of computations to be done given an input data point b. Implicit networks, on the other hand, describe a condition which must be met, e.g. a fixed point equation depending on b. Recent work has shown implicit networks can match the performance of feedforward networks on common machine learning tasks such as image recognition, while training using a fraction of the memory. In this talk we discuss recent advances in training implicit neural networks, as well as novel applications to problems in which the target output can naturally be thought of as a fixed point (e.g. game theory).

Understanding the properties of neural networks trained via gradient descent is at the heart of the theory of deep learning. In this talk, I will discuss two approaches to study the behavior of gradient descent methods. The first one takes a mean-field view and it relates the dynamics of stochastic gradient descent (SGD) to a certain Wasserstein gradient flow in probability space. I will show how this idea allows to study the connectivity, convergence and implicit bias of the solutions found by SGD. The second approach consists in the analysis of the Neural Tangent Kernel. I will present tight bounds on its smallest eigenvalue and show their implications on memorization and optimization in deep networks. Based on joint work with Adel Javanmard, Vyacheslav Kungurtsev, Andrea Montanari, Guido Montufar, Quynh Nguyen, and Alexander Shevchenko.

Deep reinforcement learning has enabled machines to solve complex control problems directly from high-dimensional camera inputs. However, these systems rely on carefully designed reward functions for specific tasks. On the other hand, humans learn about the world and perform complex behaviors without any external reward signal. We categorize the space of possible objective functions for embodied agents. We show a spectrum that reaches from narrow to general objectives. While the narrow objectives correspond to domain-specific rewards as typically used in reinforcement learning today, the general objectives correspond to information maximization through world models. This explains unsupervised learning, perception, exploration, skill discovery, and control from a single principle. Our findings suggest designing powerful world models as a path toward building highly adaptive agents that seek out large niches in their environments, rendering task rewards optional.

Training of neural networks is hard to describe theoretically due to complicated non-linear dependence of network predictions on parameters. However, the situation greatly simplifies in the limit of infinite network width, where the problem becomes quadratic with the matrix given by Neural Tangent Kernel (NTK). Such problems are more amenable to theoretical analysis and mostly described by spectral properties of linear operator and target function. In the first part of the talk, we will show that in certain scenarios spectrum of the NTK and eigendecomposition of target function are asymptotically described by power laws with simple explicit expression for their exponents. In the second part of the talk we will turn to general quadratic problems with power-law spectrum and give tight bounds for convergence speed of various Gradient Descent algorithms: vanilla Gradient Descent (GD), Heavy Ball (HB) method, GD and HB with predefined schedules, Steepest Descent and Conjugate Gradients. The talk is based on the joint work with Dmitry Yarotsky (arXiv:2105.00507 and arXiv:2202.00992).

We characterize the power-law asymptotics of learning curves for Gaussian process regression (GPR) under the assumption that the eigenspectrum of the prior and the eigenexpansion coefficients of the target function follow a power law. Under similar assumptions, we leverage the equivalence between GPR and kernel ridge regression (KRR) to show the generalization error of KRR. Infinitely wide neural networks can be related to GPR with respect to the neural network GP kernel and the neural tangent kernel, which in several cases is known to have a power-law spectrum. Hence our methods can be applied to study the generalization error of infinitely wide neural networks. We present toy experiments demonstrating the theory. This is a joint work with Pradeep Banerjee and Guido Montufar.

Inspired by the proposal of tangent kernels of neural networks (NNs), a recent research line aims to design kernels with a better generalization performance on standard datasets. Indeed, a few recent works showed that certain kernel machines perform as well as NNs on certain datasets, despite their separations in specific cases implied by theoretical results. Furthermore, it was shown that the induced kernels of convolutional neural networks perform much better than any former handcrafted kernels. These empirical results pose a theoretical challenge to understanding the performance gaps in kernel machines and NNs in different scenarios.
In this talk, we show that data structures play an essential role in inducing these performance gaps. We consider a few natural data structures, and study their effects on the performance of these learning methods. Based on a fine-grained high dimensional asymptotics framework of analyzing random features models and kernel machines, we show the following: 1) If the feature vectors are nearly isotropic, kernel methods suffer from the curse of dimensionality, while NNs can overcome it by learning the best low-dimensional representation; 2) If the feature vectors display the same low-dimensional structure as the target function (the spiked covariates model), this curse of dimensionality becomes milder, and the performance gap between kernel methods and NNs become smaller; 3) On datasets that display some invariance structure (e.g., image dataset), there is a quantitative performance gain of using invariant kernels (e.g., convolutional kernels) over inner product kernels. Beyond explaining the performance gaps, these theoretical results can further provide some intuitions towards designing kernel methods with better performance.

Multi-armed bandit algorithms minimize experimentation costs required to converge on optimal behavior. They do so by rapidly adapting experimentation effort away from poorly performing actions as feedback is observed. But this desirable feature makes them sensitive to confounding, which is the primary concern underlying classical randomized controlled trials. We highlight, for instance, that popular bandit algorithms cannot address the problem of identifying the best action when day-of-week effects may confound inferences. In response, this paper proposes deconfounded Thompson sampling, which makes simple, but critical, modifications to the way Thompson sampling is usually applied. Theoretical guarantees suggest the algorithm strikes a delicate balance between adaptivity and robustness to confounding. It attains asymptotic lower bounds on the number of samples required to confidently identify the best action --- suggesting optimal adaptivity --- but also satisfies strong performance guarantees in the presence of day-of-week effects and delayed observations --- suggesting unusual robustness. These issues are explored through a very general model of contextual bandit experiments. At the core of the paper is a new model of contextual bandit experiments in which issues of delayed learning and distribution shift arise organically.

Deep learning is nowadays used in a wide range of highly complex applications such as natural language processing or even scientific applications. Its first major breakthrough, however, was achieved by shattering the state-of-the-art in image classification. We revisit the problem of classification by deep neural networks and attempt to find an answer to why deep networks are remarkably effective in this regime. We will interpret the learning of classifiers as finding piecewise constant functions from labelled samples. We then precisely link the hardness of the learning problem to the complexity of the regions. Concretely, we will establish fundamental lower bounds on the learnability of certain regions. Finally, we will show that in many cases, these optimal bounds can be achieved by deep-neural-network-based learning.
In quite realistic settings, we will observe that deep neural networks can learn high-dimensional classifiers without a strong dependence of the learning rates on the dimension.

Adaptive gradient methods such as Adam have gained increasing popularity in deep learning optimization. However, it has been observed that compared with (stochastic) gradient descent, Adam can converge to a different solution with a significantly worse test error in many deep learning applications such as image classification, even with a fine-tuned regularization. In this talk, I will discuss some recent results on the convergence and generalization of Adam in training neural networks, and give a theoretical explanation for the difference between Adam and SGD. I will also present a new deep learning optimizer called the partially adaptive momentum estimation method, which achieves faster convergence rates and smaller test errors than Adam and SGD with momentum on various deep learning tasks.
Bio:
Yuan Cao is an assistant professor in the Department of Statistics and Actuarial Science and Department of Mathematics at the University of Hong Kong. Before joining HKU, he was postdoctoral scholar at UCLA working with Professor Quanquan Gu. He received his B.S. from Fudan University and Ph.D. from Princeton University. Yuan’s research interests include the theory of deep learning, non-convex optimization, and high-dimensional statsitcs. He has published research papers in top machine learning journals (ML) and conferences (NeurIPS, ICML, AAAI, IJCAI, etc.), including a spotlight presentation in NeurIPS 2019 and a long talk in ICML 2021.

I would demonstrate that a neural network (NN) learns training data as simple as it can, resembling an implicit Occam's Razor, from the following two viewpoints. First, the NN output often follows a frequency principle, i.e., learning data from low to high frequency. Second, we prove an embedding principle that the loss landscape of a NN "contains" all the critical points of all the narrower NNs. The embedding principle provides a basis for the condensation phenomenon, i.e., the NN weights condense on isolated directions when initialized small, which means the effective NN size is much smaller than its actual size, i.e., a simple representation of the training data.
Bio:
Zhi-Qin John Xu is an associate professor at Shanghai Jiao Tong University (SJTU). Zhi-Qin obtain B.S. in Physics (2012) and a Ph.D. degree in Mathematics (2016) from SJTU. Before joining SJTU, Zhi-Qin worked as a postdoc at NYUAD and Courant Institute from 2016 to 2019.

This talk will discuss certain peculiarities of training neural networks with batch normalization and weight decay, which has become a common practice in recent years. It turns out that their combined use may result in a surprising periodic behavior of optimization dynamics: the training process regularly exhibits destabilizations that, however, do not lead to complete divergence but cause a new period of training. We will delve deeper into the mechanism underlying the discovered periodic behavior, both empirically and theoretically, and analyze the conditions in which it occurs in practice. We will also demonstrate that periodic behavior can be regarded as a generalization of two previously opposing perspectives on training with batch normalization and weight decay, namely the equilibrium presumption and the instability presumption.

We propose a new geometric method for measuring the quality of representations obtained from deep learning. Our approach, called "Random Polytope Descriptor", provides an efficient description of data points based on the construction of random convex polytopes. We demonstrate the use of our technique by qualitatively comparing the behavior of classic and regularized autoencoders. This reveals that applying regularization to autoencoder networks may decrease the out-of-distribution detection performance in latent space. While our technique is similar in spirit to k-means clustering, we achieve significantly better false positive/negative balance in clustering tasks on autoencoded datasets. Joint work with Marek Kaluba and Lukas Ruff.

Deep learning relies on Artificial Neural Networks (ANNs) with deep architectures—machine learning models that have reached unparalleled performance in many domains, such as machine translation, autonomous vehicles, computer vision, text generation, and speech understanding. However, this impressive performance typically requires large datasets and massive ANN models. Gathering the data and training the models—all can take a long time and have prohibitive costs. Significant research efforts are being invested in improving ANN training efficiency, i.e. the time, data, and resources required to train these models. For example, this is done by changing the model (e.g., architecture, numerical precision) or the training algorithm (e.g., parallelization). However, such modifications often cause an unexplained degradation in the generalization performance of the ANN to unseen data. Recent findings suggest that this degradation is caused by changes to the hidden algorithmic bias of the training algorithm and model. This bias determines which solution is selected from all solutions which fit the data. I will discuss my works focused on understanding and controlling such algorithmic bias.
Bio:
Daniel is an associate professor in the Department of Electrical Engineering at the Technion, working in the areas of machine learning and theoretical neuroscience. He did his post-doc (as a Gruss Lipper fellow) working with Prof. Liam Paninski in the Department of Statistics and the Center for Theoretical Neuroscience at Columbia University. He is interested in all aspects of neural networks and deep learning. His recent works focus on quantization, resource efficiency, and implicit bias in neural networks. He is the recipient of the Gruss Lipper fellowship, the Goldberg Award, and Intel's Rising Star Faculty Award.

We contribute to a better understanding of the class of functions that is represented by a neural network with ReLU activations and a given architecture. Using techniques from mixed-integer optimization, polyhedral theory, and tropical geometry, we provide a mathematical counterbalance to the universal approximation theorems which suggest that a single hidden layer is sufficient for learning tasks. In particular, we investigate whether the class of exactly representable functions strictly increases by adding more layers (with no restrictions on size). We also present upper bounds on the sizes of neural networks required for exact function representation. This is joint work with Amitabh Basu, Marco Di Summa, and Martin Skutella.

Owing to the rapid development of big data and computing technology, deep learning method have shown successful results in various application fields. In view of theoretical studies, however, the deep neural networks are still trained by using the error backpropagation method, which has been used for training conventional multilayer perceptron, and its undesirable learning behaviors such as plateau are inherently maintained. Moreover, as the complexity of the model increases, it is likely to that such undesirable phenomena appear in various forms, making it difficult to fully understand their properties. In this talk, I introduce an approach to understand the strange learning behavior of neural networks, focusing on the complex singular structure of the neuromanifold. In order to investigate the effect of singularities on learning dynamics, we define three different types of learning scenarios according to the positional relationship between optimal and singular points. In these scenarios, we trace evolutions of generalization error during learning by using statistical mechanical method, and show that the three learning scenarios have different dynamical properties. Especially, in the near-singular scenario with over-parameterized model, we reveal the quasi-plateau phenomena, which are different type of slow dynamics from the well-known plateau. Through further analysis on average learning equations around the singular points, we show that there are two different types of slow manifolds associated with plateau and quasi-plateau, respectively. Additionally, we also show that the natural gradient learning, which is developed by considering geometrical structure, can alleviate the slow convergence caused by plateaus and quasi-plateaus.

Despite their outstanding performance, deep neural networks (DNNs) are susceptible to adversarial examples, imperceptibly perturbed examples causing mis-classification. Similarly, but less studied, DNNs are fragile in terms of perturbations in their weights. This talk highlights my recent research on both input and weight robustness and investigates how both problems are related. On the subject of adversarial examples, I discuss a confidence-calibrated version of adversarial training that allows to obtain robustness beyond the adversarial perturbations seen during training. Next, regarding weight robustness, I address robustness against random bit errors in the (quantized) weights which plays an important role in improving the energy-efficiency of DNN accelerators. Surprisingly, improved weight robustness can also be beneficial in terms of robustness against adversarial examples. Specifically, weight robustness can be thought of as flatness in the loss landscape with respect to perturbations of the weights. Using an intuitive flatness measure for adversarially trained DNNs, I demonstrate that flatness in the weight loss landscape improves adversarial robustness and helps to avoid robust overfitting.
Bio:
David Stutz is a final-year PhD student at the Max Planck Institute for Informatics supervised by Prof. Bernt Schiele and co-supervised by Prof. Matthias Hein from the University of Tübingen. He obtained his bachelor and master degrees in computer science from RWTH Aachen University. During his studies, he completed an exchange program with the Georgia Institute of Technology as well as several internships at Microsoft, Fyusion and Hyundai MOBIS, among others. He wrote his master thesis at the Max Planck Institute for Intelligent Systems supervised by Prof. Andreas Geiger. His PhD research focuses on obtaining robust deep neural networks, considering adversarial examples, corrupted examples or out-of-distribution examples. In a collaboration with IBM Research, subsequent work improves robustness against bit errors in (quantized) weights to enable energy-efficient and secure accelerators. This work was awarded an outstanding paper award at the CVPR CV-AML Workshop 2021. More recently, during an internship at DeepMind, he used conformal prediction for uncertainty estimation in medical diagnosis. He received several awards and scholarships including the Qualcomm Innovation Fellowship, RWTH Aachen University's Springorum Denkmünze and the STEM Award IT sponsored by ZF Friedrichshafen. His work has been published at top venues in computer vision and machine learning including ICCV, CVPR, IJCV, ICML and MLSys. More information can be found at www.davidstutz.de.

We consider approximate Bayesian model choice for model selection problems that involve models whose Fisher information matrices may fail to be invertible along other competing submodels. Such singular models do not obey the regularity conditions underlying the derivation of Schwarz's Bayesian information criterion BIC and the penalty structure in BIC generally does not reflect the frequentist large sample behaviour of the marginal likelihood. Although large sample theory for the marginal likelihood of singular models has been developed recently, the resulting approximations depend on the true parameter value and lead to a paradox of circular reasoning. Guided by examples such as determining the number of components in mixture models, the number of factors in latent factor models or the rank in reduced rank regression, we propose a resolution to this paradox and give a practical extension of BIC for singular model selection problems.

Optimization of many deep learning hyperparameters can be formulated as a bilevel optimization problem. While most black-box and gradient-based approaches require many independent training runs, we aim to adapt hyperparameters online as the network trains. The main challenge is to approximate the response Jacobian, which captures how the minimum of the inner objective changes as the hyperparameters are perturbed. To do this, we introduce the self-tuning network (STN), which fits a hypernetwork to approximate the best response function in the vicinity of the current hyperparameters. Differentiating through the hypernetwork lets us efficiently approximate the gradient of the validation loss with respect to the hyperparameters. We train the hypernetwork and hyperparameters jointly. Empirically, we can find hyperparameter settings competitive with Bayesian Optimization in a single run of training, and in some cases find hyperparameter schedules that outperform any fixed hyperparameter value.

One fundamental problem in deep learning is understanding the excellent performance of deep Neural Networks (NNs) in practice. An explanation for the superiority of NNs is that they can realize a large family of complicated functions, i.e., they have powerful expressivity. The expressivity of a Neural Network with Piecewise Linear activations (PLNN) can be quantified by the maximal number of linear regions it can separate its input space into. In this talk, we provide several mathematical results needed for studying the linear regions of Piecewise Linear Convolutional Neural Networks (PLCNNs), and use them to derive the maximal and average numbers of linear regions for one-layer PLCNNs. Furthermore, we obtain upper and lower bounds for the number of linear regions of multi-layer PLCNNs. Rectified Linear Unit (ReLU) is a piecewise linear activation function that has been widely adopted in various architectures. Our results suggest that deeper ReLU CNNs have more powerful expressivity than their shallow counterparts, while ReLU CNNs have more expressivity than fully-connected ReLU NNs per parameter, in terms of the number of linear regions.
Bio:
Dr. Huan Xiong is an Assistant Professor at the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI). He received the B.S. and M.S. degrees from the School of Mathematical Sciences, Peking University, China, in 2010 and 2013, respectively, and the Ph.D. degree from the University of Zurich, Switzerland, in 2016. He was a postdoctoral researcher at the University of Strasbourg, France. His research interests include machine learning and discrete mathematics. He has published over 30 papers in top conferences/journals such as ICML, NeurIPS, CVPR and TPAMI.

The posterior over Bayesian neural network (BNN) parameters is extremely high-dimensional and non-convex. For computational reasons, researchers approximate this posterior using inexpensive mini-batch methods such as mean-field variational inference or stochastic-gradient Markov chain Monte Carlo (SGMCMC). To investigate foundational questions in Bayesian deep learning, we instead use full-batch Hamiltonian Monte Carlo (HMC) on modern architectures. We show that (1) BNNs can achieve significant performance gains over standard training and deep ensembles; (2) a single long HMC chain can provide a comparable representation of the posterior to multiple shorter chains; (3) in contrast to recent studies, we find posterior tempering is not needed for near-optimal performance, with little evidence for a "cold posterior" effect, which we show is largely an artifact of data augmentation; (4) BMA performance is robust to the choice of prior scale, and relatively similar for diagonal Gaussian, mixture of Gaussian, and logistic priors; (5) Bayesian neural networks show surprisingly poor generalization under domain shift; we demonstrate, explain and provide remedies for this effect; (6) while cheaper alternatives such as deep ensembles and SGMCMC methods can provide good generalization, they provide distinct predictive distributions from HMC. Notably, deep ensemble predictive distributions are similarly close to HMC as standard SGLD, and closer than standard variational inference.
References
I will talk about two of our recent papers:
[1] What Are Bayesian Neural Network Posteriors Really Like? Pavel Izmailov, Sharad Vikram, Matthew D. Hoffman, Andrew Gordon Wilson
[2] Dangers of Bayesian Model Averaging under Covariate Shift: Pavel Izmailov, Patrick Nicholson, Sanae Lotfi, Andrew Gordon Wilson
Other useful references:
[3] Bayesian Deep Learning and a Probabilistic Perspective of Generalization: Andrew Gordon Wilson, Pavel Izmailov
[4] How Good is the Bayes Posterior in Deep Neural Networks Really? Florian Wenzel, Kevin Roth, Bastiaan S. Veeling, Jakub Świątkowski, Linh Tran, Stephan Mandt, Jasper Snoek, Tim Salimans, Rodolphe Jenatton, Sebastian Nowozin
[5] Bayesian Neural Network Priors Revisited: Vincent Fortuin, Adrià Garriga-Alonso, Florian Wenzel, Gunnar Rätsch, Richard Turner, Mark van der Wilk, Laurence Aitchison

Relative information (relative entropy, KL divergence) and variational inference are powerful tools for deriving learning algorithms and their asymptotic properties, for both static systems and dynamic systems. The goal of this talk is to motivate a general online stochastic learning algorithm for stochastic processes with latent variables or memory, that provably converges under some regularity conditions. Please visit https://bit.ly/3kmovql for details.
In the first half of the talk, we study static systems, viewing maximum likelihood and Bayesian inference through the lens of relative information. In particular, their generalization errors may be derived by resolving the singularities of relative information. We then frame the two learning algorithms as special cases of variational inference with different computational constraints.
In the second half of the talk, we study dynamic systems, extending this variational inference method and computational perspective to stochastic processes and online learning. In particular, the training objective function will be a form of relative information which can be optimized iteratively in a way similar to expectation-maximization. The relative information objective provides a precise way to discuss the trade-off between exploration and exploitation during training.

In this talk, I will present some of my work on the functional space associated with neural networks. I will focus on simple classes of networks, including feedforward networks with linear and polynomial activations and two-layer ReLU networks, that provide a tractable setting where many geometric properties of general networks can be studied in detail. In particular, I will emphasize the distinction between the intrinsic function space and its parameterization, in order to shed light on the impact of the architecture on the expressivity of a model and on the corresponding optimization landscapes.
Work done outside Amazon.

Often, the training of many machine learning models can be formulated as a single-objective optimization (minimization) problem, which can be efficiently solved by gradient-based optimization methods. However, there is a growing number of models that involve multiple interacting objectives. Differentiable game is a generalization of the standard single-objective optimization framework, allowing us to model multiple players and objectives. However, new issues and challenges arise in solving differentiable games. Standard gradient descent-ascent algorithm could either converge to non-equilibrium fixed points or converge with a slow rate (if it does converge). In this talk, I will present some of my work attacking both problems. Besides, I will introduce a unified and systematic framework for global convergence analysis of first-order methods in solving strongly-monotone games.

There has been enormous progress in the last few years in designing conceivable (though not always practical) neural networks that respect the gauge symmetries - or coordinate freedom - of physical law. Some of these frameworks make use of irreducible representations, some make use of higher order tensor objects, and some apply symmetry-enforcing constraints. Different physical laws obey different combinations of fundamental symmetries, but a large fraction (possibly all) of classical physics is equivariant to translation, rotation, reflection (parity), boost (relativity), and permutations. Here we show that it is simple to parameterize universally approximating polynomial functions that are equivariant under these symmetries, or under the Euclidean, Lorentz, and Poincaré groups, at any dimensionality d. The key observation is that nonlinear O(d)-equivariant (and related-group-equivariant) functions can be expressed in terms of a lightweight collection of scalars -- scalar products and scalar contractions of the scalar, vector, and tensor inputs. These results demonstrate theoretically that gauge-invariant deep learning models for classical physics with good scaling for large problems are feasible right now.

Knowledge distillation introduced in the deep learning context is a method to transfer knowledge from one architecture to another. In particular, when the architectures are identical, this is called self-distillation. The idea is to feed in predictions of the trained model as new target values for retraining (and iterate this loop possibly a few times). It has been empirically observed that the self-distilled model often achieves higher accuracy on held out data. Why this happens, however, has been a mystery: the self-distillation dynamics does not receive any new information about the task and solely evolves by looping over training. This talk will provide a rigorous theoretical analysis of self-distillation. We focus on fitting a nonlinear function to training data, where the model space is Hilbert space and fitting is subject to L2 regularization in this function space. We show that self-distillation iterations modify regularization by progressively limiting the number of basis functions that can be used to represent the solution. This implies (as we also verify empirically) that while a few rounds of self-distillation may reduce over-fitting, further rounds may lead to under-fitting and thus worse performance.
This is joint work with Mehrdad Farajtabar (DeepMind) and Peter Bartlett (UC Berkeley).
Bio:
Hossein Mobahi is a research scientist at Google Research. His current interests relate to the theory of deep learning. Prior to joining Google in 2016, he was a postdoctoral researcher in CSAIL at MIT. He obtained his PhD in Computer Science from the University of Illinois at Urbana-Champaign (UIUC). He has co-organized the ICML 2018 Workshop “Modern Trends in Nonconvex Optimization for Machine Learning”, ICML 2019 Workshop “Understanding and Improving Generalization in Deep Learning”, and NeurIPS 2020 Competition “Predicting Generalization in Deep Learning”.

Optimization of many deep learning hyperparameters can be formulated as a bilevel optimization problem. While most black-box and gradient-based approaches require many independent training runs, we aim to adapt hyperparameters online as the network trains. The main challenge is to approximate the response Jacobian, which captures how the minimum of the inner objective changes as the hyperparameters are perturbed. To do this, we introduce the self-tuning network (STN), which fits a hypernetwork to approximate the best response function in the vicinity of the current hyperparameters. Differentiating through the hypernetwork lets us efficiently approximate the gradient of the validation loss with respect to the hyperparameters. We train the hypernetwork and hyperparameters jointly. Empirically, we can find hyperparameter settings competitive with Bayesian Optimization in a single run of training, and in some cases find hyperparameter schedules that outperform any fixed hyperparameter value.

It is well-known that learning statistical associations from finite data requires regularization to avoid overfitting. In other words, regularization terms penalize too complex functions to lower the risk that the functions capture random noise in the data. However, in the limit of infinite sample size, one can still learn arbitrarily complex statistical relations. I argue that regularization is even recommended in the population limit if one is interested in a causal model rather than a statistical model. This is because regularization can also mitigate bias from hidden common causes. This can be seen for a simple linear and non-linear regression task, where I show a very explicit formal analogy between finite sample and confounding bias. My theoretical results suggest that learning causal relations in the presence of hidden common causes should use particularly simple models.
Paper: D. Janzing: Causal regularization, NeurIPS 2019.

Wasserstein distances has recently seen a surge of applications in statistics and machine learning. This stems from many advantageous properties they possess, such as metric and topological structure of Wasserstein spaces, robustness to support mismatch, compatibility to gradient-based optimization, and rich geometric properties. In practice, we rarely have access to the actual distribution and only get data from it, which necessitates estimating the distance from samples. A central issue is that such estimators suffer from the curse of dimensionality: their empirical convergence rate scales as n^{-1/d} for d-dimensional distributions. This rate deteriorates exponentially fast with dimension, making it impossible to obtain meaningful accuracy guarantees, especially given the dimensionality of real-world data. This talk will present a novel framework of smooth p-Wasserstein distances, that inherits the properties of their classic counterparts while alleviating the empirical curse of dimensionality. Specifically, we will show that the empirical approximation error of the smooth distance decays as n^{-1/2}, in all dimensions. For the special case of the smooth 1-Wasserstein distance, we will also derive a high-dimensional limit distribution, further highlighting the favorable statistical behavior of the smooth framework. The derivation of statistical efficiency for the general pth order distance employs a new comparison result to the smooth dual Sobolev norm. Applications to implicit generative modeling will be considered, leveraging the parametric empirical convergence rates to establish n^{-1/2} generalization bounds in any dimension. Bio: Ziv Goldfeld is an assistant professor in the School of Electrical and Computer Engineering, and a graduate field member in Computer Science and the Center of Applied Mathematics, at Cornell University. Before joining Cornell, he was a postdoctoral research fellow in LIDS at MIT, hosted by Yury Polyanskiy. Ziv graduated with a B.Sc., M.Sc., and Ph.D. (all summa cum laude) in Electrical and Computer Engineering from Ben Gurion University, Israel. His graduate advisor was Haim Permuter. Ziv's research interests include optimal transport theory, statistical learning theory, information theory, and high-dimensional and nonparametric statistics. He seeks to understand and design inference and information processing systems by formulating and solving mathematical models. Honors include the NSF CAREER Award, NSF CRII Award, the IBM University Award, and the Rothschild Postdoctoral Fellowship.

In this talk I will introduce the concept of algorithmic recourse, which aims to help individuals affected by an unfavorable algorithmic decision to recover from it. First, I will show that while the concept of algorithmic recourse is strongly related to counterfactual explanations, existing methods for the later do not directly provide practical solutions for algorithmic recourse, as they do not account for the causal mechanisms covering the world. Then, I will show theoretical results that prove the need of complete causal knowledge to guarantee recourse and show how algorithmic recourse can be useful to provide novel fairness definitions that short the focus from the algorithm to the data distribution. Such novel definition of fairness allows us to distinguish between situations where unfairness can be better addressed by societal intervention, as opposed to changes on the classifiers. Finally, I will show practical solutions for (fairness in) algorithmic recourse, in realistic scenarios where the causal knowledge is only limited.

The early phase of training of deep neural networks has a dramatic and counterintuitive effect on the local curvature of the loss function. For instance, we found that using a small learning rate does not guarantee stable optimization because the optimization trajectory has a tendency to steer towards regions of the loss surface with increasing local curvature. It is equally surprising that using a small learning rate impacts negatively generalization. I will discuss our journey in understanding these and other phenomena. The focus of the talk will be our mechanistic explanation for how using a large learning rate impacts generalization, which we corroborate by developing an explicit regularizer that reproduces its implicit regularization effects [1,2].
[1] The Break-Even Point on Optimization Trajectories of Deep Neural Networks, Jastrzebski et al, ICLR 2020
[2] Catastrophic Fisher Explosion: Early Phase Fisher Matrix Impacts Generalization, Jastrzebski et al, ICML 2021

Statistical inverse problems lead to complex optimisation and/or Monte Carlo sampling problems. Gradient descent and Langevin samplers provide examples of widely used algorithms. In my talk, I will discuss recent results on sampling algorithms, which can be viewed as interacting particle systems, and their mean-field limits. I will highlight the geometric structure of these mean-field equations within the, so called, Otto calculus, that is, a gradient flow structure in the space of probability measures. Affine invariance is an important outcome of recent work on the subject, a property shared by Newton’s method but not by gradient descent or ordinary Langevin samplers. The emerging affine invariant gradient flow structures allow us to discuss coupling-based Bayesian inference methods, such as the ensemble Kalman filter, as well as invariance-of-measure-based inference methods, such as preconditioned Langevin dynamics, within a common mathematical framework. Applications include nonlinear and logistic regression.

A popular approach to learning encoders for lossy compression is to use additive uniform noise during training as a differentiable approximation to test-time quantization. We demonstrate that a uniform