Two-timescale learning algorithms are often applied in game theory and bi-level optimization, using distinct update rates for two interdependent processes. In this talk, I will discuss the convergence of two types of two-timescale algorithms.
The first type is the unified two-timescale Q-learning algorithm of Angiuli et al., which solves both mean field game (MFG) and mean field control (MFC) problems by adjusting the ratio of the learning rates used for the mean field distribution and the Q-function. We provide a theoretical explanation for the algorithm's bifurcated outcomes under fixed learning rates, constructing a Lyapunov function that jointly tracks the mean field distribution and Q-function iterates. This function ensures unified convergence across all learning rates under mild assumptions.
The second type is the two-timescale gradient descent-ascent (GDA) algorithm, designed to find Nash equilibria in min-max games with improved convergence properties. Using a PDE-inspired approach, we analyze convergence in both finite- and infinite-dimensional cases. For finite-dimensional quadratic min-max games, we examine long-time convergence in near quasi-static regimes through hypocoercivity. For mean-field GDA dynamics, we study convergence under finite-scale ratios using a mixed synchronous-reflection coupling technique.
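As a minimal, hedged illustration of the two-timescale idea (not the algorithms analyzed in the talk), the sketch below runs gradient descent-ascent on a toy quadratic min-max game with a faster ascent step size; all constants are chosen arbitrarily for demonstration.

```python
import numpy as np

# Toy min-max game f(x, y) = 0.5*a*x^2 + b*x*y - 0.5*c*y^2 with a, c > 0,
# so (0, 0) is the unique Nash equilibrium. Illustrative only.
a, b, c = 1.0, 2.0, 1.0
x, y = 1.0, 1.0
eta_x, eta_y = 0.01, 0.1   # two timescales: the ascent player moves faster

for t in range(5000):
    gx = a * x + b * y     # df/dx
    gy = b * x - c * y     # df/dy
    x -= eta_x * gx        # slow descent on x
    y += eta_y * gy        # fast ascent on y

print(x, y)                # both iterates approach 0
```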
In this talk, we introduce a novel capacity measure 2sED for statistical models based on the effective dimension. The new quantity provably bounds the generalization error under mild assumptions on the model. Furthermore, simulations on standard data sets and popular model architectures show that 2sED correlates well with the training error. For Markovian models, we show how to efficiently approximate 2sED from below through a layerwise iterative approach, which allows us to tackle deep learning models with a large number of parameters. Simulation results suggest that the approximation is good for different prominent models and data sets.
The characterization of the structure of the manifold of low-lying energy states in neural networks is among the most fundamental theoretical questions in machine learning. In recent years, many empirical studies of the landscape of neural networks and constraint satisfaction problems (CSPs) have shown that the low-lying configurations are often found in complex connected structures, where zero-energy paths between pairs of distant solutions can be constructed.
In this talk, I will discuss the geometrical organization of solutions in two prototypical problems, the "binary" and the "negative" perceptron. I will show that wide flat minima arise as complex extensive structures from the coalescence of minima around "high-margin" (i.e., locally robust) configurations. In the binary perceptron, in turn, the constraint density at which a hard phase for algorithms emerges can be predicted by the disappearance of the widest and flattest minima.
Finally, I will introduce a novel analytical method for characterizing the typical energy barriers between two configurations sampled from the zero-temperature Gibbs measure of the problem. This approach will provide a detailed characterization of the shape of the solution space of the negative perceptron and one-hidden layer neural networks in the overparameterized regime.
The dependence structure of high-dimensional distributions is often modeled by graphical models. The problem of learning the graph underlying such distributions has received a lot of attention in statistics and machine learning. In problems of very high dimension, it is often too costly even to store the sample covariance matrix. We propose a model in which one can query single entries of the covariance matrix. We construct efficient algorithms for structure recovery in Gaussian graphical models with query complexity that is quasi-linear in the dimension. We present algorithms that work for trees and, more generally, for graphs of small treewidth. We also discuss hypothesis testing of properties of the underlying graph. The talk is based on joint work with Luc Devroye, Jakub Truszkowski, Vasiliki Velona, and Piotr Zwiernik.
As synthetic data becomes higher quality and proliferates on the internet, machine learning models are increasingly trained on a mix of human- and machine-generated data. Despite the success stories of using synthetic data for representation learning, using synthetic data for generative model training creates "self-consuming loops" which may lead to training instability or even collapse, unless certain conditions are met. We provide an overview of research in this new and growing field, which started in July 2023 when a group of researchers showed that "Self-Consuming Generative Models Go MAD." Since then, many papers have studied this self-consuming loop problem, but very few have tried to solve it. We'll discuss many of these papers, including our recent work, "Self-Correcting Self-Consuming Loops for Generative Model Training" (ICML 2024). Our paper aims to stabilize self-consuming generative model training by introducing a "correction" function, which ensures that the synthetic training data is of higher quality. We empirically validate the effectiveness of self-correcting self-consuming loops on the challenging human motion synthesis task, and observe that it successfully avoids model collapse, even when the ratio of synthetic data to real data is as high as 100%.
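To make the failure mode concrete, here is a hedged toy sketch (not the paper's method): a Gaussian "generative model" repeatedly refit on its own small synthetic samples drifts over generations, while a hypothetical moment-matching correction step keeps it anchored.

```python
import numpy as np

rng = np.random.default_rng(0)
TRUE_MU, TRUE_SIGMA = 0.0, 1.0

def fit_gaussian(samples):
    # Toy "generative model": maximum-likelihood Gaussian fit.
    return samples.mean(), samples.std()

def correct(samples):
    # Hypothetical correction step: re-standardize synthetic samples to match the
    # moments of a small trusted reference set (a stand-in for the task-specific
    # correction function used in the paper).
    reference = rng.normal(TRUE_MU, TRUE_SIGMA, size=200)
    z = (samples - samples.mean()) / (samples.std() + 1e-12)
    return reference.mean() + reference.std() * z

mu, sigma = TRUE_MU, TRUE_SIGMA
for generation in range(50):
    synthetic = rng.normal(mu, sigma, size=50)  # small samples: estimation error compounds
    synthetic = correct(synthetic)              # comment this out to watch sigma drift
    mu, sigma = fit_gaussian(synthetic)

print(f"after 50 generations: mu={mu:+.3f}, sigma={sigma:.3f}")
```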
As Generative AI technologies are revolutionizing industries and our daily lives, what is going to happen to the role of the mathematician? In this talk, I will highlight recent breakthroughs in deep learning and AI and explore how current and future advancements might alter the way we do mathematics.
We highlight a perhaps important but hitherto unobserved insight: the attention module in a ReLU-transformer is a cubic spline. Viewed in this manner, this mysterious but critical component of a transformer becomes a natural development of an old notion deeply entrenched in classical approximation theory. Conversely, if we assume the Pierce–Birkhoff conjecture, then every spline is also an encoder. This gives a satisfying answer to the question of the mathematical structure of ReLU-transformers.
In this talk, we will see how concentration inequalities for the sample mean, like those due to Bernstein and Hoeffding, are valid for any sample size but overly conservative, yielding unnecessarily wide confidence intervals. Motivated by applications to reinforcement learning, we develop new results on transport distances and obtain new computable concentration inequalities with asymptotically optimal size, finite-sample validity, and sub-Gaussian decay. Beyond empirical averages, we will tie this to the concept of Gaussian universality, which has recently been observed to hold for a large class of estimators, and we will precisely determine the class of functions, and hence estimators, for which it holds. Beyond the independent setting, we will see how it holds for a class of dependent processes and how this can give us insight into the behavior of data augmentation. Time permitting, we will finish the talk by investigating what happens with structured data in the context of graph neural networks.
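For reference, the classical Hoeffding bound that the talk takes as its starting point reads as follows: for i.i.d. $X_1,\dots,X_n \in [a,b]$ with mean $\mu$ and empirical mean $\bar X_n$,
$$ \Pr\big(|\bar X_n - \mu| \ge t\big) \;\le\; 2\exp\!\left(-\frac{2nt^2}{(b-a)^2}\right), $$
so the resulting $1-\delta$ confidence interval has half-width $(b-a)\sqrt{\log(2/\delta)/(2n)}$, valid for every $n$ but blind to the (possibly much smaller) variance, which is exactly the conservativeness the new bounds aim to remove.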
Despite the great practical success of deep learning, our theoretical understanding of the generalization of neural networks is still limited. Theoretical analysis can be facilitated by studying "linearized" neural networks, for example through neural tangent kernel theory. While this allows us to leverage some classical theory for regularized kernel methods, only more recent theoretical results have made it possible to study phenomena such as benign overfitting and double descent. In this talk, I will give an overview of my research on the generalization of neural network regression as well as some related works.
Scaling of machine learning models has significantly improved model capabilities, especially for self-supervised tasks. To scale efficiently, one must harmonize modeling efforts with good utilization of the computational resources at hand. In this talk, I will review a selection of important LLM architectural choices from the perspective of efficiency. I will first walk through the high-level mechanics of GPU computation, along with its capabilities and constraints. I will then dive deeper into a few aspects of the hardware and identify basic principles that guide hardware efficiency in the field. From there, I will motivate recent innovations in LLM execution and architecture through the lens of these hardware principles, considering both LLM training and inference efficiency.
We investigate the complexity of deep neural networks through the lens of functional equivalence, which posits that different parameterizations can yield the same network function. Leveraging the equivalence property, we present a novel bound on the covering number for deep neural networks, which reveals that the complexity of neural networks can be reduced. Additionally, we demonstrate that functional equivalence benefits optimization, as overparameterized networks tend to be easier to train since increasing network width leads to a diminishing volume of the effective parameter space. These findings can offer valuable insights into the phenomenon of overparameterization and have implications for understanding generalization and optimization in deep learning.
Given a sequence of tokens, such as words, the task of next-token prediction is to predict the next-token conditional probability distribution. Decoder-only transformers have become effective models for this task, but their properties are still not fully understood. In particular, the largest number of distinct context sequences that a decoder-only transformer can interpolate next-token distributions for has not been established. To fill this gap, we prove upper and lower bounds on this number, which are equal up to a multiplicative constant. We prove these bounds in the general setting where next-token distributions can be arbitrary as well as the empirical setting where they are calculated from a finite number of document sequences. Our lower bounds are for one-layer transformers and our proofs highlight an important injectivity property satisfied by self-attention.
Mathematical reasoning is a pivotal component of human intelligence, crucial for advancing education and science. This talk delves into the development of language model systems capable of robust mathematical reasoning, marking a significant step toward realizing general artificial intelligence. We introduce multi-modal and knowledge-intensive benchmarks to assess the reasoning capabilities of large language models (LLMs) and vision-language models (VLMs) across real-world contexts, including visual information, tabular data, and scientific domains. This talk advances the field by presenting innovative retrieval and tool-augmented algorithms that enhance LLM capabilities. It concludes by analyzing the latest advances in mathematical reasoning within visual contexts, and highlighting the current challenges and future prospects.
In this talk, I will present a novel view of the problem of finding discriminant loci of parametric systems of polynomial equations as a machine learning problem, i.e., determining classification boundaries over the parameter space, with the classes being the number of real solutions of the system. The number of real solutions at various parameter points is obtained using the numerical polynomial homotopy continuation method. Machine learning techniques, such as deep learning and nearest neighbor, are then employed to efficiently approximate the boundaries, which in turn provides a decomposition of the parameter space in terms of the number of real solutions. The proposed method is shown to approximate well the complicated discriminant loci of various examples, such as the Kuramoto model.
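As a hedged, minimal illustration of the recipe (label sampled parameter points by their number of real solutions, then learn the boundary), the sketch below uses the univariate quadratic, whose discriminant locus is known exactly; the talk's examples use homotopy continuation and far larger systems such as the Kuramoto model.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# For x^2 + b*x + c = 0, the number of real solutions (0 or 2, generically) is
# decided by the sign of the discriminant b^2 - 4c. Label random parameter
# points by their root count and let a classifier approximate the boundary.
rng = np.random.default_rng(0)
params = rng.uniform(-2, 2, size=(2000, 2))                          # (b, c) samples
labels = (params[:, 0] ** 2 - 4 * params[:, 1] > 0).astype(int) * 2  # 0 or 2 real roots

clf = KNeighborsClassifier(n_neighbors=5).fit(params, labels)
test = np.array([[0.0, -1.0], [0.0, 1.0]])                           # discriminants 4 and -4
print(clf.predict(test))                                             # expected: [2 0]
```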
Online and bandit algorithms are powerful theoretical tools in machine learning, designed for scenarios requiring sequential decision-making with limited information. In this talk, I will explore two practical applications of these algorithms. The first application is online distribution shift adaptation, where the test data distribution changes over time. By making certain assumptions about the distribution shift, we can frame this problem within an online learning context and solve it using online algorithms. The second application involves identifying the best Large Language Model (LLM)-based method in resource-constrained settings. Due to the significant costs of time or money associated with querying LLMs, it's crucial to minimize the number of queries when identifying the best LLM-based method. We demonstrate that using multi-armed bandit algorithms combined with low-rank factorization allows us to efficiently identify the best method with far fewer queries than evaluating all candidate methods on all test data.
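Here is a minimal sketch of the bandit component alone (the low-rank factorization over a method-by-example matrix is omitted, and the per-method accuracies below are made up): each arm is a candidate LLM-based method and each pull spends one costly evaluation.

```python
import numpy as np

rng = np.random.default_rng(1)
true_acc = np.array([0.55, 0.70, 0.60, 0.78])    # hypothetical per-method accuracies
K, budget = len(true_acc), 400                   # query budget << evaluating everything
counts, sums = np.zeros(K), np.zeros(K)

for t in range(budget):
    if t < K:
        arm = t                                  # pull each arm once to initialize
    else:
        ucb = sums / counts + np.sqrt(2 * np.log(t + 1) / counts)
        arm = int(np.argmax(ucb))                # UCB1 exploration bonus
    reward = float(rng.random() < true_acc[arm]) # 1 if the method answers this query correctly
    counts[arm] += 1
    sums[arm] += reward

print("estimated best method:", int(np.argmax(sums / counts)))
```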
Linear networks are artificial neural networks that use linear activation functions. Despite their simplicity, these networks serve as a stepping stone for understanding more complex architectures. In this talk, we focus on linear convolutional networks with arbitrary strides. The neuromanifold of such a network is a semialgebraic set, represented by a space of polynomials that admit specific factorizations. We introduce a recursive algorithm to derive polynomial equations whose common zeros define the Zariski closure of the neuromanifold. Additionally, we examine the algebraic complexity involved in training these networks using techniques from metric algebraic geometry. We show that the total number of complex critical points in optimizing these networks corresponds to the generic Euclidean distance degree of a Segre variety. This number is notably higher than the number of critical points found when training a fully connected linear network with the same number of parameters.
Activation functions play a pivotal role in deep neural networks, enabling them to tackle complex tasks like image recognition. However, activation functions also introduce significant challenges for deep learning theory, network dynamics analysis, and properties such as interpretability and privacy preservation. In this talk, we revisit the necessity of activation functions across various scenarios. Specifically, we explore expressing network outputs through high-degree interactions among input elements using multilinear algebra. This approach allows us to attain the necessary expressivity via these intricate interactions. Yet, the question remains: Is this expressivity alone sufficient for effective learning? Our recent research, presented at ICLR’24, unveils meticulously designed networks employing multilinear operations. Remarkably, these networks achieve strong performance even in demanding tasks, such as ImageNet image recognition.
Language models that have been trained to predict the next word over billions of text documents have been shown to also significantly predict brain recordings of people comprehending language. Understanding the reasons behind the observed similarities between language in machines and language in the brain can lead to more insight into both systems. Additionally, the human language system integrates information from multiple sensory modalities which puts text-only language models at a fundamental disadvantage as cognitive models. In this talk, we will discuss a series of recent works that make progress towards these questions along different dimensions. The unifying principle among these works that allows us to make scientific claims about why one black box (language model) aligns with another black box (the human brain) is our ability to make specific perturbations in the language model and observe their effect on the alignment with the brain. Building on this approach, these works reveal that the observed alignment is due to more than next-word prediction and word-level semantics and is partially related to joint processing of select linguistic information in both systems. Furthermore, we find that the brain alignment can be improved by training a language model to summarize narratives, and to incorporate auditory and visual information of a scene. Taken together, these works make progress towards determining the sufficient and necessary conditions under which language in machines aligns with language in the brain.
Bio: Mariya Toneva is a tenure-track faculty at the Max Planck Institute for Software Systems, where she leads the BrAIN (Bridging AI and Neuroscience) group. Her research is at the intersection of Machine Learning, Natural Language Processing, and Neuroscience, with a focus on building computational models of language processing in the brain that can also improve natural language processing systems. Prior to MPI-SWS, she was a C.V. Starr Fellow at the Princeton Neuroscience Institute and obtained her Ph.D. from Carnegie Mellon University in a joint program between Machine Learning and Neural Computation.
A mystery of modern neural networks is their surprising generalization power in the overparametrized regime: they comprise so many parameters that they can interpolate the training set, even if the actual labels are replaced by purely random ones; despite this, they achieve good prediction error on unseen data.
In this talk, we focus on the neural tangent (NT) model for two-layer neural networks, a simplified model. For isotropic input data, we first show that the interpolation phase transition occurs around $Nd \sim n$, where $Nd$ is the number of parameters and $n$ is the sample size.
To demystify the generalization puzzle, we consider the min-norm interpolator and show that its test (generalization) error is largely determined by a smooth, low-degree component. Moreover, we find that the nonlinearity of the activation function has an implicit regularization effect. These results offer new insights into recent discoveries in overparametrized models, such as the double descent phenomenon.
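For orientation, and assuming the setup of the linked paper, the NT model fixes random first-layer weights $w_1,\dots,w_N \in \mathbb{R}^d$ and trains only the linearized second layer,
$$ f_{\mathrm{NT}}(x; a) \;=\; \sum_{i=1}^{N} \langle a_i, x\rangle\, \sigma'(\langle w_i, x\rangle), \qquad a_i \in \mathbb{R}^d, $$
so the number of trainable parameters is $Nd$, which is why the interpolation threshold is stated as $Nd \sim n$.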
Link to the paper: https://arxiv.org/abs/2007.12826
Bio: Yiqiao Zhong is currently an assistant professor at the University of Wisconsin—Madison, Department of Statistics. Prior to joining UW Madison, Yiqiao was a postdoc at Stanford University, advised by Prof. Andrea Montanari and Prof. David Donoho. His research interest includes analysis of Large Language Models, deep learning theory, and high-dimensional statistics. Yiqiao Zhong obtained his Ph.D. in 2019 from Princeton University, where he was advised by Prof. Jianqing Fan.
Understanding the geometric properties of gradient descent dynamics is a key ingredient in deciphering the recent success of very large machine learning models. A striking observation is that trained over-parameterized models retain some properties of their initialization. This "implicit bias" is believed to be responsible for some favorable properties of the trained models and could explain their good generalization properties. In this work, we present the definitions and properties of "conservation laws", which define quantities conserved during gradient flows of a given machine learning model, such as a ReLU network, with any training data and any loss. After explaining how to find the maximal number of independent conservation laws via Lie algebra computations, we provide algorithms to compute a family of polynomial laws, as well as to compute the number of (not necessarily polynomial) conservation laws. We show that for a number of architectures there are no laws beyond the known ones, and we identify new laws for certain flows with momentum and/or non-Euclidean geometries.
Joint work with Sibylle Marcotte and Gabriel Peyré.
There are several competing strategies for achieving equivariance in neural networks. In this talk, we are going to take a theoretical closer look at two of them: data augmentation and architecture restriction. Our main question is when the two strategies are equivalent. The analysis will reveal that a geometrical relation between the network architecture and the space of equivariant linear layers will imply a weak equivalence of the two strategies. This talk is based on joint work with Fredrik Ohlsson and Oskar Nordenfors.
Training neural networks with first order optimisation methods is at the core of the empirical success of deep learning. The scale of initialisation is a crucial factor, as small initialisations are generally associated with a feature learning regime, for which gradient descent is implicitly biased towards simple solutions. This work provides a general and quantitative description of the early alignment phase, originally introduced by Maennel et al. (2018). For small initialisations and one-hidden-layer ReLU networks, the early stage of the training dynamics leads to an alignment of the neurons towards key directions. This alignment induces a sparse representation of the network, which is directly related to the implicit bias of gradient flow at convergence. This sparsity-inducing alignment, however, comes at the expense of difficulties in minimising the training objective: we also provide a simple data example for which overparameterised networks fail to converge towards global minima and only converge to a spurious stationary point instead.
A core challenge in mathematical data science is to understand and leverage intrinsic structures of sets. With the rise of deep learning, however, the focus has been shifting more and more from explicit structural regularization in inverse problems and related fields to implicit regularization in massively overparametrized machine learning models. In this talk, I will present theoretical results on the implicit bias of gradient descent in overparametrized linear regression and matrix factorization. I will particularly discuss the role of reparametrization in the presented theory.
The variational lower bound (a.k.a. ELBO or free energy) is the central objective for many established as well as many novel algorithms for unsupervised learning. Different reformulations and generalizations of the ELBO have been used for different models, including standard variational autoencoders (VAEs), beta-VAEs, diffusion models and many other probabilistic data models. Here we study a novel approach that allows the ELBO to be rewritten as a sum of entropies. For common probabilistic graphical models, we first show that the ELBO converges to an expression summing (positive and negative) entropy terms. The result applies under mild conditions, e.g., the distributions defining the graphical model have to be in the exponential family. Secondly, we discuss entropy sums as a tool for the analysis of ELBO-based learning, and we will show how properties of entropies can be exploited for concrete models such as VAEs. Finally, we will present recent results showing how convergence to entropy sums can be used to transform ELBOs into entropy-sum objectives, i.e., we will present example models for which entropy sums can directly serve as learning objectives. Numerical results of entropy-based analysis as well as entropy-based learning will support the presented theoretical results. The talk will close with a discussion of related research in my lab and of future work.
Joint work with: Simon Damm, Dmytro Velychko, Jan Warnken, Zhenwen Dai, Dennis Forster, Asja Fischer
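For context, for a latent-variable model $p_\theta(x,z)$ with variational distribution $q_\phi(z\mid x)$, the standard ELBO can be written as
$$ \mathcal{L}(\theta,\phi) \;=\; \mathbb{E}_{q_\phi(z\mid x)}\big[\log p_\theta(x\mid z)\big] \;-\; D_{\mathrm{KL}}\big(q_\phi(z\mid x)\,\big\|\,p_\theta(z)\big) \;=\; \mathbb{E}_{q_\phi}\big[\log p_\theta(x,z)\big] \;+\; H\big[q_\phi(z\mid x)\big], $$
which already exposes one entropy term; the results above state that, at convergence and for exponential-family graphical models, the remaining terms also reduce to (positive and negative) entropies, in the sense made precise in the talk.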
This talk will present a non-asymptotic analysis of recurrent neural networks trained with gradient descent for the regression problem in the kernel regime without massive overparameterization. Our in-depth analysis (i) provides sharp bounds on the network size and iteration complexity in terms of the sequence length, sample size and ambient dimension, and (ii) identifies the significant impact of long-term dependencies in the dynamical system on the performance bounds, characterized by a cutoff point that depends on the modulus of continuity of the activation function. Remarkably, this analysis reveals that an appropriately initialized recurrent neural network trained with $n$ samples can achieve near-optimality with a network size $m$ that scales only logarithmically with $n$. This sharply contrasts with prior works that require a high-order polynomial dependency of $m$ on $n$ to establish strong regularity conditions. Our results are based on an explicit characterization of the class of dynamical systems that can be approximated and learned by recurrent neural networks via norm-constrained transportation mappings, and on establishing local smoothness properties of the hidden state with respect to the learnable parameters.
Neural networks with a large number of parameters do not overfit, owing to an implicit regularization that favors good networks. This challenges the adequacy of parameter count as a measure of complexity. To understand this implicit regularization, we study the geometry of the network's output set. Its local dimension, called the batch functional dimension, varies across parameter space. The batch functional dimension is almost surely determined by the hidden layers' activation patterns and remains invariant under network parameterization symmetries. Empirical evidence shows that the batch functional dimension decreases during optimization, whether computed on training inputs or test inputs, resulting in parameters with low batch functional dimension. We call this phenomenon geometry-induced implicit regularization. The batch functional dimension depends on both the network parameters and the inputs. We investigate the maximum achievable batch functional dimension under varying inputs for fixed parameters. We demonstrate that this quantity, termed the computable full functional dimension, also remains invariant under network parameterization symmetries and is determined by the achievable activation patterns. In addition, we present a sampling theorem that highlights the rapid convergence of estimates of the computable full functional dimension. Empirical findings suggest that it is close to the number of parameters, which relates to the local identifiability of the parameters.
Neural networks are renowned for their ability to memorize datasets, often achieving near-zero training loss via gradient descent optimization. Despite this capability, they also demonstrate remarkable generalization to new data. This paper studies the generalization behavior of neural networks trained with the logistic loss through the lens of algorithmic stability. Our focus lies on the neural tangent regime, where network weights move a constant distance from initialization to the solution achieving minimal training loss. Our main finding reveals that under NTK-separability, optimal test loss bounds are achievable if the network width is at least poly-logarithmically large with respect to the number of training samples. This departs from existing generalization results based on algorithmic stability, which typically require polynomial width and yield suboptimal rates, and underscores the significance of our approach. Moreover, our analysis presents improved generalization bounds and width lower bounds compared to prior works employing alternative methods such as uniform convergence via Rademacher complexity. The key to this improvement lies in leveraging the Hessian information of the objective function along the gradient descent iterates. We demonstrate that neural networks of sufficiently large width trained with the logistic loss satisfy an approximate quasi-convexity property along the gradient descent path. To demonstrate the practical implications of our findings, we specialize our analysis to an XOR dataset, where we present refined width conditions.
Biography: Hossein Taheri received the B.Sc. degree in electrical engineering and mathematics from the Sharif University of Technology, Tehran, Iran, in 2018. He is currently pursuing the Ph.D. degree in electrical and computer engineering under the guidance of Christos Thrampoulidis at the University of California at Santa Barbara. His main area of research is statistical learning and optimization.
In this talk, I will present a framework, inspired by random matrix theory, for analyzing the dynamics of stochastic optimization algorithms (e.g., stochastic gradient descent (SGD) and momentum (SGD + M)) when both the number of samples and dimensions are large. Using this new framework, we show that the dynamics of optimization algorithms on generalized linear models and multi-index problems with random data become deterministic in the large sample and dimensional limit. In particular, the limiting dynamics for stochastic algorithms are governed by an ODE. In the least squares setting, from this model, we identify a stability measurement, the implicit conditioning ratio (ICR), which regulates the ability of SGD+M to accelerate the algorithm. When the batch size exceeds this ICR, SGD+M converges linearly at a rate of $\mathcal{O}(1/\sqrt{\kappa})$, matching optimal full-batch momentum (in particular, performing as well as full-batch momentum but with a fraction of the batch size). For batch sizes smaller than the ICR, in contrast, SGD+M has rates that scale like a multiple of the single-batch SGD rate. We give explicit choices for the learning rate and momentum parameter in terms of the Hessian spectra that achieve this performance. Finally, we show that this model matches performance on real datasets.
Gradient-based methods are the workhorse behind modern machine learning. While they have been extensively studied in the basic framework of supervised learning, they are far less understood in the framework of optimal control, which in its broadest form is equivalent to reinforcement learning. There, algorithms that learn a policy via gradient updates are known as policy gradient methods. In this talk, I will present two recent works analyzing the optimization dynamics and implicit bias of policy gradient, in different contexts. The first work identifies a vanishing gradients problem that occurs when using policy gradient to finetune language models. I will demonstrate the detrimental effects of this phenomenon and present possible solutions. The second work characterizes how the implicit bias of policy gradient affects extrapolation to initial states unseen in training, focusing on the fundamental Linear Quadratic Regulator (LQR) control problem. Overall, our results highlight that the optimization dynamics and implicit bias of policy gradient can substantially differ from those of gradient-based methods in supervised learning, hence require dedicated study.
Works covered in the talk were in collaboration with Nadav Cohen, Yotam Alexander, Edo Cohen-Karlik, Raja Giryes, Amir Globerson, Hattie Zhou, Omid Saremi, Vimal Thilak, Arwen Bradley, Preetum Nakkiran, Joshua Susskind, and Etai Littwin.
Machine learning models, while effective in controlled environments, can fail catastrophically when exposed to unexpected conditions upon deployment. This lack of robustness, well-documented even in state-of-the-art models, can lead to severe harm in high-stakes, safety-critical application domains such as healthcare. This shortcoming raises a central question: How can we develop machine learning models we can trust?
In this talk, I will approach this question from a probabilistic perspective and address deficiencies in the trustworthiness of neural network models using Bayesian principles. Specifically, I will show how to improve the reliability and fairness of neural networks with data-driven domain-informed prior distributions over model parameters. To do so, I will first demonstrate how to perform training in neural networks with such priors using a simple variational optimization objective with a regularizer that reflects the constraints implicitly encoded in the prior. This regularizer is mathematically simple, easy to implement, and can be used as a drop-in replacement for existing regularizers when performing supervised learning in neural networks of any size. I will then show how to construct and use domain-informed data-driven priors to improve uncertainty quantification and group robustness in neural network models for selected application domains.
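As a hedged sketch of the general recipe (a task loss plus a drop-in regularizer derived from a prior over parameters), the code below uses a simple Gaussian prior centered at an assumed prior mean; the domain-informed, data-driven priors and the variational objective from the talk are more elaborate than this stand-in.

```python
import torch
import torch.nn as nn

def prior_regularizer(model, prior_mean, prior_var):
    # Squared distance of the weights to a (hypothetical) Gaussian prior mean,
    # scaled by the prior variance; stands in for the prior-derived regularizer.
    reg = 0.0
    for p, m in zip(model.parameters(), prior_mean):
        reg = reg + ((p - m) ** 2).sum() / (2.0 * prior_var)
    return reg

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
prior_mean = [torch.zeros_like(p) for p in model.parameters()]  # assumed prior mean
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

x, y = torch.randn(128, 20), torch.randint(0, 2, (128,))        # toy data
for _ in range(100):
    opt.zero_grad()
    loss = ce(model(x), y) + 1e-3 * prior_regularizer(model, prior_mean, prior_var=1.0)
    loss.backward()
    opt.step()
```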
Sharpness of minima is a promising quantity that can correlate with generalization in deep networks and, when optimized during training, can improve generalization. However, standard sharpness is not invariant under reparametrizations of neural networks, and, to fix this, reparametrization-invariant sharpness definitions have been proposed, most prominently adaptive sharpness (Kwon et al., 2021). But does it really capture generalization in modern practical settings? We comprehensively explore this question in a detailed study of various definitions of adaptive sharpness in settings ranging from training from scratch on ImageNet and CIFAR-10 to fine-tuning CLIP on ImageNet and BERT on MNLI. We focus mostly on transformers, for which little is known in terms of sharpness despite their widespread usage. Overall, we observe that sharpness does not correlate well with generalization but rather with some training parameters, like the learning rate, that can be positively or negatively correlated with generalization depending on the setup. Interestingly, in multiple cases, we observe a consistent negative correlation of sharpness with out-of-distribution error, implying that sharper minima can generalize better. Finally, we illustrate on a simple model that the right sharpness measure is highly data-dependent, and that we do not understand this aspect well for realistic data distributions.
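For orientation, one common worst-case form of adaptive sharpness (following Kwon et al., 2021), assuming the usual elementwise scaling $c = |w|$, is
$$ S^{\rho}_{\mathrm{adapt}}(w) \;=\; \max_{\|\delta \oslash c\|_p \le \rho}\; L(w+\delta) - L(w), $$
where $\oslash$ denotes elementwise division; rescaling perturbations by $|w|$ is what makes the measure invariant to the multiplicative reparametrizations that leave a ReLU network's function unchanged, unlike standard sharpness.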
Statistical learning theory provides bounds on the necessary number of training samples needed to reach a prescribed accuracy in a learning problem formulated over a given target class. This accuracy is typically measured in terms of a generalization error, that is, an expected value of a given loss function. However, for several applications -- for example in a security-critical context or for problems in the computational sciences -- accuracy in this sense is not sufficient. In such cases, one would like to have guarantees for high accuracy on every input value, that is, with respect to the uniform norm. In this paper we precisely quantify the number of training samples needed for any conceivable training algorithm to guarantee a given uniform accuracy on any learning problem formulated over target classes containing (or consisting of) ReLU neural networks of a prescribed architecture. We prove that, under very general assumptions, the minimal number of training samples for this task scales exponentially both in the depth and the input dimension of the network architecture. As a corollary we conclude that the training of ReLU neural networks to high uniform accuracy is intractable. In a security-critical context this points to the fact that deep learning based systems are prone to being fooled by a possible adversary. We corroborate our theoretical findings by numerical results.
ArXiv / ICLR'23: https://arxiv.org/abs/2205.13531
https://openreview.net/forum?id=nchvKfvNeX0
Github: https://github.com/juliusberner/theory2practice
https://jberner.info/
I will revisit the problem of maximizing information divergence from a new perspective using logarithmic Voronoi polytopes. We will see that for linear models, the maximum is always achieved at the boundary of the probability simplex. For toric models, I will describe an algorithm that combines the combinatorics of the chamber complex with numerical algebraic geometry. I will pay special attention to reducible models and models of maximum likelihood degree one, with many colorful examples. This talk is based on joint work with Serkan Hoşten.
We develop an analytical framework to characterize the set of optimal ReLU neural networks by reformulating the non-convex training problem as a convex program. We show that the global optima of the convex parameterization are given by a polyhedral set and then extend this characterization to the optimal set of the non-convex training objective. Since all stationary points of the ReLU training problem can be represented as optima of sub-sampled convex programs, our work provides a general expression for all critical points of the non-convex objective. We then leverage our results to provide an optimal pruning algorithm for computing minimal networks, establish conditions for the regularization path of ReLU networks to be continuous, and develop sensitivity results for minimal ReLU networks.
Model reparametrization, which follows the change-of-variable rule of calculus (not to be confused with weight-space symmetry), is a popular way to improve the training of neural nets, e.g. in WeightNorm. But it can also be problematic since it can induce inconsistencies in, e.g., Hessian-based flatness measures, optimization trajectories, and modes of probability densities. This complicates downstream analyses: e.g. one cannot definitively relate flatness with generalization since arbitrary reparametrization changes their relationship. In this talk, I will present a study of the invariance of neural nets under reparametrization from the perspective of Riemannian geometry. From this point of view, invariance is an inherent property of any neural net if one explicitly represents the metric and uses the correct associated transformation rules. This is important since although the metric is always present, it is often implicitly assumed as identity, and thus dropped from the notation, then lost under reparametrization. I will discuss implications for measuring the flatness of minima, optimization, and for probability-density maximization. As a bonus, I will also give a teaser of our other recent work in exploiting the geometry of preconditioning matrices to develop an inverse-free, structured KFAC-like second-order optimization method for very large, modern neural nets like transformers. The resulting method is thus numerically stable in low precision and also memory efficient.
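As a concrete instance of the transformation rules in question (standard differential geometry, not specific to any one paper): if $\theta = \varphi(\psi)$ is a reparametrization with Jacobian $J = \partial\varphi/\partial\psi$ and $\tilde L(\psi) = L(\varphi(\psi))$, then the gradient and the metric pull back as
$$ \nabla_\psi \tilde L = J^\top \nabla_\theta L, \qquad G_\psi = J^\top G_\theta J, $$
so the Riemannian gradient transforms as a tangent vector, $G_\psi^{-1}\nabla_\psi \tilde L = J^{-1} G_\theta^{-1}\nabla_\theta L$, and quantities built with the metric are invariant; silently taking $G = I$ in both coordinate systems is what breaks this invariance.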
This talk will discuss some nontrivial but often pleasant effects of large learning rates, which are commonly used in machine learning practice for improved empirical performances, but defy traditional theoretical analyses. I will first quantify how large learning rates can help gradient descent in multiscale landscape escape local minima, via chaotic dynamics, which provides an alternative to the commonly known escape mechanism due to noises from stochastic gradients. I will then report how large learning rates provably bias toward flatter minimizers. Several related, perplexing phenomena have been empirically observed recently, including Edge of Stability, loss catapulting, and balancing. I will, for the first time, unify them and explain that they are all algorithmic implicit biases of large learning rates. These results are enabled by a first global convergence proof of gradient descent for certain nonconvex functions without Lipschitz gradient. This theory will also provide understanding of when there will be Edge of Stability and other large learning rate implicit biases.
Designing effective positional encodings for graphs is key to building powerful graph transformers and enhancing the expressive power of message-passing graph neural networks. However, since graph-structured data lacks a canonical ordering of nodes, the choice of positional encodings for graphs is often tricky. For example, Laplacian eigenmaps are used as positional encodings in many works. However, they face two fundamental challenges: (1) non-uniqueness: there are many different eigendecompositions of the same Laplacian, and (2) instability: small perturbations to the Laplacian can result in completely different eigenvectors, leading to unpredictable changes in the positional encoding. This instability is governed by the Davis-Kahan theorem and further hurts model generalization. In this talk, we introduce some ideas for building stable positional encodings and show their benefits for model out-of-distribution generalization (a minimal illustration of the eigenvector ambiguity is sketched after the paper list below). The idea can be extended to some other types of node positional encodings. Finally, we evaluate the effectiveness of our method on molecular property prediction, link prediction, and out-of-distribution generalization tasks, finding improved generalization compared to existing positional encoding methods.
I will mainly talk about these two papers:
1. "Equivariant and Stable Positional Encoding for More Powerful Graph Neural Networks", by Haorui Wang, Haoteng Yin, Muhan Zhang, Pan Li
2. "On the Stability of Expressive Positional Encodings for Graphs", by Yinan Huang, William Lu, Joshua Robinson, Yu Yang, Muhan Zhang, Stefanie Jegelka, Pan Li
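Here is the minimal illustration promised above: a hedged numpy sketch of plain Laplacian-eigenmap positional encodings on a toy graph, showing that a sign flip gives an equally valid eigendecomposition (this is the ambiguity the papers address; it is not their stable encoding).

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)       # adjacency of a 4-node graph
L = np.diag(A.sum(axis=1)) - A                  # combinatorial Laplacian
eigvals, eigvecs = np.linalg.eigh(L)
pe = eigvecs[:, 1:3]                            # first nontrivial eigenvectors as PEs

# Flipping the sign of any eigenvector yields an equally valid decomposition,
# so a model consuming `pe` directly sees a different encoding for the same graph.
pe_flipped = pe * np.array([1, -1])
print(np.allclose(L @ pe, pe * eigvals[1:3]))                   # True
print(np.allclose(L @ pe_flipped, pe_flipped * eigvals[1:3]))   # True
```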
Neural networks with random weights appear in a variety of machine learning applications, most prominently as the initialization of many deep learning algorithms and as a computationally cheap alternative to fully learned neural networks. The first goal of this talk is to enhance the theoretical understanding of random neural networks by addressing the following data separation problem: under which conditions can a random neural network make two classes (with positive distance) linearly separable? I will show that a sufficiently large two-layer ReLU-network with Gaussian weights and uniformly distributed biases can solve this problem with high probability. Building on the insights behind this result, I will next present a simple randomized algorithm to produce a small interpolating neural net for a given dataset with two classes. In both results, the size of the network is explicitly linked to geometric properties of the two classes and their mutual arrangement. This instance-specific viewpoint allows us to overcome the curse of dimensionality and to prove memorization results beyond worst-case lower bounds.
In many problems arising in data science and scientific engineering, one must reconstruct a signal from few corrupted measurements, a class of problems known more broadly as inverse problems. Due to the ill-posedness of such problems, one requires a prior: a mathematical model of what makes some signals natural. Recently, there have been great empirical strides in deep generative models, a class of neural network-based models that learn to sample signals from a distribution of interest. Compared to classical priors like sparsity, generative priors lack the same understanding of when and why they work well, especially in relation to notions such as statistical complexity and computational efficiency. Here, I will discuss some theory and applications of using such priors in inverse problems. I will first present a rigorous recovery guarantee for generative priors in the nonlinear inverse problem of phase retrieval. I will show that generative models enable an efficient algorithm from generic measurements with optimal sample complexity. In the second part of the talk, I will address how one can exploit ideas from this approach to tackle problems where ground-truth data may be unavailable in practice, including problems arising from black hole imaging.
I will introduce the idea of phases and phase transitions of the Bayesian posterior in the setting of singular learning theory, and discuss how a simple auto-encoder model introduced by Anthropic in their research on neural network interpretability displays a rich set of phase transitions in both the posterior and over the course of training. I’ll explain a research program we term “developmental” interpretability that is aiming to use phase transitions as the basic primitive for understanding the internal structure of computation in neural networks.
Despite the widespread proliferation of neural networks, the mechanisms through which they operate so successfully are not well understood. In this talk, we will first explore empirical and theoretical investigations into neural network training and generalization and what they can tell us about why deep learning works. Then, we will examine a recent line of work on algorithm learning. While typical neural networks are designed for pattern matching tasks, we consider whether neural networks can learn algorithms that scale to problem instances orders of magnitude larger than those seen during training.
In this talk, I'll discuss recent work studying benign overfitting in two-layer ReLU networks trained using gradient descent and hinge loss on noisy data for binary classification. In particular, we consider linearly separable data for which a relatively small proportion of labels are corrupted, and identify conditions on the margin of the clean data that give rise to three distinct training outcomes: benign overfitting, in which zero loss is achieved and with high probability test data is classified correctly; overfitting, in which zero loss is achieved but test data is misclassified with probability lower bounded by a constant; and non-overfitting, in which clean points, but not corrupt points, achieve zero loss and again with high probability test data is classified correctly. Our analysis provides a fine-grained description of the dynamics of neurons throughout training and reveals two distinct phases: in the first phase, clean points achieve close to zero loss; in the second phase, clean points oscillate on the boundary of zero loss while corrupt points either converge towards zero loss or are eventually zeroed by the network. We prove these results using a combinatorial approach that involves bounding the number of clean versus corrupt updates across these phases of training.
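For reference, the hinge loss used here, with labels $y \in \{\pm 1\}$ and network output $f(x)$, is
$$ \ell\big(y f(x)\big) = \max\{0,\; 1 - y f(x)\}, $$
so a point attains zero loss exactly when its margin satisfies $y f(x) \ge 1$; the phases described above track which points (clean versus corrupt) have reached this margin and which are still driving the updates.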
The convergence of scientific computing with statistics and machine learning is an exciting recent development. In this talk, I will present two lines of work that blur the line between statistical inference and numerical computation. The first part of the talk presents state-of-the-art algorithms for solving elliptic PDEs by interpreting them as Gaussian processes and exploiting their conditional independence properties. This approach allows the efficient learning of elliptic solution operators from solution pairs, with promising empirical results on fractional-order PDEs and empirical closure models of turbulent flows.
The second part of the talk discusses how to mitigate the formation of shock singularities in the barotropic Euler equations using an inviscid regularization. This work combines the seminal work of Vladimir Arnold on geometric hydrodynamics with ideas from interior point methods for positive definite programming and the information geometry of Amari and Chentsov.
Recent years have seen substantial interest in a first-principles theoretical understanding of the behavior of overparameterized models that interpolate noisy training data, based on their surprising empirical success. In this talk, I compare classification and regression tasks in the overparameterized linear model. On the one hand, we show that with sufficient overparameterization, solutions obtained by training on the squared loss (i.e., minimum-norm interpolation), typically used for regression, are identical to those produced by training on exponentially- and polynomially-tailed losses (e.g., the max-margin support vector machine), typically used for classification. On the other hand, we show that there exist regimes where these solutions are consistent when evaluated by the 0-1 test loss, but inconsistent if evaluated by the mean-squared-error test loss. Our results demonstrate that: a) different loss functions at the training (optimization) phase can yield similar solutions, and b) a significantly higher level of effective overparameterization admits good generalization in classification tasks compared to regression tasks.
Deep learning has been wildly successful in practice and most state-of-the-art artificial intelligence systems are based on neural networks. Lacking, however, is a rigorous mathematical theory that adequately explains the amazing performance of deep neural networks. In this talk, I present a new mathematical framework that provides the beginning of a deeper understanding of deep learning. This framework precisely characterizes the functional properties of trained neural networks. The key mathematical tools which support this framework include transform-domain sparse regularization, the Radon transform of computed tomography, and approximation theory. This framework explains the effect of weight decay regularization in neural network training, the importance of skip connections and low-rank weight matrices in network architectures, the role of sparsity in neural networks, and explains why neural networks can perform well in high-dimensional problems.
Training deep neural networks for classification often includes minimizing the training loss beyond the zero training error point. In this phase of training, a "neural collapse" (NC) behavior has been empirically observed: the variability of the features (outputs of the penultimate layer) of within-class samples decreases, and the mean features of different classes approach a certain tight frame structure. Recent papers have shown that minimizers with this structure emerge when optimizing a simplified "unconstrained features model" (UFM) with a regularized cross-entropy loss. In this talk, I will start with a review of the empirical findings on the NC phenomenon. Then, I will present some of our theoretical results (from our publications at ICML'22 and ICML'23). Namely, we show that the minimizers of a UFM exhibit a (somewhat more delicate) NC structure also with the regularized MSE loss, and we analyze the gradient flow in this case, identifying the different dynamics of the features' within- and between-class covariance matrices. Then, we extend the UFM in two different ways to provide mathematical reasoning for the depthwise empirical NC behavior and for the effect of regularization hyperparameters on the closeness to collapse. We support our theory with experiments in practical deep learning settings.
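A common formulation of the UFM, given here only for orientation (regularization details vary between papers), treats the penultimate-layer features $h_i$ as free optimization variables alongside the last-layer weights $W$:
$$ \min_{W,\,H}\; \frac{1}{N}\sum_{i=1}^{N} \mathcal{L}\big(W h_i,\, y_i\big) \;+\; \frac{\lambda_W}{2}\|W\|_F^2 \;+\; \frac{\lambda_H}{2}\|H\|_F^2, $$
with $\mathcal{L}$ the cross-entropy or, as in the results above, the MSE loss.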
In this talk, we will propose a new model and algorithms that go beyond worst-case rates in sequential prediction. In particular, we will focus on sequential prediction over a stochastic sequence with an adversary that is allowed to inject clean-label adversarial (or out-of-distribution) examples as and when they desire. Traditional algorithms designed to handle purely stochastic data tend to fail in the presence of such adversarial examples, often leading to erroneous predictions, whereas, assuming fully adversarial data leads to very pessimistic bounds that are often vacuous in practice. To mitigate this, we will introduce a new model of sequential prediction that sits between the purely stochastic and fully adversarial settings by allowing the learner to abstain from making a prediction at no cost on adversarial examples. Assuming access to the marginal distribution on the non-adversarial examples, we will present a learner whose error scales with the VC dimension (mirroring the stochastic setting) of the hypothesis class, as opposed to the Littlestone dimension which characterizes the fully adversarial setting. Furthermore, we will design learners for certain special hypothesis classes including VC dimension 1 classes, which work even in the absence of access to the marginal distribution. We will conclude with several open questions and plausible connections of our framework with existing models. This is based on joint work with Steve Hanneke, Shay Moran, and Abhishek Shetty.
Inspired by the remarkable success of deep neural networks, there has been significant interest in understanding the generalization performance of overparameterized models. Substantial efforts have been invested in characterizing how optimization algorithms impact generalization through their “preferred” solutions, a phenomenon commonly referred to as implicit regularization. For instance, it has been argued that gradient descent (GD) induces an implicit $\ell_2$-norm regularization in regression and classification problems. However, in the prior literature, the implicit regularization of different algorithms is confined to either a specific geometry or a particular class of learning problems. To address this gap, we present a unified approach using mirror descent (MD), a notable generalization of GD, to control implicit regularization in both regression and classification settings. More specifically, we show that MD with the general class of homogeneous potential functions converges in direction to a generalized maximum-margin solution for linear classification problems, thereby answering a long-standing question in the classification setting. Furthermore, MD can be implemented efficiently, and we demonstrate through experiments that MD is a versatile method to produce learned models with different regularizers and different generalization performances.
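For reference, with a strictly convex potential $\psi$, mirror descent iterates
$$ \nabla\psi(w_{t+1}) \;=\; \nabla\psi(w_t) \;-\; \eta_t\, \nabla L(w_t), $$
and gradient descent is recovered for $\psi(w) = \tfrac12\|w\|_2^2$; the choice of potential (e.g., a homogeneous one, as above) is what determines which regularizer the implicit bias corresponds to.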
How do we ensure that the machine learning algorithms in high-stakes applications, such as hiring, lending, admissions, etc., are fair, explainable, and lawful? Towards addressing this urgent question, this talk will provide some strategies that are deep-rooted in information theory, causality, and statistics. I will discuss a question that bridges the fields of fairness, explainability, and law: how do we check if the disparity in a model is purely due to critical occupational necessities or not? We propose a systematic measure of the legally non-exempt disparity, that brings together a body of work in information theory called Partial Information Decomposition and connects it with causality. I will also briefly talk about some of our other research interests in related topics, such as quantifying accuracy-fairness tradeoffs using Chernoff Information, and also robust counterfactual explanations with theoretical guarantees.
I will describe recent progress on providing rigorous convergence guarantees for score-based generative models (SGMs) such as denoising diffusion probabilistic models (DDPMs), which constitute the backbone of large-scale real-world generative models such as DALL⋅E 2. In the first part of the talk, I will show that such SGMs can efficiently sample from essentially any realistic data distribution, even ones which are highly non-log-concave. In the second part of the talk, I will show how to extend these guarantees to deterministic samplers based on discretizing the so-called probability flow ODE, which ultimately leads to better dependence on the dimension. All of these results assume access to an oracle for score estimation; time permitting, at the end I will briefly touch upon how to provably implement this oracle for interesting classes of distributions like Gaussian mixtures.
Based on the following works: https://arxiv.org/abs/2209.11215, https://arxiv.org/abs/2303.03384, https://arxiv.org/abs/2305.11798, https://arxiv.org/abs/2307.01178
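For context (the standard SDE formulation of score-based generative modeling, not specific to these papers): the data are noised by a forward process $\mathrm{d}X_t = f(X_t,t)\,\mathrm{d}t + g(t)\,\mathrm{d}B_t$, and samples are produced either with the time-reversed stochastic diffusion
$$ \mathrm{d}X_t = \big[f(X_t,t) - g(t)^2\,\nabla_x \log p_t(X_t)\big]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar B_t, $$
or with the deterministic probability flow ODE
$$ \frac{\mathrm{d}X_t}{\mathrm{d}t} = f(X_t,t) - \tfrac12\, g(t)^2\,\nabla_x \log p_t(X_t), $$
with the unknown score $\nabla_x \log p_t$ replaced by a learned estimate; the two parts of the talk analyze discretizations of these two samplers, respectively.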
In practice, encoding group invariances into models improves sample complexity. In this talk, we study this phenomenon from a theoretical perspective. In the first part, we provide minimax optimal rates for kernel ridge regression on compact manifolds, with a target function that is invariant to a Lie group action on the manifold. For a finite group, the gain effectively multiplies the number of samples by the group size. For groups of positive dimension, the gain is observed by a reduction in the manifold’s dimension, in addition to a factor proportional to the volume of the quotient space. Our proof takes the viewpoint of differential geometry, in contrast to the more common strategy of using invariant polynomials.
In the second part, we apply our techniques to group-invariant probability distributions that appear in many data-generative models in machine learning, such as graphs, point clouds, and images, and we prove tight sample complexity bounds for Wasserstein distance estimation under invariances, as well as for the density estimation problem.
This talk is based on joint work with Stefanie Jegelka.
Recent theoretical work has shown that under certain conditions, massively overparameterized neural networks are equivalent to kernel regressors with a family of kernels called Neural Tangent Kernels (NTKs). My work in this subject aims to better understand the properties of the NTK for various network architectures and relate them to the inductive bias of real neural networks. In particular, I will argue that for input data distributed uniformly on the sphere, the NTK favors low-frequency predictions over high-frequency ones, potentially explaining why overparameterized networks can generalize even when they perfectly fit their training data. I will further discuss the behavior of the NTK when data is distributed nonuniformly and show that the NTK (with ReLU activation) is tightly related to the classical Laplace kernel, which has a simple closed form. Finally, I will discuss our analysis of the NTK for convolutional networks, which indicates that these networks are biased toward learning low-frequency target functions with any higher frequencies concentrated in local regions. Overall, our results suggest that much insight about neural networks can be obtained from the analysis of the NTK.
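As a small illustration of the object under study, the sketch below computes the empirical NTK of a toy two-layer ReLU network by taking inner products of parameter gradients; the network, scaling, and variable names are illustrative assumptions, not the architectures analysed in the talk.

```python
import numpy as np

def empirical_ntk(x1, x2, W, a):
    """Empirical NTK of f(x) = a @ relu(W @ x) / sqrt(m) for a width-m ReLU network.
    Toy sketch for illustration only."""
    m = W.shape[0]

    def param_grad(x):
        pre = W @ x
        grad_a = np.maximum(pre, 0.0) / np.sqrt(m)                     # df/da
        grad_W = (a * (pre > 0.0))[:, None] * x[None, :] / np.sqrt(m)  # df/dW
        return np.concatenate([grad_a, grad_W.ravel()])

    # The empirical NTK is the inner product of parameter gradients at x1 and x2.
    return param_grad(x1) @ param_grad(x2)
```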
Algorithmic reasoning requires capabilities which are most naturally understood through recurrent models of computation, like the Turing machine. However, Transformer models, while lacking recurrence, are able to perform such reasoning using far fewer layers than the number of reasoning steps. This raises the question: what solutions are these shallow and non-recurrent models finding? In this talk, we will formalize reasoning in the setting of automata, and show that the computation of an automaton on an input sequence of length T can be replicated exactly by Transformers with o(T) layers, which we call "shortcuts". We provide two constructions, with O(log T) layers for all automata and O(1) layers for solvable automata. Empirically, our results from synthetic experiments show that shallow solutions can also be found in practice.
Understanding the training dynamics of neural networks is quite difficult in general due to the highly nonlinear nature of the parameterization. A breakthrough in the theory of deep learning was the finding that in the infinite-width limit the gradient descent dynamics are characterized by a fixed kernel, coined the “Neural Tangent Kernel” (NTK). In this limiting regime, the network is biased to learn the eigenvectors/eigenfunctions of the NTK at rates corresponding to their eigenvalues, a phenomenon known as “spectral bias”. Considerable work has been done comparing the training dynamics of finite-width networks to the idealized infinite-width dynamics. These works typically compare the dynamics of a finite-width network to the dynamics of an infinite-width network where both networks are optimized via the empirical risk. In this work we compare a finite-width network trained on the empirical risk to an infinite-width network trained on the population risk. Consequently, we are able to demonstrate that the finite-width network is biased towards learning the top eigenfunctions of the NTK over the entire input domain, as opposed to describing the dynamics merely on the training set. Furthermore, we demonstrate that this holds in a regime where the network width is on the same order as the number of training samples, in contrast with prior works that require the unrealistic assumption that the network width is polynomially large in the number of samples. In a separate line of analysis, we characterize the spectrum of the NTK by expressing the NTK as a power series. We demonstrate that the NTK has a small number of large outlier eigenvalues and that the number of such eigenvalues is largely inherited from the structure of the input data. As a result, we shed further light on why the network prefers to learn a small number of components more quickly. In total, our results help classify the properties networks are biased towards in a variety of settings, which we hope will lead to more interpretable artificial intelligence in the long term.
High dimensional probability is a powerful tool to understand phenomena that typically occur in deep learning models. In particular, in this talk, we discuss different concentration bounds that entail consequences about (i) the gradient descent optimization of a deep neural network, (ii) the adversarial robustness of its solution, and (iii) its privacy guarantees. The first main result implies that minimally overparameterized deep neural networks can be successfully optimized with gradient descent (reaching zero training loss), and leverages tight lower bounds on the smallest eigenvalue of the neural tangent kernel (NTK). The second builds on the recently proposed universal law of robustness, provides sharper bounds on the random features and NTK models, and consequently addresses the conjecture posed by Bubeck, Li and Nagaraj. Finally, we draw a connection between the generalization performance of a model and its privacy guarantees, measured in terms of its safety against a family of information recovery black-box attacks.
Studying the generalization abilities of linear models with real data is a central question in statistical learning. While a limited number of prior works do validate theoretical results with real data, these works have limitations due to technical assumptions, such as a well-conditioned covariance matrix and independent and identically distributed data. Additionally, prior works that do address distributional shifts usually make technical assumptions on the joint distribution of the train and test data, and do not test on real data. Previous work has also shown that double descent can occur in the over-parameterized regime, while it is commonly believed that the standard bias-variance trade-off holds in the under-parameterized regime.
In an attempt to address these issues and better model real data, we look at data that is not i.i.d. but has a low-rank structure. Further, we address distributional shift by decoupling assumptions on the training and test distributions. We provide analytical formulas for the generalization error of the denoising problem that are asymptotically exact. These are used to derive theoretical results for linear regression, data augmentation, principal component regression, and transfer learning. We validate all of our theoretical results on real data and obtain a low relative mean squared error of around 1% between the empirical risk and our estimated risk. Further, we present a simple example in this paradigm that provably exhibits double descent in the under-parameterized regime.
It was observed that large language models exhibit a power-law decay of cross entropy with respect to the number of parameters and training tokens. When extrapolated literally, this decay implies that the entropy rate of natural language is zero. To better understand whether this is a genuine phenomenon or an artifact, we construct a simple stationary stochastic process and its memory-based predictor that exhibit a power-law decay of cross entropy together with a vanishing entropy rate. Our example is based on previously discussed Santa Fe processes, which decompose a random text into a process of narration and time-independent knowledge. Previous discussions assumed that narration is a memoryless source with Zipf's distribution. In this talk, we propose a model of narration that has a vanishing entropy rate and uses a randomly chosen deterministic sequence called a multiperiodic sequence. Under a suitable parameterization, multiperiodic sequences exhibit asymptotic relative frequencies given by Zipf's law. Remaining agnostic about the value of the entropy rate of natural language, we discuss the relevance of similar constructions for language modeling.
Based on: https://arxiv.org/abs/2302.09049
Pruning neural networks at initialization (PaI) has received an upsurge of interest due to its end-to-end saving potential. PaI is able to find sparse subnetworks at initialization that can achieve comparable performance to the full networks. These methods can surpass the trivial baseline of random pruning but suffer from a significant performance gap compared to post-training pruning. Previous approaches firmly rely on weights, gradients, and sanity checks as primary signals when conducting PaI analysis. To better understand the underlying mechanism of PaI, we propose to interpret it through the lens of the Ramanujan graph - a class of expander graphs that are sparse while being highly connected. It is often believed there should be a strong correlation between the Ramanujan graph and PaI since both are about finding sparse and well-connected neural networks. However, the finer-grained link relating highly sparse and connected networks to their relative performance (i.e., the ranking of different sparse structures at the same global sparsity) is still missing. We observe that not only does the Ramanujan property of sparse networks show no significant relationship to PaI's relative performance, but maximizing it can also lead to the formation of pseudo-random graphs with no structural meaning. We reveal the underlying cause to be the Ramanujan graph's strong assumption on the upper bound of the largest nontrivial eigenvalue ($\mu$) of layers belonging to highly sparse networks. We hence propose Iterative Mean Difference of Bound (IMDB) as a means to relax the $\mu$ upper bound. Likewise, we also show there exists a lower bound for $\mu$, which we call the Normalized Random Coefficient (NaRC), that gives an accurate assessment of when a sparse but highly connected structure degenerates into naive randomness. Finally, we systematically analyze the behavior of various PaI methods and demonstrate the utility of our proposed metrics in characterizing PaI performance. We show that subnetworks that better preserve the IMDB property correlate with higher performance, while NaRC provides a possible means to locate the region where highly connected, highly sparse, and non-trivial Ramanujan expanders exist.
Flatness of the loss curve is conjectured to be connected to the generalization ability of machine learning models, in particular neural networks. While empirical evidence consistently shows a strong correlation between flatness measures and generalization, the theoretical explanation for this connection remains an open problem. Our recent paper, "Relative Flatness and Generalization", proposes an explanation for this relationship by linking it to a model's robustness under feature perturbations and a measure of how representative the training data is of its underlying distribution. We establish a necessary condition for the connection between flatness and generalization to hold.
In this talk, I will provide an overview of our paper's results, focusing on the motivations, insights, and limitations of our approach.
In this talk I will present various insights on the double descent curve obtained by considering a solvable model for deep learning: the random feature model. First, I will present a fine-grained bias-variance decomposition and show how the double descent curve can be reconciled with the traditional bias-variance tradeoff. Then, I will show that two different kinds of overfitting, which are often conflated, can give rise to a “double descent” curve, and can actually occur simultaneously, leading to a triple descent curve. Finally, I will extend some of these findings to classification tasks on structured data, showing the impact of the loss function and the role of low-dimensional structures.
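To make the setting concrete, a minimal random-feature ridge regression that can be swept across the interpolation threshold to trace out a double (or triple) descent curve is sketched below; the activation, scaling, and regularization level are illustrative assumptions, not the exact model from the talk.

```python
import numpy as np

def random_feature_test_error(X_tr, y_tr, X_te, y_te, n_features, lam=1e-8, seed=0):
    """Ridge regression on random ReLU features; sweep n_features past the number of
    training samples to trace out descent curves. Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    d = X_tr.shape[1]
    W = rng.normal(size=(d, n_features)) / np.sqrt(d)   # random projection
    Z_tr = np.maximum(X_tr @ W, 0.0)                    # random ReLU features (train)
    Z_te = np.maximum(X_te @ W, 0.0)                    # random ReLU features (test)
    theta = np.linalg.solve(Z_tr.T @ Z_tr + lam * np.eye(n_features), Z_tr.T @ y_tr)
    return np.mean((Z_te @ theta - y_te) ** 2)
```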
The ability to make sequential decisions under uncertainty is a key component of intelligence. Despite impressive breakthroughs in deep learning in the last decade, we find that scalable and generalizable decision making has so far been elusive for current artificial intelligence (AI) systems. In this talk, I will present a new framework for sequential decision making that is derived from modern generative models for language and perception. We will instantiate our framework in 3 different paradigms for sequential decision making: offline reinforcement learning (RL), online RL, and black-box optimization, and highlight the simplicity and effectiveness of this unifying framework on a range of challenging high-dimensional benchmarks for sequential decision making.
In this talk, I will present the implicit bias of gradient flow on linear equivariant steerable networks in group-invariant binary classification. Our findings reveal that the parameterized predictor converges in direction to the unique group-invariant classifier with a maximum margin defined by the input group action. Under a unitary assumption on the input representation, we establish the equivalence between steerable networks and data augmentation. Furthermore, we demonstrate the improved margin and generalization bound of steerable networks over their non-invariant counterparts. This talk is based on joint work with Ziyu Chen at UMass Amherst.
In this talk, I will discuss our research on understanding the infinite-width limit of neural networks. In this limit, neural networks correspond to Neural Network Gaussian Processes (NNGPs) and Neural Tangent Kernels (NTKs). I will first describe our empirical study exploring the relationship between wide neural networks and neural kernel methods. Our study resolves a variety of open questions related to infinitely wide neural networks and opens up new interesting questions. In the second half of the talk, I will discuss our recent work on scaling up infinite-width neural kernel methods to millions of data points. There are unique challenges in scaling up neural kernel methods, and I will talk about our attempts to overcome them. If there is time, I will discuss some applications of the infinite-width limit of neural networks, such as dataset distillation, neural architecture search, uncertainty quantification, and neural scaling laws.
Score-based generative models and diffusion probabilistic models have exhibited exceptional performance in various applications. A natural question that arises is whether the distribution generated by the model is closely aligned with the given data distribution. In this talk, we will explore an upper bound of the Wasserstein distance between these two distributions. Based on the theory of optimal transport, we guarantee that the framework can approximate data distributions in the space of probability measures equipped with the Wasserstein distance.
This talk is based on joint work with Ying Fan and Kangwook Lee (UW-Madison).
Stochastic finite-sum optimization problems are ubiquitous in many areas such as machine learning, and stochastic optimization algorithms to solve these problems are actively studied in the literature. However, there is a major gap between practice and theory: practical algorithms shuffle and iterate through component indices, while most theoretical analyses of these algorithms assume that the indices are sampled uniformly in an i.i.d. manner. In this talk, we discuss recent research efforts to close this theory-practice gap, focusing on developments in the convergence analysis of shuffling-based optimization methods. We will first consider minimization algorithms, mainly focusing on stochastic gradient descent (SGD) with shuffling; we will then briefly talk about some new progress on minimax optimization methods.
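The practice-theory mismatch described above boils down to the index sequence used in each pass; a minimal sketch for least squares is given below, with shuffle=True corresponding to the practical random-reshuffling scheme and shuffle=False to the i.i.d. with-replacement sampling assumed by most analyses (the loss and variable names are illustrative).

```python
import numpy as np

def sgd_epoch(w, X, y, lr, rng, shuffle=True):
    """One epoch of SGD on the least-squares loss 0.5 * (x_i @ w - y_i)^2.
    shuffle=True: random reshuffling (each index visited once per epoch, as in practice).
    shuffle=False: i.i.d. with-replacement sampling (common theoretical assumption)."""
    n = len(y)
    indices = rng.permutation(n) if shuffle else rng.integers(0, n, size=n)
    for i in indices:
        grad = (X[i] @ w - y[i]) * X[i]   # stochastic gradient of one component
        w = w - lr * grad
    return w
```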
Transport maps characterize probability distributions by coupling random variables using a deterministic transformation. While bijective transport maps enable sampling and density evaluations of the transformed random variable, learning the parameters of such transformations in high dimensions is challenging given few samples from an unknown target distribution. Moreover, structural choices for these transformations can have a significant impact on their performance and the optimization procedure for finding these maps. In this talk, we will present a framework for representing and learning monotone triangular maps via invertible transformations of smooth functions, and will first demonstrate that the associated optimization problem has no spurious local minima, i.e., all local minima are global minima. Second, we propose a sample-efficient adaptive algorithm that constructs a sparse approximation for the map. We demonstrate how this framework can be applied for joint and conditional density estimation, likelihood-free inference, and structure learning of directed graphical models with stable generalization performance across a range of sample sizes.
First-order optimization algorithms are widely used today. Two standard building blocks in these algorithms are proximal operators (proximals) and gradients. Although gradients can be computed for a wide array of functions, explicit proximal formulas are only known for limited classes of functions. We provide an algorithm, HJ-Prox, for accurately approximating such proximals. This is derived from a collection of relations between proximals, Moreau envelopes, Hamilton-Jacobi (HJ) equations, heat equations, and importance sampling. In particular, HJ-Prox smoothly approximates the Moreau envelope and its gradient. The smoothness can be adjusted to act as a denoiser. Our approach applies even when functions are only accessible by (possibly noisy) blackbox samples. Our approach can also be embedded into a zero-order algorithm with guaranteed convergence to global minima, assuming continuity of the objective function; this is done by leveraging connections between the gradient of the Moreau envelope and the proximal operator. We show HJ-Prox is effective numerically via several examples.
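As a rough sketch of the idea (not the exact HJ-Prox formula from the paper; the sampling scale and constants here are illustrative assumptions), the proximal can be approximated by a softmin-weighted average of Gaussian samples around the query point, using only zeroth-order evaluations of f.

```python
import numpy as np

def approx_prox(f, x, t=1.0, delta=0.1, n_samples=10_000, rng=None):
    """Monte Carlo proximal approximation via softmin weights exp(-f(y)/delta),
    with y sampled around x. Sketch in the spirit of HJ-Prox; scalings are assumptions."""
    rng = np.random.default_rng(0) if rng is None else rng
    y = x + np.sqrt(delta * t) * rng.standard_normal((n_samples, x.size))
    log_w = -np.array([f(yi) for yi in y]) / delta   # smaller f(y) => larger weight
    w = np.exp(log_w - log_w.max())                  # stabilised exponentiation
    return (w[:, None] * y).sum(axis=0) / w.sum()
```

For example, `approx_prox(lambda z: np.abs(z).sum(), np.array([2.0, -0.3]))` should roughly recover the soft-thresholding map, the known proximal of the $\ell_1$ norm, as delta shrinks and the sample size grows.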
Policy gradient algorithms are amongst the most widely used reinforcement learning methods and have driven a lot of recent empirical success in applications of reinforcement learning, especially with deep neural network based policies. These methods apply to complex, poorly understood control problems by performing stochastic gradient descent over a parameterized class of policies. However, even for simple control problems solvable by classical dynamic programming techniques, policy gradient algorithms face non-convex optimization problems and are widely understood to converge only to a stationary point. We make fundamental progress in developing a theory for policy gradient methods - providing insights into when and why policy gradients are guaranteed to converge to globally optimal policies. In particular, we identify structural properties of the underlying Markov decision process (MDP) which ensure that despite non-convexity, the policy gradient objective function has no suboptimal stationary points. We also show that the policy gradient objective satisfies a Polyak-Łojasiewicz (gradient dominance) condition when some of these conditions are strengthened, yielding convergence rates. When some of these conditions are relaxed, we provide bounds on the optimality gap of any stationary point. Our results leverage a key connection to policy iteration, a classic and well-studied dynamic programming algorithm, via a policy gradient theorem.
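For reference, gradient dominance conditions of the kind mentioned above take the generic form below; the exact constant, norm, and exponent depend on the structural assumptions on the MDP and the policy parameterization, so this display is only schematic.

\[
V^{*}(\rho) \;-\; V^{\pi_\theta}(\rho) \;\le\; C(\theta)\,\bigl\| \nabla_\theta V^{\pi_\theta}(\rho) \bigr\|^{\alpha},
\]

where $V^{\pi_\theta}(\rho)$ is the expected return from initial distribution $\rho$, $V^{*}$ is the optimal value, and $\alpha = 2$ corresponds to the classical Polyak-Łojasiewicz form while $\alpha = 1$ is a weaker degree-one variant; either version implies that any stationary point of the objective is globally optimal.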
The parameter space for any fixed architecture of neural networks serves as a proxy during training for the associated class of functions - but how faithful is this representation? For any fixed feedforward ReLU network architecture with at least one hidden layer, it is well-known that many different parameter settings can determine the same function. It is less well-known that the degree of this redundancy is inhomogeneous across parameter space. This inhomogeneity should impact the dynamics of training via gradient descent, especially when compared with recent work suggesting that SGD favors flat minima of the loss landscape.
In this talk, I will carefully define the notion of the local functional dimension of a feedforward ReLU network function, discuss the relationship between local functional dimension of a parameter and the geometry of the underlying decomposition of the domain into linear regions, and present some preliminary experimental results suggesting that the probability distribution of the functional dimension at initialization is both interesting and architecture-dependent. Some of this work is joint with Kathryn Lindsey, Rob Meyerhoff, and Chenxi Wu, and some is joint with Kathryn Lindsey and David Rolnick.
We revisit the challenging problem of training Gaussian-Bernoulli restricted Boltzmann machines (GRBMs), introducing two innovations. We propose a novel Gibbs-Langevin sampling algorithm that outperforms existing methods like Gibbs sampling. We also modify the contrastive divergence (CD) algorithm in two ways: 1) adding variance-dependent initial step sizes for negative sampling; 2) drawing initial negative samples from Gaussian noise. We show that this modified CD, along with gradient clipping, is enough to robustly train GRBMs with large learning rates, thus removing the need for various tricks in the literature. Moreover, it enables GRBMs to generate samples starting from noise, thus allowing direct comparisons with deep generative models and improving evaluation protocols in the RBM literature. Experiments on Gaussian Mixtures, MNIST, FashionMNIST, and CelebA show GRBMs can generate good samples, despite their single-hidden-layer architecture. Our code is released: https://github.com/lrjconan/GRBM.
The recipe behind the success of deep learning has been the combination of neural networks and gradient-based optimization. Understanding the behavior of gradient descent, however, and particularly its instability, has lagged behind its empirical success. To add to the theoretical tools available to study gradient descent, we propose the principal flow (PF), a continuous-time flow that approximates gradient descent dynamics. To our knowledge, the PF is the only continuous flow that captures the divergent and oscillatory behaviors of gradient descent, including escaping local minima and saddle points. Through its dependence on the eigendecomposition of the Hessian, the PF sheds light on the recently observed edge-of-stability phenomenon in deep learning. Using our new understanding of instability, we propose a learning rate adaptation method which enables us to control the trade-off between training stability and test set evaluation performance.
Conventionally, the stability of stochastic gradient descent (SGD) is understood through a linear stability analysis, where the mean and variance of the parameter or the gradients are examined to determine the stability of SGD close to a stationary point. In this seminar, we discuss the limitations of linear stability theories and motivate a new notion of stability, which we call the probabilistic stability. We first explain why this notion of stability is especially suitable for understanding SGD at a large learning rate and a small batch size in toy problems. Then, with this new notion of stability, we study the implicit bias of SGD and show that SGD at a large learning rate converges to low-rank saddles in matrix factorization problems.
The talk is mainly based on the following two works:
[1] Liu Ziyin, Botao Li, James B. Simon, Masahito Ueda. SGD with a Constant Large Learning Rate Can Converge to Local Maxima. ICLR 2022.
[2] The Probabilistic Stability of SGD. (tentative title, in preparation)
Ordinary differential equations (ODEs) are widely used in modeling biological and physical processes in science. In this talk, I will discuss a new reproducing kernel-based approach for estimation and inference of ODEs given noisy observations. We do not assume the functional forms in the ODE to be known, or restrict them to be linear or additive, and we allow pairwise interactions. We construct confidence intervals for the estimated signal trajectories, and establish the estimation optimality and selection consistency of kernel ODE under both the low-dimensional and high-dimensional settings, where the number of unknown functionals can be smaller or larger than the sample size. Our proposal builds upon the smoothing spline analysis of variance (SS-ANOVA) framework, but tackles several important problems that are not yet fully addressed, and thus extends the scope of the existing SS-ANOVA framework. Our proposal is also advantageous, in terms of statistical inference with noisy observations, compared to the existing physics-informed neural networks and sparsity-based methods.
Existing analyses of neural network training often operate under the unrealistic assumption of an extremely small learning rate. This lies in stark contrast to practical wisdom and empirical studies, such as the work of J. Cohen et al. (ICLR 2021), which exhibit startling new phenomena (the "edge of stability" or "unstable convergence") and potential benefits for generalization in the large learning rate regime. Despite a flurry of recent works on this topic, however, the latter effect is still poorly understood. In this paper, we take a step towards understanding genuinely non-convex training dynamics with large learning rates by performing a detailed analysis of gradient descent for simplified models of two-layer neural networks. For these models, we provably establish the edge of stability phenomenon and discover a sharp phase transition for the step size below which the neural network fails to learn "threshold-like" neurons (i.e., neurons with a non-zero first-layer bias). This elucidates one possible mechanism by which the edge of stability can in fact lead to better generalization, as threshold neurons are basic building blocks with useful inductive bias for many tasks.
The MBO scheme is a highly performant scheme used for data clustering. Given some data, one constructs the similarity graph associated to the data points. The goal is to split the data into meaningful clusters. The algorithm produces the clusters by alternating between diffusion on the graph and pointwise thresholding. In this talk I will present the first theoretical studies of the scheme in the large data limit. We will see how the final state of the algorithm is asymptotically related to minimal surfaces in the data manifold and how the dynamics of the scheme is asymptotically related to the trajectory of steepest descent for surfaces, which is mean curvature flow. The tools employed are variational methods and viscosity solutions techniques. Based on joint work with Tim Laux (U Bonn).
Modern machine learning methods, in particular deep learning approaches, have enjoyed unparalleled success in a variety of challenging application fields like image recognition, medical image reconstruction, and natural language processing. While a vast majority of previous research in machine learning mainly focused on constructing and understanding models with high predictive power, consensus has emerged that other properties like stability and robustness of models are of equal importance and in many applications essential. This has motivated researchers to investigate the problem of adversarial training (or how to make models robust to adversarial attacks), but despite the development of several computational strategies for adversarial training and some theoretical development in the broader distributionally robust optimization literature, there are still several theoretical questions about it that remain relatively unexplored. In this talk, I will take an analytical perspective on the adversarial robustness problem and explore three questions: 1) What is the connection between adversarial robustness and inverse problems? 2) Can we use analytical tools to find lower bounds for adversarial robustness problems? 3) How do we use modern tools from analysis and geometry to solve adversarial robustness problems? At its heart, this talk is an invitation to view adversarial machine learning through the lens of mathematical analysis, showcasing a variety of rich connections with perimeter minimization problems, optimal transport, mean field PDEs of interacting particle systems, and min-max games in spaces of measures.
Empirical observation of high dimensional phenomena, such as the double descent behavior, has attracted a lot of interest in understanding classical techniques such as kernel methods, and their implications to explain generalization properties of neural networks that operate close to kernel regime. Many recent works analyze such models in a certain high-dimensional regime where the covariates are generated by independent sub-Gaussian random variables transformed through a covariance matrix and the number of samples and the number of covariates grow at a fixed ratio (i.e. proportional asymptotics). In this work we show that for a large class of kernels, including the neural tangent kernel of fully connected networks, kernel methods can only perform as well as linear models in this regime.
More surprisingly, when the data is generated by a Gaussian process model where the relationship between input and the response could be very nonlinear, we show that linear models are in fact optimal, i.e. linear models achieve the minimum risk among all models, linear or nonlinear.
These results suggest that more complex models for the data other than independent features are needed for high-dimensional analysis.
Large-scale optimization is at the forefront of modern data science, scientific computing, and applied mathematics, with areas of interest including high-dimensional PDEs, inverse problems, and machine learning. First-order methods are workhorses for large-scale optimization due to their modest computational cost and simplicity of implementation. Nevertheless, these methods are often agnostic to the structural properties of the problem under consideration and suffer from slow convergence, becoming trapped in bad local minima, etc. Natural gradient descent is an acceleration technique in optimization that takes advantage of the problem's geometric structure and preconditions the objective function's gradient by a suitable "natural" metric. Hence, parameter update directions correspond to steepest descent on a corresponding "natural" manifold instead of the Euclidean parameter space, yielding a parametrization-invariant descent direction on that manifold. Despite its success in statistical inference and machine learning, the natural gradient descent method is far from a mainstream computational technique due to the computational complexity of calculating and inverting the preconditioning matrix. This work aims at a unified computational framework for streamlining the computation of a general natural gradient flow via the systematic application of efficient tools from numerical linear algebra. We obtain efficient and robust numerical methods for natural gradient flows without directly calculating, storing, or inverting the dense preconditioning matrix. We treat Euclidean, Wasserstein, Sobolev, and Fisher–Rao natural gradients in a single framework for a general loss function.
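For context, the basic natural gradient update referred to above reads as follows (schematically; the particular metrics and the matrix-free implementation are the subject of the work).

\[
\theta_{k+1} \;=\; \theta_k \;-\; \eta\, G(\theta_k)^{-1} \nabla_\theta L(\theta_k),
\]

where $G(\theta)$ is the metric tensor induced by the chosen geometry (Euclidean, Wasserstein, Sobolev, or Fisher–Rao) pulled back to parameter space; the framework described above avoids forming, storing, or inverting $G$ explicitly.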
Neural networks are powerful feature extractors - but which features do they extract from their data? And how does the structure of the training data shape the representations they learn? We discuss two recent works in this direction. We first develop a simple model for images and show that a neural network trained on these images can learn a convolution from scratch [1]. This pattern-formation process is driven by a combination of translation-invariance of the "images" and the non-Gaussian, higher-order statistics of the inputs. Second, we conjecture a "distributional simplicity bias" whereby neural networks learn increasingly complex distributions of their inputs during training. We present analytical and experimental evidence for this conjecture, going from a simple perceptron up to deep ResNets [2]. References: [1] Ingrosso & Goldt, PNAS 119 (40), e2201854119, arXiv:2202.00565. [2] Refinetti, Ingrosso & Goldt, in preparation.
We study the optimization landscape of deep linear neural networks with the square loss. It is known that, under weak assumptions, there are no spurious local minima and no local maxima. However, the existence and diversity of non-strict saddle points, which can play a role in first-order algorithms' dynamics, have only been lightly studied. We go a step further with a full analysis of the optimization landscape at order 2. We characterize, among all critical points, which are global minimizers, strict saddle points, and non-strict saddle points. We enumerate all the associated critical values. The characterization involves conditions on the ranks of partial matrix products, and sheds some light on the global convergence and implicit regularization results that have been proved or observed when optimizing linear neural networks. In passing, we provide an explicit parameterization of the set of all global minimizers and exhibit large sets of strict and non-strict saddle points.
We will study the dynamics of gradient flow when training a depth-2 ReLU network on univariate data in a binary classification setting. Assuming our data is labelled by a teacher network of a given width, we will show a convergence guarantee that holds if the learned student network is sufficiently over-parameterized. Moreover, we will show that the implicit bias in our setting is such that the student network converges to a network with a bounded number of linear regions, which results in an intuitive generalization bound. Overall, this provides an end-to-end learnability result in this setting, demonstrating the interplay between optimization and generalization: over-parameterization facilitates optimization, while the implicit bias prevents overfitting.
Understanding the fundamental principles behind the massive success of neural networks is one of the most important open questions in deep learning. However, due to the highly complex nature of the problem, progress has been relatively slow. In this note, through the lens of infinite-width networks, a.k.a. neural kernels, we present one such principle resulting from hierarchical localities. It is well-known that the eigenstructure of infinite-width multilayer perceptrons (MLPs) depends solely on the concept frequency, which measures the order of interactions. We show that the topologies from deep convolutional networks (CNNs) restructure the associated eigenspaces into finer subspaces. In addition to frequency, the new structure also depends on the concept space, which measures the spatial distance among nonlinear interaction terms. The resulting fine-grained eigenstructure dramatically improves the network's learnability, empowering it to simultaneously model a much richer class of interactions, including Long-Range-Low-Frequency interactions, Short-Range-High-Frequency interactions, and various interpolations and extrapolations in-between. Additionally, model scaling can improve the resolutions of interpolations and extrapolations and, therefore, the network's learnability. Finally, we prove a sharp characterization of the generalization error for infinite-width CNNs of any depth in the high-dimensional setting. Two corollaries follow: (1) infinite-width deep CNNs can break the curse of dimensionality without losing their expressivity, and (2) scaling improves performance in both the finite and infinite data regimes.
Given one optimization problem, consider pairing it with another by smoothly parametrizing the domain. This is done either for practical purposes (e.g., to use smooth optimization algorithms with good guarantees) or for theoretical purposes (e.g., to reveal that the landscape satisfies a strict saddle property). In both cases, the central question is: how do the landscapes of the two problems relate? Surprisingly, the relation is often determined by the parametrization itself; it is almost entirely independent of the cost function. In this talk, I will present a new geometric framework for studying parametrizations according to their effect on landscapes. Applications include: optimization over low-rank matrices and tensors by optimizing over a factorization; the Burer-Monteiro approach to semidefinite programs; training neural networks by optimizing over their weights and biases; and quotienting out symmetries. This is joint work with Eitan Levin (Caltech) and Nicolas Boumal (EPFL).
Modern neural networks are usually over-parameterized—the number of parameters exceeds the number of training data. In this case the loss functions tend to have many (or even infinite) global minima, which imposes an additional challenge of minima selection on optimization algorithms besides the convergence. Specifically, when training a neural network, the algorithm not only has to find a global minimum, but also needs to select minima with good generalization among many other bad ones. In this talk, I will share a series of works studying the mechanisms that facilitate global minima selection of optimization algorithms. First, with a linear stability theory, we show that stochastic gradient descent (SGD) favors flat and uniform global minima. Then, we build a theoretical connection of flatness and generalization performance based on a special structure of neural networks. Next, we study the global minima selection dynamics—the process that an optimizer leaves bad minima for good ones—in two settings. For a manifold of minima around which the loss function grows quadratically, we derive effective exploration dynamics on the manifold for SGD and Adam, using a quasistatic approach. For a manifold of minima around which the loss function grows subquadratically, we study the behavior and effective dynamics for GD, which also explains the edge of stability phenomenon.
Dynamic processes have been modeled successfully for hundreds of years, often using ordinary or partial differential equations. Using data-driven methods, these processes can now also be inferred directly from measurements. In this talk, I will provide an overview of our work in this direction. I will discuss learning differential equations on reduced spaces, how to utilize numerical integration schemes to train neural networks for stochastic dynamics, and close with an alternative view on system identification with the Koopman operator framework.
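As one concrete instance of the data-driven viewpoint above, a minimal least-squares (DMD-style) approximation of the Koopman operator restricted to a chosen set of observables is sketched below; it is an illustrative baseline, not the specific methods from the talk.

```python
import numpy as np

def koopman_dmd(Psi, Psi_next, rank=None):
    """Least-squares fit of K minimizing ||Psi_next - K @ Psi||_F, where the columns of
    Psi and Psi_next are observables evaluated at consecutive snapshots of the system.
    Optionally truncate K to a given rank. Illustrative sketch only."""
    K = Psi_next @ np.linalg.pinv(Psi)
    if rank is not None:
        U, s, Vt = np.linalg.svd(K)
        K = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # keep the leading singular triplets
    return K
```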
Deep learning comes with excessive demands for data. In this talk, I will present my recent work on showing that not all data is necessary for training an accurate predictor. In particular, one can drop “easy-to-learn” examples, and do just as well as learning on all of the data. Given this disparate “importance” of training data on generalization, I will present empirical analysis of the loss landscape derived from different subsets of the training examples. I will then look into how the training dynamics are influenced by easy versus hard data.
Wasserstein GANs (WGANs) are based on the idea of minimising the Wasserstein distance between a real and a generated distribution. In this talk, we provide an in-depth mathematical analysis of differences between the theoretical setup and the reality of training WGANs. We gather both theoretical and empirical evidence that the WGAN loss is not a meaningful approximation of the Wasserstein distance. Moreover, we argue that the Wasserstein distance is not even a desirable loss function for deep generative models, and conclude that the success of WGANs can be attributed to a failure to approximate the Wasserstein distance.
This talk will discuss two aspects of optimization geared towards improving generalization in Deep Learning, involving: i) the regularization effect of SGD, and ii) model averaging. The first part of the talk will describe a mechanism that captures the implicit *Regularization Effect of SGD*. We show that it depends on the Fisher Information Matrix and replicate SGD’s implicit regularization effect with an explicit form. We will then delve into OOD generalization, where we demonstrate that the difference in performance of deep models on in-domain validation sets and distribution-shifted test sets varies in a chaotic manner during training using ERM. This leads to poor model selection when using in-domain validation data for early stopping, and makes the trained model unreliable. We discuss this problem and propose *Model Averaging* as a mitigation approach, leading to state-of-the-art performance on the DomainBed benchmark. I will conclude with a preview of my future research directions in the broad area of OOD generalization. Bio: I am currently a senior research scientist at Salesforce AI Research. Prior to this, I was a postdoc at Mila with Prof. Yoshua Bengio. I received my PhD and M.S. degrees in Computer Science and Engineering from University at Buffalo, and a B.S. degree in Electrical Engineering from IIT-BHU, India. I am interested in representation learning analysis and algorithms for supervised/self-supervised learning, generative modeling, and more recently out-of-distribution generalization. My interest also lies at the intersection of optimization and generalization in deep learning, specifically, aspects of deep learning optimization that improve generalization. I have recently also worked on time series analytics, in which I designed deep learning algorithms for forecasting and anomaly detection. I am currently developing a toolbox for causal discovery for tabular and time series data, as well as conducting research in this area.
Many applications involve non-Euclidean data, such as graphs, strings, or matrices. Exploiting such geometric structure in optimization can deliver algorithms that are computationally superior to standard nonlinear programming approaches. This observation has resulted in an increasing interest in Riemannian methods in the optimization and machine learning community. In this talk, we consider the problem of optimizing a function on a (Riemannian) manifold subject to convex constraints. Several classical problems that arise in Machine Learning applications can be phrased as constrained optimization on manifolds. This includes matrix-valued tasks, such as barycenter problems, as well as the computation of Brascamp-Lieb constants. The latter is of central importance in many areas of mathematics and computer science through connections to maximum likelihood estimators in Gaussian models, Tyler’s M-estimator of scatter matrices and Operator Scaling. We introduce Riemannian Frank-Wolfe methods, a class of first-order methods for solving constrained optimization problems on manifolds and present a global, non-asymptotic convergence analysis. We further discuss a class of CCCP-style algorithms for Riemannian “difference of convex” functions and explore connections to constrained optimization. For both methods we discuss applications to the two problems described above. Based on joint work with Suvrit Sra.
We introduce and explore the problem of Selectively Forgetting a particular subset of the data used for training a deep neural network. We propose an initial method for "scrubbing" the weights clean by minimizing the Forgetting Lagrangian, such that any probing function of the weights is indistinguishable from the same function applied to the weights of a network trained without the data to be forgotten. This method is improved upon by using a deterministic update of the linearized version of the model inspired by NTKs, which enables us to bound the extracted information in a black-box setting. This inspired us to propose Linear Quadratic Fine-tuning (LQF), which is the first method for linearizing a pre-trained model which achieves comparable performance to non-linear fine-tuning. LQF allows us to exploit the strength of deep neural networks while enjoying the theoretical properties of convex optimization. Using LQF, we introduce a novel notion of forgetting in the Mixed-Privacy setting which enables us to perform forgetting for large scale vision tasks, while providing theoretical guarantees. We further extend this mixed privacy setting to Differential Privacy (DP) and introduce AdaMix, an adaptive DP algorithm which exploits few-shot public data to improve the privacy trade-off for practical vision tasks.
In this talk I will discuss two problems that Message Passing Neural Networks (MPNNs) have been shown to struggle with. The first one -- known as over-squashing -- is unavoidable in the MPNN class and concerns the input graph topology. This relates to how information propagates in a graph. We show that discrete curvature quantities (old and new) can help us understand where messages are being lost, and we can provably characterize the over-squashing phenomenon in terms of curvature. The second problem consists of analysing GNNs as multi-particle dynamics through the lens of gradient flows of an energy. We investigate what happens when, instead of learning the MPNN equations, we learn an energy and then let the equations follow the gradient flow of this energy. This allows us to further understand the role of the channel-mixing matrix, ubiquitous in standard graph convolutional models, as a bilinear potential inducing both attraction and repulsion along edges via its positive and negative eigenvalues, respectively.
Understanding how deep neural networks generalize remains notoriously challenging in theory. This talk will motivate and examine a simpler question, that is, the generalization of high-dimensional linear regression models, using several empirical high-performing real-world neural-network-induced settings as a testbed (e.g. the empirical NTK of a pretrained ResNet applied to CIFAR-100). We find that, perhaps surprisingly, even in these linear settings, most existing theoretical analyses for linear/kernel regression fail to qualitatively capture the empirical generalization phenomena. On the other hand, a random matrix theory hypothesis gives rise to an estimator that accurately predicts generalization. Based on https://arxiv.org/abs/2203.06176 (ICML 2022). Bio: Wei Hu is a postdoc at UC Berkeley and an incoming assistant professor at the University of Michigan. He obtained his Ph.D. from Princeton University advised by Sanjeev Arora, and his bachelor's degree from Tsinghua University. He is interested in the theoretical and scientific foundations of modern machine learning, with a particular focus on deep learning.
Kernel ridge regression (KRR) has recently attracted renewed interest due to its potential for explaining the transient effects, such as double descent, that emerge during neural network training. In this work, we study how the alignment between the target function and the kernel affects the performance of the KRR. We focus on the truncated KRR (TKRR) which utilizes an additional parameter that controls the spectral truncation of the kernel matrix. We show that for polynomial alignment, there is an over-aligned regime, in which TKRR can achieve a faster rate than what is achievable by full KRR. The rate of TKRR can improve all the way to the parametric rate, while that of full KRR is capped at a sub-optimal value. This shows that target alignment can be better leveraged by utilizing spectral truncation in kernel methods. We also consider the bandlimited alignment setting and show that the regularization surface of TKRR can exhibit transient effects including multiple descent and non-monotonic behavior. Our results show that there is a strong and quantifiable relation between the shape of the alignment spectrum and the generalization performance of kernel methods, both in terms of rates and in finite samples.
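For concreteness, a minimal version of the spectral truncation described above is sketched below: the kernel matrix is projected onto its top-r eigenpairs before the ridge system is solved (the exact TKRR estimator and its parameterization in the paper may differ).

```python
import numpy as np

def truncated_krr_predict(K_train, y, K_test_train, lam, r):
    """Kernel ridge regression using only the top-r eigenpairs of the kernel matrix.
    r = n recovers full KRR; smaller r is the spectral truncation. Illustrative sketch."""
    evals, evecs = np.linalg.eigh(K_train)                 # eigenvalues in ascending order
    evals, evecs = evals[::-1][:r], evecs[:, ::-1][:, :r]  # keep the top-r eigenpairs
    alpha = evecs @ ((evecs.T @ y) / (evals + lam))        # truncated (K + lam I)^{-1} y
    return K_test_train @ alpha
```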
Over-parameterized machine learning models lead to excellent performance in a variety of applications. In this talk we consider the effect of over-parameterization on stochastic optimization algorithms. We discuss how over-parameterization allows us to use a constant step size within stochastic gradient methods, and that this leads to a faster convergence rate. We also present algorithms with provably-faster convergence rates in the over-parameterized setting. Finally, we discuss how over-parameterization allows us to update the learning rate during the training procedure which leads to improved performance over a variety of previous approaches.
While over-parameterization is widely believed to be crucial for the success of optimization for the neural networks, most existing theories on over-parameterization do not fully explain the reason -- they either work in the Neural Tangent Kernel regime where neurons don't move much, or require an enormous number of neurons. In this talk I will describe our recent works towards understanding training dynamics that go beyond kernel regimes with only polynomially many neurons (mildly overparametrized). In particular, we first give a local convergence result for mildly overparametrized two-layer networks. We then analyze the global training dynamics for a related overparametrized tensor model. For both works, we rely on a key intuition that neurons in overparametrized models work in groups and it's important to understand the behavior of an average neuron in the group. Based on two works: https://arxiv.org/abs/2102.02410 and https://arxiv.org/abs/2106.06573
In this talk we discuss almost sure convergence of Stochastic Gradient Descent in discrete and continuous time for a given twice continuously-differentiable target function F. In a first step, we give assumptions on the step-sizes and perturbation size to ensure convergence of the target value F and the gradient of F, assuming that grad F is locally Hölder-continuous. This result entails convergence of the iterates themselves in the case where F does not possess a continuum of critical points. In a general non-convex setting with F possibly containing a rich set of critical points, convergence of the process itself is sometimes taken for granted, but actually is a non-trivial issue, as there are solutions to the gradient flow ODE for smooth target functions that stay in a compact set but do not converge. Using the Łojasiewicz inequality, we give sharp bounds on the step-sizes and the size of the perturbation in order to guarantee convergence of the SGD scheme for analytic target functions. Also, we derive the convergence rate of the function value under the assumption that F satisfies a particular Łojasiewicz inequality with exponent in [1/2,1). Finally, we compare the discrete and continuous time results and discuss optimality of the assumptions. This is joint work with Steffen Dereich (WWU Münster).
There is an increasing realization that algorithmic inductive biases are central in preventing overfitting; empirically, we often see that machine learning models trained by stochastic gradient descent (SGD) generalize well, even in the overparameterized settings and under little to no explicit regularization. This work considers this issue in arguably the most basic setting: averaged or last-iterate SGD for linear regression in the overparameterized regime. Our main result provides a sharp excess risk bound, stated in terms of the full eigenspectrum of the data covariance matrix, that reveals a bias-variance decomposition characterizing when generalization is possible: (i) the variance bound is characterized in terms of an effective dimension (specific for SGD) and (ii) the bias bound is characterized in terms of the optimal model parameter and how it aligns with the data covariance matrix. Moreover, for SGD with iterate averaging, we prove a matching lower bound (up to constant factors); for last iterate SGD (with decaying stepsizes), we provide a matching lower bound (up to constant factors) when the signal-to-noise ratio is bounded. The above results have the following additional implications: (1) We compare the implicit regularization afforded by SGD with the explicit regularization of ridge regression. We find that, up to the logarithmic factors, the generalization performance of SGD is always no worse than that of ridge regression in a wide range of overparameterized problems, and, in fact, could be much better for some problem instances. (2) We study the effects of transfer learning with pretraining and finetuning in the presence of covariate shift. We find that for a large class of linear regression instances, transfer learning with O(N^2) source data (and scarce or no target data) is as effective as supervised learning with N target data. References: https://arxiv.org/abs/2103.12692, https://arxiv.org/abs/2108.04552, https://arxiv.org/abs/2110.06198
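A minimal sketch of the estimator analysed above, constant-step-size single-pass SGD with (tail) iterate averaging for overparameterized least squares, is given below; the tail fraction and stepsize choice are illustrative assumptions rather than the settings from the papers.

```python
import numpy as np

def tail_averaged_sgd(X, y, lr, rng=None):
    """Single-pass constant-step-size SGD for least squares, returning the average of
    the second half of the iterates. Illustrative sketch of averaged SGD."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, d = X.shape
    w = np.zeros(d)
    tail_sum = np.zeros(d)
    tail_start = n // 2
    for t, i in enumerate(rng.permutation(n)):
        w = w - lr * (X[i] @ w - y[i]) * X[i]   # stochastic gradient step
        if t >= tail_start:
            tail_sum += w                        # accumulate iterates in the tail
    return tail_sum / (n - tail_start)
```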
Models built with deep neural networks (DNNs) can handle complicated real-world data extremely well, without suffering from the curse of dimensionality or the non-convexity of the underlying optimization. To contribute to the theoretical understanding of deep learning, we will investigate the nonparametric aspects of DNNs by addressing the following questions: (i) what kind of data can be best learned by deep neural networks? (ii) can deep neural networks achieve the statistical optimality? (iii) is there any algorithmic guarantee for obtaining such optimal neural networks? Our theoretical analysis applies to the two most fundamental setups in practice: regression and classification.
Designing deep learning models for practical applications, including those in machine learning, is an art that often involves an expensive search over candidate architectures. We develop novel frameworks to facilitate the process of designing deep learning models via three principled approaches: optimization, numerical analysis, and statistical modeling. First, using optimization as a tool, we develop new families of momentum-based deep learning models that take advantage of momentum methods to improve the convergence speed and ability to capture long-range dependencies in models, including deep neural networks and neural ordinary differential equations. Second, we employ numerical methods such as diffusion with a source term to develop new graph neural networks with better accuracy and efficiency. Finally, we develop new generative models that help explain the attention mechanism in transformers and suggest new directions to improve the performance of these models.
Modern machine learning models, particularly those used in deep networks, are characterized by massive numbers of parameters trained on large data sets. While these large-scale models have had tremendous practical successes, developing theoretical methods that can rigorously explain when and why these models work has been a major issue in the field. This task is made even harder by the non-convexity of the underlying learning problems. In this talk, we shed light on the theoretical understanding of the asymptotics of learning for two popular neural network models, namely, Generalized Linear Models (GLMs) and Recurrent Neural Networks (RNNs). First, we investigate the generalizability of single-layer neural networks (i.e., GLMs) over previously unseen data. We provide a general framework to characterize the asymptotic generalization error for GLMs with arbitrary non-linearities, making it applicable to regression as well as classification problems. This framework enables analyzing the effect of (i) over-parameterization and non-linearity during modeling; (ii) choices of the loss function, initialization, and regularizer during learning; and (iii) mismatch between training and test distributions. We also rigorously and analytically explain the double descent phenomenon in generalized linear models. Second, we explore the learning dynamics of recurrent neural networks under gradient descent. We focus on a subclass of RNNs with linear activations and provide precise reasoning for the common knowledge suggesting that RNNs do not perform well on tasks requiring long-term memory. Using recently-developed kernel regime analysis, our main result shows that linear RNNs learned from random initializations are functionally equivalent to a certain weighted 1D-convolutional network. Importantly, the weightings in the equivalent model cause an implicit bias towards elements with smaller time lags in the convolution, and hence shorter memory. We show that the degree of this bias depends on the variance of the transition matrix at initialization. Our theories are validated with both synthetic and real data experiments.
We examine two data science questions motivated by two biological problems: (1) modeling time-varying microbial communities and (2) modeling shapes and surfaces for evolutionary applications. Question 1: Under what conditions are Bayesian methods for inference in deterministic dynamical systems with observation noise consistent? We will highlight the relation between Bayesian inference and a classical idea in dynamical systems called the thermodynamic formalism. We will provide a partial answer to Question 1 based on a variational characterization of a partition function. Question 2: Given a set of shapes, the synchronization problem is to consistently register, or align, the shapes. We develop a geometric framework that characterizes the synchronization problem, using classic tools from the theory of fibre bundles. We then generalize the graph-theoretic notion of a Laplacian to quantify obstructions to synchronization; specifically, we develop a twisted Hodge theory. We then state an algorithm that learns group actions and demonstrate its efficacy by clustering the molars of primates based on their shape. We will see that the clusters correspond to eating habits.
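The twisted Hodge machinery is specific to this work, but the flavor of the synchronization problem can be conveyed by the classical spectral approach for angles on the circle. The sketch below is only a toy instance under that simpler setting, not the fibre-bundle framework of the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
theta = rng.uniform(0, 2 * np.pi, n)          # ground-truth "poses" to synchronize
z = np.exp(1j * theta)

# noisy pairwise relative measurements H_ij ~ z_i * conj(z_j)
noise = 0.3 * (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))
H = np.outer(z, z.conj()) + (noise + noise.conj().T) / 2  # keep H Hermitian

# spectral synchronization: top eigenvector of H estimates z up to a global phase
eigvals, eigvecs = np.linalg.eigh(H)
z_hat = eigvecs[:, -1]
z_hat = z_hat / np.abs(z_hat)                 # project each entry back to the unit circle

# align the unknown global phase and measure the recovery error
phase = (z_hat.conj() @ z) / np.abs(z_hat.conj() @ z)
err = np.linalg.norm(z_hat * phase - z) / np.sqrt(n)
print(f"relative recovery error: {err:.3f}")
```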
Modern neural networks often have great expressive power and can be trained to overfit the training data, while still achieving good test performance. This phenomenon is referred to as “benign overfitting”. Recently, a line of work has emerged studying “benign overfitting” from the theoretical perspective. However, these works are limited to linear models or kernel/random feature models, and there is still a lack of theoretical understanding of when and how benign overfitting occurs in neural networks. In this talk, I will give an answer to this question. We precisely characterize the conditions under which benign overfitting can occur in training two-layer convolutional neural networks. We show that when the signal-to-noise ratio satisfies a certain condition, a two-layer CNN trained by gradient descent can achieve arbitrarily small training and test loss. On the other hand, when this condition does not hold, overfitting becomes harmful and the obtained CNN can only achieve a constant level of test loss. Together, these results demonstrate a sharp phase transition between benign and harmful overfitting, driven by the signal-to-noise ratio. This talk is based on joint work with Yuan Cao, Mikhail Belkin, and Quanquan Gu.
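To fix ideas, here is a minimal sketch of the kind of synthetic signal-plus-noise patch data one could use to probe the signal-to-noise regimes described above; the function name and parameters are hypothetical, and the precise data model and CNN analyzed in the paper may differ.

```python
import numpy as np

def make_patch_data(n, d, P, snr, rng):
    """Each example is P patches of dimension d: one patch carries the label
    signal y * snr * mu, the remaining patches are pure Gaussian noise.
    Illustrative only; not the exact model from the paper."""
    mu = np.zeros(d); mu[0] = 1.0               # fixed unit signal direction
    y = rng.choice([-1.0, 1.0], size=n)         # balanced labels
    X = rng.standard_normal((n, P, d))          # noise patches
    signal_pos = rng.integers(0, P, size=n)     # which patch holds the signal
    X[np.arange(n), signal_pos, :] = y[:, None] * snr * mu
    return X, y

rng = np.random.default_rng(0)
X, y = make_patch_data(n=100, d=50, P=4, snr=2.0, rng=rng)
print(X.shape, y[:5])
```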
The mysterious ability of neural networks to generalize is believed to stem from an implicit regularization — a tendency of gradient-based optimization to fit training data with predictors of low “complexity.” Despite vast efforts, a satisfying formalization of this intuition is lacking. In this talk I will present a series of works theoretically analyzing the implicit regularization in matrix and tensor factorizations, known to be equivalent to certain linear and non-linear neural networks, respectively. Through dynamical characterizations I will establish an implicit regularization towards low rank (for corresponding notions of rank), different from any type of norm minimization, in contrast to prior beliefs. I will then discuss implications of this finding to both theory (possible explanation for generalization over natural data) and practice (compression of neural network layers, novel regularization schemes). Overall, our results highlight the potential of ranks to explain and improve generalization in deep learning. Works covered in this talk were done in collaboration with Asaf Maman and Nadav Cohen.
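As a small numerical illustration of the low-rank tendency in matrix factorization, the sketch below runs gradient descent on a depth-three factorization with a matrix-completion loss and prints the singular values of the learned product; with a small initialization the effective rank of the product tends to stay low. The depth, step size, initialization scale, and normalization are ad hoc choices for the sketch, not the settings analyzed in the talk.

```python
import torch

torch.manual_seed(0)
n, r = 20, 3
target = torch.randn(n, r) @ torch.randn(r, n)             # low-rank ground truth
target = 3.0 * target / torch.linalg.svdvals(target)[0]    # normalize scale for stable steps
mask = (torch.rand(n, n) < 0.5).float()                    # observed entries (matrix completion)

depth, scale = 3, 0.05                                     # ad hoc choices for illustration
Ws = [(scale * torch.randn(n, n)).requires_grad_() for _ in range(depth)]

opt = torch.optim.SGD(Ws, lr=0.05)
for step in range(5000):
    product = Ws[2] @ Ws[1] @ Ws[0]
    loss = ((mask * (product - target)) ** 2).sum()        # fit only the observed entries
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    svals = torch.linalg.svdvals(Ws[2] @ Ws[1] @ Ws[0])
print("observed-entry loss:", float(loss))
print("leading singular values of the learned matrix:", [round(float(s), 3) for s in svals[:6]])
```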
In this talk, I will present some implications of the permutation symmetry in neural networks for the shape of the loss landscape. I will first introduce a type of saddle point, the so-called symmetry-induced saddles, which emerge from particular arrangements of neurons in deep neural networks. Then we will describe the precise geometry of the global minima manifold of overparameterized networks in a teacher-student setting. Counting the possible arrangements of neuron groups inside a neural network, we will give the numbers of symmetry-induced saddle manifolds and of the components of the global minima manifold in terms of the student and teacher widths. Our analysis shows that overparameterization gradually smoothens the landscape, due to a faster scaling of the global minima manifold components than of the symmetry-induced saddle manifolds. Yet the landscape exhibits roughness in mildly overparameterized networks; we empirically show that gradient-based training finds a zero-loss solution only for a fraction of initializations in this regime.
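The permutation symmetry underlying these saddle and minima manifolds is easy to verify directly: permuting the hidden units of a two-layer network, together with their outgoing weights, leaves the network function unchanged, as in the minimal check below (arbitrary sizes and weights).

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 10, 6                       # input dimension, hidden width
W = rng.standard_normal((m, d))    # input-to-hidden weights
a = rng.standard_normal(m)         # hidden-to-output weights
x = rng.standard_normal(d)

def f(W, a, x):
    return a @ np.tanh(W @ x)      # two-layer network output

perm = rng.permutation(m)          # any permutation of the hidden units
print(f(W, a, x), f(W[perm], a[perm], x))   # identical outputs
# Every parameter vector therefore lies on an orbit of up to m! equivalent points,
# which is what generates symmetry-induced saddles and multiple global-minima components.
```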
In the era of big data, the training data for machine learning models is commonly collected from multiple sources. Some of these sources might be unreliable (noisy, corrupted, or even manipulated). Can learning algorithms overcome this and still learn classifiers of optimal accuracy and, ideally, fairness? In my talk, I highlight recent results from our group that establish situations in which this is possible or impossible.
In this talk, I will discuss a remarkable phenomenon of wide neural networks: the transition to linearity. The phenomenon can be described as follows: as the network width increases to infinity, the neural network becomes a linear model with respect to its parameters in any O(1)-neighborhood of the random initialization. This phenomenon underlies the widely known constancy of the neural tangent kernel (NTK) of certain infinitely wide neural networks. Aiming to give a more intuitive understanding, I will describe, from a geometric point of view, how this somewhat unexpected phenomenon, as well as the constancy of the NTK, arises as a natural consequence of assembling a large number of submodels.
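A quick way to see the transition to linearity numerically is to compare a two-layer network with its first-order Taylor expansion at initialization, for a fixed-norm parameter perturbation, as the width grows; the linearization error should shrink with width. The 1/sqrt(m) output scaling below is one standard parameterization, chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20
x = rng.standard_normal(d) / np.sqrt(d)

def net_and_grad(W, a, x, m):
    """f(x) = (1/sqrt(m)) * a @ tanh(W x), plus its gradient w.r.t. (W, a)."""
    h = np.tanh(W @ x)
    f = a @ h / np.sqrt(m)
    grad_a = h / np.sqrt(m)
    grad_W = (a * (1 - h ** 2))[:, None] * x[None, :] / np.sqrt(m)
    return f, grad_W, grad_a

for m in [10, 100, 1000, 10000]:
    W0 = rng.standard_normal((m, d))
    a0 = rng.standard_normal(m)
    f0, gW, ga = net_and_grad(W0, a0, x, m)

    # unit-norm (O(1)) perturbation of all parameters
    dW = rng.standard_normal((m, d)); da = rng.standard_normal(m)
    norm = np.sqrt((dW ** 2).sum() + (da ** 2).sum())
    dW, da = dW / norm, da / norm

    f1, _, _ = net_and_grad(W0 + dW, a0 + da, x, m)
    f_lin = f0 + (gW * dW).sum() + ga @ da      # first-order Taylor prediction
    print(m, abs(f1 - f_lin))                   # linearization error shrinks with width
```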
Reinforcement learning (RL) is notoriously sample inefficient, prompting a large body of prior work to study how unsupervised pretraining can improve sample efficiency when solving downstream RL tasks. One approach is unsupervised skill discovery, a class of algorithms that learn a set of policies without access to a reward function. While prior work has shown that these methods learn skills that can accelerate downstream RL tasks, it remains unclear whether these methods always learn useful skills. What does it even mean for a skill (or set of skills) to be provably useful? In this talk, I'll share some recent work that provides some of the first answers to this question: existing methods are optimal in one sense but very suboptimal in a different sense. These results have implications for how users might want to use skill learning algorithms, provide some (surprisingly simple!) tools for analysing skill learning methods, and suggest exciting opportunities for designing new, provably-useful skill learning algorithms.
Most practical algorithms for supervised machine learning boil down to optimizing the average performance over a training dataset. However, it is increasingly recognized that although the optimization objective is the same, the manner in which it is optimized plays a decisive role in the properties of the resulting predictor. For example, when training large neural networks, there are generally many weight combinations that will perfectly fit the training data. However, gradient-based training methods somehow tend to reach those which, on the one hand, do not overfit, and on the other hand, are very brittle to adversarially crafted examples. Why the dynamics of these methods lead to such "implicit biases" is still far from being fully understood. In this talk, I'll describe several recent theoretical results related to this question, in the context of benign overfitting and adversarial examples. Based on joint work with Gal Vardi and Gilad Yehudai.
We study the convergence properties of gradient descent for training deep linear neural networks, i.e., deep matrix factorizations, by extending a previous analysis for the related gradient flow. We show that under suitable conditions on the step sizes, gradient descent converges to a critical point of the loss function, i.e., the square loss in this work. Furthermore, we demonstrate that for almost all initializations gradient descent converges to a global minimum in the case of two layers. In the case of three or more layers, we show that gradient descent converges to a global minimum on the manifold of matrices of some fixed rank, where the rank cannot be determined a priori.
Since the proposal of the graph neural network (GNN) by Gori et al. (2005) and Scarselli et al. (2008), one of the major problems in training GNNs has been their struggle to propagate information between distant nodes in the graph. We propose a new explanation for this problem: GNNs are susceptible to a bottleneck when aggregating messages across a long path. This bottleneck causes the over-squashing of exponentially growing information into fixed-size vectors. As a result, GNNs fail to propagate messages originating from distant nodes and perform poorly when the prediction task depends on long-range interaction. Initially, we hypothesized that, thanks to their attention mechanism, Graph Attention Networks (GAT) are more resistant to over-squashing than GNNs that absorb incoming edges equally, such as GCN and GIN. However, we found that GAT computes a very limited kind of attention: the ranking of the attention scores is unconditioned on the query node. We formally define this restricted kind of attention as static attention and distinguish it from a strictly more expressive dynamic attention. Because GATs use a static attention mechanism, there are simple graph problems that GAT cannot express. To remove this limitation, we introduce a simple fix by modifying the order of operations and propose GATv2, which is now available as part of the PyTorch Geometric library, the Deep Graph Library, and the TensorFlow GNN library. The talk will focus on the following papers: "On the Bottleneck of Graph Neural Networks and its Practical Implications" (ICLR 2021, https://arxiv.org/pdf/2006.05205) and "How Attentive are Graph Attention Networks?" (ICLR 2022, https://arxiv.org/pdf/2105.14491).
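The static-attention limitation can be read off the scoring functions themselves: GAT scores a query-key pair as LeakyReLU(a^T [W h_i || W h_j]), which splits into a query-only part plus a key-only part, so the ranking over keys is the same for every query node; GATv2 applies the nonlinearity before the attention vector, a^T LeakyReLU(W [h_i || h_j]), which removes this restriction. A minimal sketch with random weights and arbitrary dimensions:

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

rng = np.random.default_rng(0)
d, d_out, n = 8, 16, 5
H = rng.standard_normal((n, d))            # node features
W = rng.standard_normal((d_out, d))
a = rng.standard_normal(2 * d_out)
W2 = rng.standard_normal((d_out, 2 * d))   # GATv2 applies one matrix to [h_i || h_j]
a2 = rng.standard_normal(d_out)

def gat_score(hi, hj):      # GAT:   LeakyReLU(a^T [W h_i || W h_j])
    return leaky_relu(a @ np.concatenate([W @ hi, W @ hj]))

def gatv2_score(hi, hj):    # GATv2: a^T LeakyReLU(W [h_i || h_j])
    return a2 @ leaky_relu(W2 @ np.concatenate([hi, hj]))

gat = np.array([[gat_score(H[i], H[j]) for j in range(n)] for i in range(n)])
gatv2 = np.array([[gatv2_score(H[i], H[j]) for j in range(n)] for i in range(n)])

# GAT: every query node prefers the same key (static attention); GATv2 need not.
print("GAT   argmax key per query:", gat.argmax(axis=1))
print("GATv2 argmax key per query:", gatv2.argmax(axis=1))
```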
Since the celebrated works of Russo and Zou (2016, 2019) and Xu and Raginsky (2017), it has been well known that the generalization error of supervised learning algorithms can be bounded in terms of the mutual information between their input and the output, given that the loss of any fixed hypothesis has a subgaussian tail. In this work, we generalize this result beyond the standard choice of Shannon's mutual information to measure the dependence between the input and the output. Our main result shows that it is indeed possible to replace the mutual information by any strongly convex function of the joint input-output distribution, with the subgaussianity condition on the losses replaced by a bound on an appropriately chosen norm capturing the geometry of the dependence measure. This allows us to derive a range of generalization bounds that are either entirely new or strengthen previously known ones. Examples include bounds stated in terms of p-norm divergences and the Wasserstein-2 distance, which are respectively applicable for heavy-tailed loss distributions and highly smooth loss functions. Our analysis is entirely based on elementary tools from convex analysis by tracking the growth of a potential function associated with the dependence measure and the loss function.
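For reference, the classical bound being generalized here states, in one standard form that may differ slightly from the talk's notation, that if the loss $\ell(w, Z)$ is $\sigma$-subgaussian for every fixed hypothesis $w$, then for a training sample $S$ of $n$ i.i.d. points and algorithm output $W$,
\[
\bigl|\mathbb{E}\bigl[L_\mu(W) - L_S(W)\bigr]\bigr| \;\le\; \sqrt{\frac{2\sigma^2}{n}\, I(S; W)},
\]
where $L_\mu$ and $L_S$ denote the population and empirical risks and $I(S; W)$ is Shannon's mutual information between the sample and the output. The results described above replace $I(S; W)$ by a general strongly convex dependence measure, with a matching norm condition on the losses in place of subgaussianity.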
Feedforward neural networks explicitly describe a series of computations to be done given an input data point b. Implicit networks, on the other hand, describe a condition which must be met, e.g. a fixed point equation depending on b. Recent work has shown implicit networks can match the performance of feedforward networks on common machine learning tasks such as image recognition, while training using a fraction of the memory. In this talk we discuss recent advances in training implicit neural networks, as well as novel applications to problems in which the target output can naturally be thought of as a fixed point (e.g. game theory).
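As a minimal sketch of the implicit-network idea (not a specific published architecture), the layer below defines its output as the fixed point z* = tanh(W z* + U b) and computes it by simple fixed-point iteration; in practice one would also differentiate through the fixed point implicitly rather than backpropagating through the iterations, which is where the memory savings come from.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_in = 16, 8
# scale W so that z -> tanh(W z + U b) is a contraction and the fixed point exists
W = rng.standard_normal((d, d))
W *= 0.9 / np.linalg.norm(W, 2)
U = rng.standard_normal((d, d_in))
b = rng.standard_normal(d_in)            # the input data point

def implicit_layer(b, tol=1e-10, max_iter=500):
    z = np.zeros(d)
    for _ in range(max_iter):
        z_next = np.tanh(W @ z + U @ b)  # the condition the output must satisfy
        if np.linalg.norm(z_next - z) < tol:
            break
        z = z_next
    return z

z_star = implicit_layer(b)
residual = np.linalg.norm(z_star - np.tanh(W @ z_star + U @ b))
print("fixed-point residual:", residual)
```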
Understanding the properties of neural networks trained via gradient descent is at the heart of the theory of deep learning. In this talk, I will discuss two approaches to studying the behavior of gradient descent methods. The first takes a mean-field view and relates the dynamics of stochastic gradient descent (SGD) to a certain Wasserstein gradient flow in probability space. I will show how this idea allows us to study the connectivity, convergence, and implicit bias of the solutions found by SGD. The second approach consists in the analysis of the Neural Tangent Kernel (NTK). I will present tight bounds on its smallest eigenvalue and show their implications for memorization and optimization in deep networks. Based on joint work with Adel Javanmard, Vyacheslav Kungurtsev, Andrea Montanari, Guido Montufar, Quynh Nguyen, and Alexander Shevchenko.
Deep reinforcement learning has enabled machines to solve complex control problems directly from high-dimensional camera inputs. However, these systems rely on carefully designed reward functions for specific tasks. On the other hand, humans learn about the world and perform complex behaviors without any external reward signal. We categorize the space of possible objective functions for embodied agents. We show a spectrum that reaches from narrow to general objectives. While the narrow objectives correspond to domain-specific rewards as typically used in reinforcement learning today, the general objectives correspond to information maximization through world models. This explains unsupervised learning, perception, exploration, skill discovery, and control from a single principle. Our findings suggest designing powerful world models as a path toward building highly adaptive agents that seek out large niches in their environments, rendering task rewards optional.
Training of neural networks is hard to describe theoretically due to the complicated non-linear dependence of network predictions on the parameters. However, the situation greatly simplifies in the limit of infinite network width, where the problem becomes quadratic, with the matrix given by the Neural Tangent Kernel (NTK). Such problems are more amenable to theoretical analysis and are mostly described by the spectral properties of the linear operator and the target function. In the first part of the talk, we will show that in certain scenarios the spectrum of the NTK and the eigendecomposition of the target function are asymptotically described by power laws, with simple explicit expressions for their exponents. In the second part of the talk, we will turn to general quadratic problems with a power-law spectrum and give tight bounds on the convergence speed of various gradient descent algorithms: vanilla Gradient Descent (GD), the Heavy Ball (HB) method, GD and HB with predefined schedules, Steepest Descent, and Conjugate Gradients. The talk is based on joint work with Dmitry Yarotsky (arXiv:2105.00507 and arXiv:2202.00992).
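To illustrate the second part concretely, here is a toy quadratic problem with a power-law spectrum (exponents chosen arbitrarily) on which one can track vanilla gradient descent in closed form; the loss decays as a power law in the iteration count, which is the kind of behavior the tight bounds in the talk quantify.

```python
import numpy as np

# quadratic problem 0.5 * sum_k lam_k * e_k^2 with power-law data
N = 20000
k = np.arange(1, N + 1)
lam = k ** (-1.5)            # eigenvalue power law (exponent chosen arbitrarily)
c = k ** (-1.0)              # initial error coefficients in the eigenbasis (arbitrary)

eta = 1.0 / lam[0]           # stable step size for vanilla GD

def loss_after(t):
    # GD contracts each mode by a factor (1 - eta * lam_k) per step
    return 0.5 * np.sum(lam * (c * (1.0 - eta * lam) ** t) ** 2)

for t in [10, 100, 1000, 10000]:
    print(t, loss_after(t))
# Plotting loss vs. t on a log-log scale shows an (asymptotic) power-law decay.
```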
We characterize the power-law asymptotics of learning curves for Gaussian process regression (GPR) under the assumption that the eigenspectrum of the prior and the eigenexpansion coefficients of the target function follow a power law. Under similar assumptions, we leverage the equivalence between GPR and kernel ridge regression (KRR) to characterize the generalization error of KRR. Infinitely wide neural networks can be related to GPR via the neural network GP kernel and the neural tangent kernel, which in several cases are known to have power-law spectra. Hence our methods can be applied to study the generalization error of infinitely wide neural networks. We present toy experiments demonstrating the theory. This is joint work with Pradeep Banerjee and Guido Montufar.
Inspired by the proposal of tangent kernels of neural networks (NNs), a recent line of research aims to design kernels with better generalization performance on standard datasets. Indeed, a few recent works showed that certain kernel machines perform as well as NNs on certain datasets, despite the separations between the two implied by theoretical results in specific cases. Furthermore, it was shown that the induced kernels of convolutional neural networks perform much better than previous handcrafted kernels. These empirical results pose a theoretical challenge to understanding the performance gaps between kernel machines and NNs in different scenarios.
In this talk, we show that data structures play an essential role in inducing these performance gaps. We consider a few natural data structures and study their effects on the performance of these learning methods. Based on a fine-grained high-dimensional asymptotics framework for analyzing random features models and kernel machines, we show the following: 1) If the feature vectors are nearly isotropic, kernel methods suffer from the curse of dimensionality, while NNs can overcome it by learning the best low-dimensional representation; 2) If the feature vectors display the same low-dimensional structure as the target function (the spiked covariates model), this curse of dimensionality becomes milder, and the performance gap between kernel methods and NNs becomes smaller; 3) On datasets that display some invariance structure (e.g., image datasets), there is a quantitative performance gain from using invariant kernels (e.g., convolutional kernels) over inner product kernels. Beyond explaining the performance gaps, these theoretical results can further provide intuition for designing kernel methods with better performance.
Multi-armed bandit algorithms minimize the experimentation costs required to converge on optimal behavior. They do so by rapidly adapting experimentation effort away from poorly performing actions as feedback is observed. But this desirable feature makes them sensitive to confounding, which is the primary concern underlying classical randomized controlled trials. We highlight, for instance, that popular bandit algorithms cannot address the problem of identifying the best action when day-of-week effects may confound inferences. In response, this paper proposes deconfounded Thompson sampling, which makes simple, but critical, modifications to the way Thompson sampling is usually applied. Theoretical guarantees suggest the algorithm strikes a delicate balance between adaptivity and robustness to confounding. It attains asymptotic lower bounds on the number of samples required to confidently identify the best action, suggesting optimal adaptivity, but also satisfies strong performance guarantees in the presence of day-of-week effects and delayed observations, suggesting unusual robustness. At the core of the paper is a new, very general model of contextual bandit experiments in which issues of delayed learning and distribution shift arise organically.
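For readers less familiar with the baseline, here is a minimal sketch of standard Beta-Bernoulli Thompson sampling; the deconfounded variant discussed in the talk modifies how the posterior and the action selection handle contexts such as day-of-week effects, which is not reflected in this vanilla version.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.6])   # unknown Bernoulli reward probabilities
K, T = len(true_means), 5000
alpha, beta = np.ones(K), np.ones(K)     # Beta(1, 1) prior for each arm

pulls = np.zeros(K, dtype=int)
for t in range(T):
    theta = rng.beta(alpha, beta)        # sample one plausible mean per arm
    a = int(np.argmax(theta))            # play the arm that looks best under the sample
    r = rng.random() < true_means[a]     # Bernoulli reward
    alpha[a] += r                        # conjugate posterior update
    beta[a] += 1 - r
    pulls[a] += 1

print("pulls per arm:", pulls)           # effort concentrates on the best arm over time
```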