Recent advances in deep learning performance have all relied on scaling up the number of parameters within neural networks, consequently making asymptotic scaling limits a compelling approach to theoretical analysis. In this talk, we explore the proportional infinite-depth-and-width limit, where the role of depth can be adequately studied, and the limit remains a great model of finite size networks. At initialization, we characterize the limiting distribution of the network via a stochastic differential equation (SDE) for the feature covariance matrix. Furthermore, in the linear network setting, we can also characterize the spectrum of the covariance matrix in the large data limit via a geometric variant of Dyson Brownian motions. Finally, we will briefly discuss ongoing work towards analyzing training dynamics.
Reverse thinking plays a crucial role in human reasoning. Humans can reason not only from a problem to a solution but also in reverse, i.e., start from the solution and reason towards the problem. This often enhances overall reasoning performance as it enables consistency checks between their forward and backward thinking. To enable Large Language Models (LLMs) to perform reverse thinking, we introduce Reverse-Enhanced Thinking (RevThink), a framework composed of data augmentation and learning objectives. In RevThink, we augment the dataset by collecting structured forward-backward reasoning from a teacher model, consisting of: (1) the original question, (2) forward reasoning, (3) backward question, and (4) backward reasoning. We then employ three objectives to train a smaller student model in a multi-task learning fashion: (a) generate forward reasoning from a question, (b) generate a backward question from a question, and (c) generate backward reasoning from the backward question. Experiments across 12 datasets covering commonsense, math, and logical reasoning show an average 13.53% improvement over the student model's zero-shot performance and a 6.84% improvement over the strongest knowledge distillation baselines. Moreover, our method demonstrates sample efficiency -- using only 10% of the correct forward reasoning from the training data, it outperforms a standard fine-tuning method trained on 10x more forward reasoning. RevThink also exhibits strong generalization to out-of-distribution held-out datasets.
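As a concrete illustration of the augmentation and multi-task setup described above, here is a minimal Python sketch; the prompt templates, field names, and the teacher_generate helper are hypothetical stand-ins rather than the authors' implementation.

```python
# Minimal sketch of RevThink-style data augmentation and multi-task example construction.
# The teacher_generate callable and the prompt templates are hypothetical placeholders.
from typing import Callable, Dict, List


def build_revthink_examples(
    question: str,
    teacher_generate: Callable[[str], str],
) -> List[Dict[str, str]]:
    # Teacher model provides: (2) forward reasoning, (3) backward question, (4) backward reasoning.
    forward_reasoning = teacher_generate(f"Solve step by step: {question}")
    backward_question = teacher_generate(f"Write the reverse question for: {question}")
    backward_reasoning = teacher_generate(f"Solve step by step: {backward_question}")

    # Three student objectives, framed as text-to-text pairs for multi-task fine-tuning:
    return [
        # (a) question -> forward reasoning
        {"input": question, "target": forward_reasoning},
        # (b) question -> backward question
        {"input": f"Generate the backward question: {question}", "target": backward_question},
        # (c) backward question -> backward reasoning
        {"input": backward_question, "target": backward_reasoning},
    ]
```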
Deep neural networks excel at learning feature representations, setting them apart from traditional machine learning. This talk explores how feature learning emerges during neural network training and its role in foundation models’ adaptation. We provide theoretical insights into how networks efficiently learn class-relevant patterns early in training, leveraging data structures beyond kernel methods. Extending our analysis to Transformers, we examine Fourier features and the relationship between model scale and in-context learning. Building on these insights, we propose practical improvements, including nuclear norm regularization for domain generalization, a novel contrastive learning regularization, looped Transformers for multi-step gradient descent, and GemFilter for accelerating LLM inference. Our findings enhance understanding and efficiency in modern machine learning systems.
In this talk, we study the learning dynamics and convergence rates of gradient-based methods for two-layer linear models. We first study the dynamics of gradient descent during pretraining and show that the optimization problem satisfies a local PL condition and a local Descent Lemma, which lead to a linear convergence rate for a suitable choice of the step sizes. Compared to prior work, our results require no restrictive assumptions on width, initialization, or step sizes, and achieve faster convergence rates. Next, we examine the finetuning stage through the lens of low-rank adaptation for matrix factorization. We show that gradient flow converges to a neighborhood of the optimal solution and that smaller initializations yield lower final errors. Our analysis reveals how the final error depends on the misalignment between the singular spaces of the pretrained model and the target matrix, and it highlights how reducing the initialization scale can improve alignment and hence performance.
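For readers unfamiliar with the two conditions, the displays below recall their generic local form; this is an illustrative statement with constants restricted to a region S, not the paper's exact formulation.

```latex
% Local Polyak-Lojasiewicz (PL) condition on a region S, with constant \mu > 0:
\|\nabla L(\theta)\|^2 \;\ge\; 2\mu\,\bigl(L(\theta) - L^\star\bigr)
\qquad \text{for all } \theta \in S.

% Local Descent Lemma (local smoothness) on S, with constant \beta > 0:
L(\theta') \;\le\; L(\theta) + \langle \nabla L(\theta),\, \theta' - \theta\rangle
+ \tfrac{\beta}{2}\,\|\theta' - \theta\|^2
\qquad \text{for all } \theta, \theta' \in S.
```

Combining the two in the standard way, a gradient step with step size 1/β contracts the suboptimality gap by a factor 1 − μ/β, which is the type of linear convergence rate referred to above.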
In this talk, we study the problem of feature learning in shallow neural networks. In the first part of the talk, we review the fundamental limitations of learning using two-layer neural networks with the first-layer weights fixed at initialization -- a model that does not learn features. We demonstrate that, in the high-dimensional proportional limit, this model is unable to learn nonlinear functions, a property known as Gaussian Equivalence. We then show that even a single step of gradient descent applied to the first layer can drastically alter this scenario. Through a precise analysis of the spectrum of the feature matrix, we illustrate how this one-step update breaks Gaussian Equivalence.
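To fix notation, the single first-layer gradient step discussed above reads, schematically,

```latex
% One gradient step on the first-layer weights, with the second layer a held fixed (schematic).
W_1^{(1)} \;=\; W_1^{(0)} \;-\; \eta\, \nabla_{W_1} \widehat{L}\bigl(W_1^{(0)}, a\bigr),
\qquad
\text{learned features: } \; x \mapsto \sigma\!\bigl(W_1^{(1)} x\bigr).
```

Roughly speaking, for a suitably large step size η this update adds an approximately low-rank ("spiked") component to the otherwise random weight matrix, and the spectral analysis mentioned above tracks how this spike appears in the feature matrix.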
In the second part of the talk, we revisit the problem of a two-layer network updated by one gradient descent step and also consider linear representation learning, another widely studied model of feature learning. We establish that in these problems, gradient descent is a suboptimal feature-learning algorithm under general input conditions beyond the typical isotropic assumption. Furthermore, we show that layer-wise preconditioning optimization methods emerge as the natural solution. Thus, we provide the first learning-theoretic motivations for these popular deep learning optimization algorithms.
Data imbalance presents a fundamental challenge in data analysis, where certain classes (minority classes) account for a small fraction of the training data compared with other classes (majority classes). Many existing techniques attempt to compensate for the underrepresentation of minority classes, yet in empirical deep learning, the issue is exacerbated by the large model size. Despite extensive heuristics, the statistical foundations underlying these methods remain underdeveloped, raising concerns about the reliability of machine learning models. Classical statistical theory—based on large-sample asymptotics and finite-sample corrections—often falls short in high-dimensional settings, leaving many overfitting phenomena unexplained.
In this talk, I will examine the problem of imbalanced classification in high dimensions, focusing on support vector machines (SVMs) and logistic regression. I will introduce a "truncation" phenomenon—a distortion of the training data's logit distribution due to overfitting in high dimensions—which we have observed across single-cell tabular data, image data, and text data. I will provide a theoretical foundation by characterizing asymptotic distribution via a variational formulation. This analysis formalizes the intuition that overfitting disproportionately harms minority classes and demonstrates how margin rebalancing—a widely used deep learning heuristic—helps mitigate data imbalance. Consequently, this theory provides both qualitative and quantitative insights into generalization errors and uncertainty quantification, particularly in the context of model calibration.
This talk is based on a joint work with Kangjie Zhou (Columbia) and my advisor Yiqiao Zhong (UW–Madison).
Link to the paper: https://arxiv.org/abs/2502.11323
Reproduction code: https://github.com/jlyu55/Imbalanced_Classification
In this talk I will discuss shallow neural networks with polynomial activations, with the viewpoint that understanding these networks can shed light on the behavior of more complex neural networks. I will focus on the relationship between width and optimization, as well as the effect that the data distribution has on the training landscape. Since the function space for shallow polynomial networks can be identified with symmetric tensors with bounded rank, our results may also be formulated in terms of symmetric tensor decomposition. Joint with Y. Arjevani, J. Bruna, E. Polak and M. Trager (https://arxiv.org/abs/2501.06074).
Matrices can be decomposed via rank-one approximations: the best rank-one approximation is a singular vector pair, and the singular value decomposition writes a matrix as a sum of singular vector pairs. The singular vector tuples of a tensor are the critical points of its best rank-one approximation problem. In this paper, we study tensors that can be decomposed via successive rank-one approximations: compute a singular vector tuple, subtract it off, compute a singular vector tuple of the new deflated tensor, and repeat. The number of terms in such a decomposition may exceed the tensor rank. Moreover, the decomposition may depend on the order in which terms are subtracted. We show that the decomposition is valid independent of order if and only if all singular vectors in the process are orthogonal in at least two factors. We study the variety of such tensors. We lower bound its dimension, showing that it is significantly larger than the variety of odeco tensors.
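As a concrete illustration of the deflation process described above, here is a minimal numerical sketch for a 3-way tensor; alternating power iteration only returns some singular vector tuple (a critical point of the best rank-one problem), so this is an illustration of the procedure rather than a certified computation.

```python
# Sketch of successive rank-one approximation (deflation) for a 3-way tensor.
import numpy as np


def singular_vector_tuple(T, iters=200, seed=0):
    """Approximate a singular vector tuple (sigma, u, v, w) of T by alternating power iteration."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(T.shape[0]); u /= np.linalg.norm(u)
    v = rng.standard_normal(T.shape[1]); v /= np.linalg.norm(v)
    w = rng.standard_normal(T.shape[2]); w /= np.linalg.norm(w)
    for _ in range(iters):
        u = np.einsum('ijk,j,k->i', T, v, w); u /= np.linalg.norm(u)
        v = np.einsum('ijk,i,k->j', T, u, w); v /= np.linalg.norm(v)
        w = np.einsum('ijk,i,j->k', T, u, v); w /= np.linalg.norm(w)
    sigma = np.einsum('ijk,i,j,k->', T, u, v, w)
    return sigma, u, v, w


def successive_rank_one(T, terms=3):
    """Greedy deflation: compute a tuple, subtract its rank-one term, repeat."""
    residual = T.copy()
    decomposition = []
    for t in range(terms):
        sigma, u, v, w = singular_vector_tuple(residual, seed=t)
        decomposition.append((sigma, u, v, w))
        residual = residual - sigma * np.einsum('i,j,k->ijk', u, v, w)
    return decomposition, residual
```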
Studying the interplay between the geometry of the loss landscape and the optimisation trajectories of simple neural networks is a fundamental step for understanding their behaviour in more complex settings. We discuss the presence of a topological obstruction in the loss landscape of shallow ReLU neural networks trained using gradient flow. The homogeneous nature of the ReLU activation function constrains the training trajectories to lie on a product of quadric hypersurfaces whose shape depends on the particular initialisation of the network's parameters. We prove that these quadrics can have multiple connected components, limiting the set of reachable parameters during training. We analytically compute the number of these components and discuss the possibility of mapping one to the other through neuron rescaling and permutation.
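Concretely, for a single hidden ReLU neuron with incoming weights w and outgoing weight a, the homogeneity of the activation makes the following balancedness quantity invariant under gradient flow, so the trajectory is confined to a quadric fixed by the initialisation (a standard fact, stated here for orientation):

```latex
% Invariant of gradient flow for one ReLU neuron (w, a); one such quadric per neuron.
\frac{d}{dt}\Bigl(\|w(t)\|^2 - a(t)^2\Bigr) = 0
\quad\Longrightarrow\quad
\|w(t)\|^2 - a(t)^2 \;=\; \|w(0)\|^2 - a(0)^2 \;=:\; \delta .
```

For δ < 0 the quadric {‖w‖² − a² = δ} has two connected components (a ≥ √(−δ) and a ≤ −√(−δ)), which is the simplest instance of the topological obstruction discussed above.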
We introduce a Transformer-based approach to computational enumerative geometry, specifically targeting the computation of ψ-class intersection numbers on the moduli space of curves. Traditional methods for calculating these numbers suffer from factorial computational complexity, making them impractical to use. By reformulating the problem as a continuous optimization task, we compute intersection numbers across a wide range of values. To capture the recursive nature inherent in these intersection numbers, we propose the Dynamic Range Activator (DRA), a new activation function that enhances the Transformer's ability to model recursive patterns and handle severe heteroscedasticity. Given the precision requirements for computing the intersection numbers, we quantify the uncertainty of the predictions using Conformal Prediction with a dynamic sliding window adaptive to partitions with an equivalent number of marked points. To the best of our knowledge, there has been no prior work on modeling recursive functions with such high variance and factorial growth. Beyond simply computing intersection numbers, we explore the enumerative "world-model" of Transformers. Our interpretability analysis reveals that the network implicitly models the Virasoro constraints in a purely data-driven manner. Moreover, through abductive hypothesis testing, probing, and causal inference, we uncover evidence of an emergent internal representation of the large-genus asymptotics of the ψ-class intersection numbers. These findings suggest that the network internalizes the parameters of the asymptotic closed form and the polynomiality phenomenon of ψ-class intersection numbers in a non-linear manner. This talk is based on arXiv:2408.14915.
Besides classical feed-forward neural networks, neural ordinary differential equations (neural ODEs) have gained particular interest in recent years. Neural ODEs can be interpreted as an infinite-depth limit of feed-forward or residual neural networks. In this presentation, we study the input-output dynamics of finite- and infinite-depth neural networks with scalar output. In the finite-depth case, the input is mapped under multiple non-linear transformations to the state of one output node. Analogously, a neural ODE maps a linear transformation of the input to a linear transformation of its time-T map. We show that, depending on the structure of the network, the input-output map has different properties regarding the existence and regularity of critical points, which can be characterized via Morse functions. We prove that critical points cannot exist if the hidden layer dimensions are monotonically decreasing or if the dimension of the phase space is smaller than or equal to the input dimension. If critical points exist, we classify their regularity depending on the specific architecture. The established theorems are comparable in the finite- and infinite-depth cases and allow us to formulate results on universal embedding and approximation. Our dynamical systems viewpoint on the geometric structure of the input-output map provides a fundamental understanding of why specific architectures outperform others.
Tropical geometry is a relatively recent field in mathematics and computer science combining elements of algebraic geometry and polyhedral geometry. It has recently emerged successfully in the analysis and extension of several classes of problems and systems in both classical machine learning and deep learning. In this talk, we will first summarize a few introductory ideas and tools of tropical geometry and its underlying max-plus arithmetic and matrix algebra. Then, we will focus on how this new set of tools can aid in the analysis, design and understanding of several classes of neural networks and other machine learning systems, including deep neural networks with piecewise-linear activations, morphological neural networks, neural network minimization, and nonlinear regression with piecewise-linear functions. Our coverage will include studying the representation power, training and pruning of these networks and regressors under the lens of tropical geometry and max-plus algebra. More information and related papers can be found at robotics.ntua.gr.
Biography
Petros Maragos received his Ph.D. degree from Georgia Tech, Atlanta, in 1985. Then, he joined the faculty of the Division of Applied Sciences at Harvard University, where he worked for 8 years as a professor of EE affiliated with the Harvard Robotics Lab. In 1993, he joined the faculty of the School of ECE at Georgia Tech, affiliated with its Center for Signal & Image Processing. Since 1999, he has been working as a professor at NTUA, where he is currently the director of the Intelligent Robotics & Automation Lab. He has held visiting positions at MIT in 2012, at UPenn in 2016, and at USC in 2023. He is a co-founder and, since 2023, the acting director of the Robotics Institute of the Athena RC. His research and teaching interests include computer vision & speech, machine learning, and robotics. He is the recipient of several awards, including an NSF Presidential Young Investigator Award, Best Paper awards from IEEE journals and Computer Vision conferences, and the Technical Achievement Award from EURASIP. For his research contributions, he was elected a Fellow of IEEE in 1995 and a Fellow of EURASIP in 2010. He served as an IEEE Distinguished Lecturer for 2017-2018. Since 2023, he has been a Life Fellow of IEEE.
Two-timescale learning algorithms are often applied in game theory and bi-level optimization, using distinct update rates for two interdependent processes. In this talk, I will discuss the convergence of two types of two-timescale algorithms.
The first type is the unified two-timescale Q-learning algorithm by Angiuli et al., effective for solving mean field game (MFG) and mean field control (MFC) problems by adjusting the learning rate ratio for mean field distribution and Q-functions. We provide a theoretical explanation for the algorithm’s bifurcated outcomes under fixed learning rates, contributing a Lyapunov function that integrates mean field distribution and Q-function iterates. This function ensures unified convergence across all learning rates under mild assumptions.
The second type is the two-timescale gradient descent-ascent (GDA) algorithm, designed to find Nash equilibria in min-max games with improved convergence properties. Using a PDE-inspired approach, we analyze convergence in both finite- and infinite-dimensional cases. For finite-dimensional quadratic min-max games, we examine long-time convergence in near quasi-static regimes through hypocoercivity. For mean-field GDA dynamics, we study convergence under finite-scale ratios using a mixed synchronous-reflection coupling technique.
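For concreteness, one common form of the two-timescale GDA iteration on a min-max objective f(x, y) is the following (schematic; step-size conventions vary):

```latex
% Two-timescale gradient descent-ascent with timescale ratio \tau on min_x max_y f(x, y).
x_{k+1} = x_k - \eta\, \nabla_x f(x_k, y_k),
\qquad
y_{k+1} = y_k + \tau \eta\, \nabla_y f(x_k, y_k).
```

In this convention the maximizing player updates on the faster timescale (τ > 1), so that y approximately tracks a best response to the slowly moving x; the finite-scale-ratio regime mentioned above corresponds to τ being bounded rather than sent to infinity.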
In this talk, we introduce a novel capacity measure 2sED for statistical models based on the effective dimension. The new quantity provably bounds the generalization error under mild assumptions on the model. Furthermore, simulations on standard data sets and popular model architectures show that 2sED correlates well with the training error. For Markovian models, we show how to efficiently approximate 2sED from below through a layerwise iterative approach, which allows us to tackle deep learning models with a large number of parameters. Simulation results suggest that the approximation is good for different prominent models and data sets.
The characterization of the structure of the manifold of low-lying energy states in neural networks is among the most fundamental theoretical questions in machine learning. In recent years, many empirical studies of the landscape of neural networks and constraint satisfaction problems (CSPs) have shown that the low-lying configurations are often found in complex connected structures, where zero-energy paths between pairs of distant solutions can be constructed.
In this talk, I will discuss the geometrical organization of solutions in two prototypical problems, the "binary" and the "negative" perceptron. I will show that wide flat minima arise as complex extensive structures from the coalescence of minima around "high-margin" (i.e. locally robust) configurations. In the binary perceptron, in turn, the constraint density at which a hard phase for algorithms emerges can be predicted by the disappearance of the widest and flattest minima.
Finally, I will introduce a novel analytical method for characterizing the typical energy barriers between two configurations sampled from the zero-temperature Gibbs measure of the problem. This approach will provide a detailed characterization of the shape of the solution space of the negative perceptron and one-hidden layer neural networks in the overparameterized regime.
The dependence structure of high-dimensional distributions is often modeled by graphical models. The problem of learning the graph underlying such distributions has received a lot of attention in statistics and machine learning. In problems of very high dimension, it is often too costly even to store the sample covariance matrix. We propose a model in which one can query single entries of the covariance matrix. We construct efficient algorithms for structure recovery in Gaussian graphical models with query complexity that is quasi-linear in the dimension. We present algorithms that work for trees and, more generally, for graphs of small treewidth. We also discuss hypothesis testing of properties of the underlying graph. The talk is based on joint work with Luc Devroye, Jakub Truszkowski, Vasiliki Velona, and Piotr Zwiernik.
As synthetic data becomes higher quality and proliferates on the internet, machine learning models are increasingly trained on a mix of human- and machine-generated data. Despite the success stories of using synthetic data for representation learning, using synthetic data for generative model training creates “self-consuming loops” which may lead to training instability or even collapse, unless certain conditions are met. We provide an overview of research in this new and growing field, which started in July 2023 when a group of researchers showed that "Self Consuming Generative Models Go Mad." Since then, there have been many papers that study this self-consuming loop problem, but very few that try to solve it. We'll discuss many of these papers, including our recent work, "Self-Correcting Self-Consuming Loops for Generative Model Training" (ICML 2024). Our paper aims to stabilize self-consuming generative model training by introducing a "correction" function, which ensures that the synthetic training data is of higher quality. We empirically validate the effectiveness of self-correcting self-consuming loops on the challenging human motion synthesis task, and observe that it successfully avoids model collapse, even when the ratio of synthetic data to real data is as high as 100%.
As Generative AI technologies are revolutionizing industries and our daily lives, what is going to happen to the role of the mathematician? In this talk, I will highlight recent breakthroughs in deep learning and AI and explore how current and future advancements might alter the way we do mathematics.
We highlight a perhaps important but hitherto unobserved insight: the attention module in a ReLU-transformer is a cubic spline. Viewed in this manner, this mysterious but critical component of a transformer becomes a natural development of an old notion deeply entrenched in classical approximation theory. Conversely, if we assume the Pierce–Birkhoff conjecture, then every spline is also an encoder. This gives a satisfying answer to the question of the mathematical structure of ReLU-transformers.
In this talk, we will see how concentration inequalities for the sample mean, like those due to Bernstein and Hoeffding, are valid for any sample size but overly conservative, yielding unnecessarily wide confidence intervals. Motivated by applications to reinforcement learning, we develop new results on transport distances and obtain new computable concentration inequalities with asymptotically optimal size, finite-sample validity, and sub-Gaussian decay. Beyond empirical averages, we will tie this to the concept of Gaussian universality, which has recently been observed to hold for a large class of estimators. We will precisely determine the class of functions and, hence, estimators for which it holds. Beyond the independent setting, we will see how it holds for a class of dependent processes and how this can allow us to gain insight into the behavior of data augmentation. Time permitting, we will finish the talk by investigating what happens with structured data in the context of graph neural networks.
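As a reference point for the "valid for any sample size but overly conservative" statement, recall Hoeffding's inequality for i.i.d. bounded variables:

```latex
% Hoeffding: X_1, ..., X_n i.i.d. with values in [a, b] and mean \mu.
\mathbb{P}\!\left( \Bigl| \tfrac{1}{n}\textstyle\sum_{i=1}^{n} X_i - \mu \Bigr| \ge t \right)
\;\le\; 2\exp\!\left( - \frac{2 n t^2}{(b-a)^2} \right).
```

The bound holds for every n but depends only on the range b − a and ignores the variance, which is exactly the kind of slack the transport-based inequalities in this talk aim to remove.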
Despite the great practical success of deep learning, our theoretical understanding of the generalization of neural networks is still limited. Theoretical analysis can be facilitated by studying "linearized" neural networks, for example through neural tangent kernel theory. While this allows us to leverage some classical theory for regularized kernel methods, only more recent theoretical results have made it possible to study phenomena such as benign overfitting and double descent. In this talk, I will give an overview of my research on the generalization of neural network regression as well as some related works.
Scaling of machine learning models has significantly improved model capabilities, especially for self-supervised tasks. To scale efficiently, one must harmonize modeling efforts with good utilization of the computational resources at hand. In this talk I will review a selection of important LLM architectural choices with the perspective of efficiency. I will first walk through the high level mechanism of GPU computation along with their capabilities and constraints. I will dive deeper into a few aspects of the hardware and work to identify a few basic principles that guide hardware efficiency in the field. From there, I will motivate recent innovations in LLM execution and architecture from the lens of the hardware principles we identified, considering both LLM training and inference efficiency.
We investigate the complexity of deep neural networks through the lens of functional equivalence, which posits that different parameterizations can yield the same network function. Leveraging the equivalence property, we present a novel bound on the covering number for deep neural networks, which reveals that the complexity of neural networks can be reduced. Additionally, we demonstrate that functional equivalence benefits optimization, as overparameterized networks tend to be easier to train since increasing network width leads to a diminishing volume of the effective parameter space. These findings can offer valuable insights into the phenomenon of overparameterization and have implications for understanding generalization and optimization in deep learning.
Given a sequence of tokens, such as words, the task of next-token prediction is to predict the next-token conditional probability distribution. Decoder-only transformers have become effective models for this task, but their properties are still not fully understood. In particular, the largest number of distinct context sequences that a decoder-only transformer can interpolate next-token distributions for has not been established. To fill this gap, we prove upper and lower bounds on this number, which are equal up to a multiplicative constant. We prove these bounds in the general setting where next-token distributions can be arbitrary as well as the empirical setting where they are calculated from a finite number of document sequences. Our lower bounds are for one-layer transformers and our proofs highlight an important injectivity property satisfied by self-attention.
Mathematical reasoning is a pivotal component of human intelligence, crucial for advancing education and science. This talk delves into the development of language model systems capable of robust mathematical reasoning, marking a significant step toward realizing general artificial intelligence. We introduce multi-modal and knowledge-intensive benchmarks to assess the reasoning capabilities of large language models (LLMs) and vision-language models (VLMs) across real-world contexts, including visual information, tabular data, and scientific domains. This talk advances the field by presenting innovative retrieval and tool-augmented algorithms that enhance LLM capabilities. It concludes by analyzing the latest advances in mathematical reasoning within visual contexts, and highlighting the current challenges and future prospects.
In this talk, I will present a novel view of the problem of finding discriminant loci of a parametric system of polynomial equations as a machine learning problem, i.e., determining classification boundaries over the parameter space, with the classes being the number of real solutions of the system. The number of real solutions at various parameter points is obtained using the numerical polynomial homotopy continuation method. Machine learning techniques, such as deep learning and nearest neighbors, are employed to efficiently approximate the boundaries, which in turn provides a decomposition of the parameter space in terms of the number of real solutions. It is shown that the proposed method can well approximate complicated discriminant loci in various examples, such as the Kuramoto model.
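The classification viewpoint is easy to prototype on a toy example where the discriminant is known in closed form. The sketch below (not the talk's Kuramoto experiments) labels parameter points (b, c) of the quadratic x² + bx + c by their number of real roots and fits a nearest-neighbor classifier; in the talk's setting the labels would instead come from numerical polynomial homotopy continuation.

```python
# Toy version of "discriminant locus as a classification boundary":
# parameters (b, c) of x^2 + b x + c, class = number of real roots (0 or 2, generically).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
params = rng.uniform(-2.0, 2.0, size=(5000, 2))                      # sampled parameter points
n_real = np.where(params[:, 0]**2 - 4.0 * params[:, 1] > 0, 2, 0)    # labels via the discriminant

clf = KNeighborsClassifier(n_neighbors=5).fit(params, n_real)

# The learned decision boundary approximates the discriminant curve b^2 - 4c = 0.
print(clf.predict(np.array([[1.0, -1.0], [0.0, 1.0]])))  # expected: [2 0]
```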
Online and bandit algorithms are powerful theoretical tools in machine learning, designed for scenarios requiring sequential decision-making with limited information. In this talk, I will explore two practical applications of these algorithms. The first application is online distribution shift adaptation, where the test data distribution changes over time. By making certain assumptions about the distribution shift, we can frame this problem within an online learning context and solve it using online algorithms. The second application involves identifying the best Large Language Model (LLM)-based method in resource-constrained settings. Due to the significant costs of time or money associated with querying LLMs, it's crucial to minimize the number of queries when identifying the best LLM-based method. We demonstrate that using multi-armed bandit algorithms combined with low-rank factorization allows us to efficiently identify the best method with far fewer queries compared to evaluating all candidate methods on all test data.
Linear networks are artificial neural networks that use linear activation functions. Despite their simplicity, these networks have the potential for understanding more complex architectures. In this talk, we focus on linear convolutional networks with arbitrary strides. The neuromanifold of such a network is a semialgebraic set, represented by a space of polynomials that admit specific factorizations. We introduce a recursive algorithm to derive polynomial equations whose common zeros define the Zariski closure of the neuromanifold. Additionally, we examine the algebraic complexity involved in training these networks using techniques from metric algebraic geometry. We show that the total number of complex critical points in optimizing these networks corresponds to the generic Euclidean distance degree of a Segre variety. This number is notably higher than the number of critical points found when training a fully connected linear network with the same number of parameters.
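For stride-one one-dimensional filters, the polynomial picture mentioned above can be made explicit (a simplified special case stated for orientation; for arbitrary strides the factors involve substitutions of the form p(x^s)):

```latex
% A stride-one 1D convolution with filter (w_0, ..., w_k) acts as multiplication by its
% generating polynomial, and composing linear convolutional layers multiplies the polynomials.
p_{w}(x) = \sum_{i=0}^{k} w_i\, x^{i},
\qquad
p_{\text{end-to-end}}(x) = p_{w^{(L)}}(x) \cdots p_{w^{(2)}}(x)\; p_{w^{(1)}}(x),
```

so the neuromanifold consists of coefficient vectors of polynomials admitting such a factorization, which is the space of polynomials with specific factorizations referred to above.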
Activation functions play a pivotal role in deep neural networks, enabling them to tackle complex tasks like image recognition. However, activation functions also introduce significant challenges for deep learning theory, network dynamics analysis, and properties such as interpretability and privacy preservation. In this talk, we revisit the necessity of activation functions across various scenarios. Specifically, we explore expressing network outputs through high-degree interactions among input elements using multilinear algebra. This approach allows us to attain the necessary expressivity via these intricate interactions. Yet, the question remains: Is this expressivity alone sufficient for effective learning? Our recent research, presented at ICLR’24, unveils meticulously designed networks employing multilinear operations. Remarkably, these networks achieve strong performance even in demanding tasks, such as ImageNet image recognition.
Language models that have been trained to predict the next word over billions of text documents have been shown to also significantly predict brain recordings of people comprehending language. Understanding the reasons behind the observed similarities between language in machines and language in the brain can lead to more insight into both systems. Additionally, the human language system integrates information from multiple sensory modalities which puts text-only language models at a fundamental disadvantage as cognitive models. In this talk, we will discuss a series of recent works that make progress towards these questions along different dimensions. The unifying principle among these works that allows us to make scientific claims about why one black box (language model) aligns with another black box (the human brain) is our ability to make specific perturbations in the language model and observe their effect on the alignment with the brain. Building on this approach, these works reveal that the observed alignment is due to more than next-word prediction and word-level semantics and is partially related to joint processing of select linguistic information in both systems. Furthermore, we find that the brain alignment can be improved by training a language model to summarize narratives, and to incorporate auditory and visual information of a scene. Taken together, these works make progress towards determining the sufficient and necessary conditions under which language in machines aligns with language in the brain.
Bio: Mariya Toneva is a tenure-track faculty at the Max Planck Institute for Software Systems, where she leads the BrAIN (Bridging AI and Neuroscience) group. Her research is at the intersection of Machine Learning, Natural Language Processing, and Neuroscience, with a focus on building computational models of language processing in the brain that can also improve natural language processing systems. Prior to MPI-SWS, she was a C.V. Starr Fellow at the Princeton Neuroscience Institute and obtained her Ph.D. from Carnegie Mellon University in a joint program between Machine Learning and Neural Computation.
A mystery of modern neural networks is their surprising generalization power in the overparametrized regime: they comprise so many parameters that they can interpolate the training set, even if the actual labels are replaced by purely random ones; despite this, they achieve good prediction error on unseen data.
In this talk, we focus on the neural tangent (NT) model for two-layer neural networks, which is a simplified model. Under isotropic input data, we first show that the interpolation phase transition occurs around Nd ~ n, where Nd is the number of parameters and n is the sample size.
To demystify the generalization puzzle, we consider the min-norm interpolator and show that its test error/generalization error is largely determined by a smooth, low-degree component. Moreover, we find that the nonlinearity of the activation function has an implicit regularization effect. These results offer new insights into recent discoveries in overparametrized models, such as the double descent phenomenon.
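For reference, the two-layer NT model studied here can be written schematically as

```latex
% Neural tangent (NT) model: linearization of a two-layer network in its first-layer weights,
% with the w_i frozen at their random initialization and the a_i trained.
f_{\mathrm{NT}}(x; a) \;=\; \sum_{i=1}^{N} \langle a_i, x\rangle\, \sigma'\!\bigl(\langle w_i, x\rangle\bigr),
\qquad a_i \in \mathbb{R}^d,
```

so the number of trainable parameters is Nd, which is why the interpolation threshold above is stated as Nd ~ n.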
Link to the paper: https://arxiv.org/abs/2007.12826
Bio: Yiqiao Zhong is currently an assistant professor in the Department of Statistics at the University of Wisconsin–Madison. Prior to joining UW–Madison, Yiqiao was a postdoc at Stanford University, advised by Prof. Andrea Montanari and Prof. David Donoho. His research interests include the analysis of Large Language Models, deep learning theory, and high-dimensional statistics. Yiqiao Zhong obtained his Ph.D. in 2019 from Princeton University, where he was advised by Prof. Jianqing Fan.
Understanding the geometric properties of gradient descent dynamics is a key ingredient in deciphering the recent success of very large machine learning models. A striking observation is that trained over-parameterized models retain some properties of the optimization initialization. This “implicit bias” is believed to be responsible for some favorable properties of the trained models and could explain their good generalization properties. In this work, we expose the definitions and properties of "conservation laws", which define quantities conserved during gradient flows of a given machine learning model, such as a ReLU network, with any training data and any loss. After explaining how to find the maximal number of independent conservation laws via Lie algebra computations, we provide algorithms to compute a family of polynomial laws, as well as to compute the number of (not necessarily polynomial) conservation laws. We show that on a number of architectures there are no more laws than the known ones, and we identify new laws for certain flows with momentum and/or non-Euclidean geometries.
Joint work with Sibylle Marcotte and Gabriel Peyré.
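A minimal numerical illustration of such a conservation law, for a two-layer ReLU network trained by plain gradient descent with a small step size (this is the well-known per-neuron balancedness law, used here only to make the notion concrete; the illustrative network and data are arbitrary):

```python
# Check that ||w_j||^2 - v_j^2 stays (approximately) constant along gradient descent
# for a two-layer ReLU network f(x) = sum_j v_j * relu(w_j . x) with squared loss.
import numpy as np

rng = np.random.default_rng(1)
n, d, m, lr = 64, 5, 10, 1e-3
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
W = 0.3 * rng.standard_normal((m, d))   # hidden weights w_j
v = 0.3 * rng.standard_normal(m)        # output weights v_j

def conserved(W, v):
    # Candidate conservation law: one value per hidden neuron.
    return (W**2).sum(axis=1) - v**2

c0 = conserved(W, v)
for _ in range(2000):
    pre = X @ W.T                        # (n, m) pre-activations
    h = np.maximum(pre, 0.0)             # ReLU features
    r = h @ v - y                        # residuals
    grad_v = h.T @ r / n
    grad_W = ((r[:, None] * (pre > 0)) * v).T @ X / n
    v -= lr * grad_v
    W -= lr * grad_W

# Drift is only the O(lr) discretization error of the underlying gradient flow.
print(np.max(np.abs(conserved(W, v) - c0)))
```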
There are several competing strategies for achieving equivariance in neural networks. In this talk, we are going to take a closer theoretical look at two of them: data augmentation and architecture restriction. Our main question is when the two strategies are equivalent. The analysis will reveal that a geometrical relation between the network architecture and the space of equivariant linear layers implies a weak equivalence of the two strategies. This talk is based on joint work with Fredrik Ohlsson and Oskar Nordenfors.
Training neural networks with first-order optimisation methods is at the core of the empirical success of deep learning. The scale of initialisation is a crucial factor, as small initialisations are generally associated with a feature-learning regime, for which gradient descent is implicitly biased towards simple solutions. This work provides a general and quantitative description of the early alignment phase, originally introduced by Maennel et al. (2018). For small initialisation and one-hidden-layer ReLU networks, the early stage of the training dynamics leads to an alignment of the neurons towards key directions. This alignment induces a sparse representation of the network, which is directly related to the implicit bias of gradient flow at convergence. This sparsity-inducing alignment, however, comes at the expense of difficulties in minimising the training objective: we also provide a simple data example for which overparameterised networks fail to converge towards global minima and only converge to a spurious stationary point instead.
A core challenge in mathematical data science is to understand and leverage intrinsic structures of sets. With the rise of deep learning, however, the focus has been shifting more and more from explicit structural regularization in inverse problems and related fields to implicit regularization in massively overparametrized machine learning models. In this talk, I will present theoretical results on the implicit bias of gradient descent in overparametrized linear regression and matrix factorization. I will particularly discuss the role of reparametrization in the presented theory.
The variational lower bound (a.k.a. ELBO or free energy) is the central objective for many established as well as many novel algorithms for unsupervised learning. Different reformulations and generalizations of the ELBO have been used for different models including standard variational autoencoders (VAEs), beta-VAEs, diffusion models and many other probabilistic data models. Here we study a novel approach that allows for rewriting the ELBO as a sum of entropies. Given common probabilistic graphical models, we first show that the ELBO converges to an expression summing (positive and negative) entropy terms. The result applies under mild conditions, e.g., distributions defining the graphical model have to be in the exponential family. Secondly, we discuss entropy sums as a tool for the analysis of ELBO-based learning; and we will show how properties of entropies can be exploited for concrete models such as VAEs. Finally, we will present recent results showing how convergence to entropy sums can be used to transform ELBOs to entropy sum objectives, i.e., we will present example models for which entropy sums can directly serve as learning objectives. Numerical results of entropy-based analysis as well as entropy-based learning will support the presented theoretical results. The talk will close with a discussion of related research of my lab and of future work.
Joint work with: Simon Damm, Dmytro Velychko, Jan Warnken, Zhenwen Dai, Dennis Forster, Asja Fischer
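For orientation, recall the standard form of the ELBO for a latent-variable model p_θ(x, z) with variational posterior q_φ(z|x); the talk's results concern what this quantity converges to at stationary points of learning.

```latex
% Standard evidence lower bound (ELBO / negative free energy).
\mathcal{L}(\theta, \phi; x)
 \;=\; \mathbb{E}_{q_\phi(z \mid x)}\!\bigl[\log p_\theta(x \mid z)\bigr]
 \;-\; \mathrm{KL}\!\bigl(q_\phi(z \mid x)\,\|\,p_\theta(z)\bigr)
 \;\le\; \log p_\theta(x).
```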
This talk will present a non-asymptotic analysis of recurrent neural networks trained with gradient descent for the regression problem in the kernel regime without massive overparameterization. Our in-depth analysis (i) provides sharp bounds on the network size and iteration complexity in terms of the sequence length, sample size and ambient dimension, and (ii) identifies the significant impact of long-term dependencies in the dynamical system on the performance bounds, characterized by a cutoff point that depends on the modulus of continuity of the activation function. Remarkably, this analysis reveals that an appropriately initialized recurrent neural network trained with n samples can achieve near-optimality with a network size that scales only logarithmically with n. This sharply contrasts with prior works that require a high-order polynomial dependency of the network size on n to establish strong regularity conditions. Our results are based on an explicit characterization of the class of dynamical systems that can be approximated and learned by recurrent neural networks via norm-constrained transportation mappings, and on establishing local smoothness properties of the hidden state with respect to the learnable parameters.
Neural networks with a large number of parameters do not overfit due to implicit regularization favoring good networks. This challenges the adequacy of parameter count as a measure of complexity. To understand implicit regularization, we study the output set geometry. Its local dimension, called batch functional dimension, varies. The batch functional dimension is almost surely determined by hidden layer activation patterns and remains invariant to network parameterization symmetries. Empirical evidence shows that the batch functional dimension decreases during optimization, when computed on training inputs and test inputs, resulting in parameters with low batch functional dimensions. We call this phenomenon geometry-induced implicit regularization. The batch functional dimension depends on both network parameters and inputs. We investigate the maximum achievable batch functional dimension under varying inputs for fixed parameters. We demonstrate that this quantity, termed computable full functional dimension, remains invariant to network parameterization symmetries and is determined by the achievable activation patterns. In addition, we present a sampling theorem that highlights the rapid convergence in the estimation of the computable full functional dimension. Empirical findings suggest that it is close to the number of parameters, which is related to the local identifiability of the parameters.
Neural networks are renowned for their ability to memorize datasets, often achieving near-zero training loss via gradient descent optimization. Despite this capability, they also demonstrate remarkable generalization to new data. This paper delves into studying the generalization behavior of neural networks trained with logistic loss through the lens of algorithmic stability. Our focus lies on the neural tangent regime, where network weights move a constant distance from initialization to solution to achieve minimal training loss. Our main finding reveals that under NTK-separability, optimal test loss bounds are achievable if the network width is at least poly-logarithmically large with respect to the number of training samples. This departure from existing generalization outcomes using algorithmic stability, which typically require polynomial width and yield suboptimal rates, underscores the significance of our approach. Moreover, our analysis presents improved generalization bounds and width lower bounds compared to prior works employing alternative methods such as uniform convergence via Rademacher complexity. The key to this improvement lies in leveraging the Hessian information of the objective function during gradient descent iterates. We demonstrate that neural networks of sufficiently large width trained by the logistic loss satisfy an approximate quasi-convexity property along the gradient descent path. To demonstrate the practical implications of our findings, we specialize our analysis to an XOR dataset, where we present refined width conditions.
Biography: Hossein Taheri received the B.Sc. degree in electrical engineering and mathematics from the Sharif University of Technology, Tehran, Iran, in 2018. He is currently pursuing the Ph.D. degree in electrical and computer engineering under the guidance of Christos Thrampoulidis at the University of California at Santa Barbara. His main area of research is statistical learning and optimization.
In this talk, I will present a framework, inspired by random matrix theory, for analyzing the dynamics of stochastic optimization algorithms (e.g., stochastic gradient descent (SGD) and momentum (SGD + M)) when both the number of samples and dimensions are large. Using this new framework, we show that the dynamics of optimization algorithms on generalized linear models and multi-index problems with random data become deterministic in the large sample and dimensional limit. In particular, the limiting dynamics for stochastic algorithms are governed by an ODE. In the least squares setting, from this model, we identify a stability measurement, the implicit conditioning ratio (ICR), which regulates the ability of SGD+M to accelerate the algorithm. When the batch size exceeds this ICR, SGD+M converges linearly at a rate of O(1/√κ), matching optimal full-batch momentum (in particular performing as well as full-batch, but with a fraction of the size). For batch sizes smaller than the ICR, in contrast, SGD+M has rates that scale like a multiple of the single-batch SGD rate. We give explicit choices for the learning rate and momentum parameter in terms of the Hessian spectra that achieve this performance. Finally, we show that this model matches performance on real data sets.
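For reference, the SGD+M (heavy-ball) iteration analyzed here can be written as follows; the talk's contribution includes the specific choices of the learning rate η and momentum parameter β in terms of the Hessian spectra.

```latex
% Stochastic gradient descent with heavy-ball momentum, on minibatch gradients g_k.
v_{k+1} \;=\; \beta\, v_k \;-\; \eta\, g_k(\theta_k),
\qquad
\theta_{k+1} \;=\; \theta_k \;+\; v_{k+1}.
```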
Gradient-based methods are the workhorse behind modern machine learning. While they have been extensively studied in the basic framework of supervised learning, they are far less understood in the framework of optimal control, which in its broadest form is equivalent to reinforcement learning. There, algorithms that learn a policy via gradient updates are known as policy gradient methods. In this talk, I will present two recent works analyzing the optimization dynamics and implicit bias of policy gradient, in different contexts. The first work identifies a vanishing gradients problem that occurs when using policy gradient to finetune language models. I will demonstrate the detrimental effects of this phenomenon and present possible solutions. The second work characterizes how the implicit bias of policy gradient affects extrapolation to initial states unseen in training, focusing on the fundamental Linear Quadratic Regulator (LQR) control problem. Overall, our results highlight that the optimization dynamics and implicit bias of policy gradient can substantially differ from those of gradient-based methods in supervised learning, hence require dedicated study.
Works covered in the talk were in collaboration with Nadav Cohen, Yotam Alexander, Edo Cohen-Karlik, Raja Giryes, Amir Globerson, Hattie Zhou, Omid Saremi, Vimal Thilak, Arwen Bradley, Preetum Nakkiran, Joshua Susskind, and Etai Littwin.
Machine learning models, while effective in controlled environments, can fail catastrophically when exposed to unexpected conditions upon deployment. This lack of robustness, well-documented even in state-of-the-art models, can lead to severe harm in high-stakes, safety-critical application domains such as healthcare. This shortcoming raises a central question: How can we develop machine learning models we can trust?
In this talk, I will approach this question from a probabilistic perspective and address deficiencies in the trustworthiness of neural network models using Bayesian principles. Specifically, I will show how to improve the reliability and fairness of neural networks with data-driven domain-informed prior distributions over model parameters. To do so, I will first demonstrate how to perform training in neural networks with such priors using a simple variational optimization objective with a regularizer that reflects the constraints implicitly encoded in the prior. This regularizer is mathematically simple, easy to implement, and can be used as a drop-in replacement for existing regularizers when performing supervised learning in neural networks of any size. I will then show how to construct and use domain-informed data-driven priors to improve uncertainty quantification and group robustness in neural network models for selected application domains.
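Schematically, the training objective described above has the familiar variational form, with the prior entering only through a regularization term (a sketch of the general shape, not the talk's specific construction):

```latex
% Variational objective: expected negative log-likelihood plus a prior-matching regularizer,
% where p(w) is the data-driven, domain-informed prior and q_\phi(w) the approximate posterior.
\min_{\phi}\;\; \mathbb{E}_{q_\phi(w)}\!\bigl[-\log p(\mathcal{D} \mid w)\bigr]
\;+\; \lambda\, \mathrm{KL}\!\bigl(q_\phi(w)\,\|\,p(w)\bigr).
```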
Sharpness of minima is a promising quantity that can correlate with generalization in deep networks and, when optimized during training, can improve generalization. However, standard sharpness is not invariant under reparametrizations of neural networks, and, to fix this, reparametrization-invariant sharpness definitions have been proposed, most prominently adaptive sharpness (Kwon et al., 2021). But does it really capture generalization in modern practical settings? We comprehensively explore this question in a detailed study of various definitions of adaptive sharpness in settings ranging from training from scratch on ImageNet and CIFAR-10 to fine-tuning CLIP on ImageNet and BERT on MNLI. We focus mostly on transformers, for which little is known in terms of sharpness despite their widespread usage. Overall, we observe that sharpness does not correlate well with generalization but rather with some training parameters like the learning rate that can be positively or negatively correlated with generalization depending on the setup. Interestingly, in multiple cases, we observe a consistent negative correlation of sharpness with out-of-distribution error, implying that sharper minima can generalize better. Finally, we illustrate on a simple model that the right sharpness measure is highly data-dependent, and that we do not understand this aspect well for realistic data distributions.
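For orientation, worst-case sharpness and its adaptive variant can be written schematically as follows; the elementwise scaling c(w), typically |w|, follows Kwon et al. (2021), and the exact norms and variants differ across the definitions compared in the talk.

```latex
% Worst-case sharpness (left) and adaptive sharpness with elementwise rescaling (right);
% \oslash denotes elementwise division.
S_\rho(w) = \max_{\|\delta\|_2 \le \rho} L(w + \delta) - L(w),
\qquad
S^{\mathrm{adapt}}_\rho(w) = \max_{\|\delta \oslash c(w)\|_p \le \rho} L(w + \delta) - L(w).
```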
Statistical learning theory provides bounds on the necessary number of training samples needed to reach a prescribed accuracy in a learning problem formulated over a given target class. This accuracy is typically measured in terms of a generalization error, that is, an expected value of a given loss function. However, for several applications -- for example in a security-critical context or for problems in the computational sciences -- accuracy in this sense is not sufficient. In such cases, one would like to have guarantees for high accuracy on every input value, that is, with respect to the uniform norm. In this paper we precisely quantify the number of training samples needed for any conceivable training algorithm to guarantee a given uniform accuracy on any learning problem formulated over target classes containing (or consisting of) ReLU neural networks of a prescribed architecture. We prove that, under very general assumptions, the minimal number of training samples for this task scales exponentially both in the depth and the input dimension of the network architecture. As a corollary we conclude that the training of ReLU neural networks to high uniform accuracy is intractable. In a security-critical context this points to the fact that deep learning based systems are prone to being fooled by a possible adversary. We corroborate our theoretical findings by numerical results.
ArXiv / ICLR'23: https://arxiv.org/abs/2205.13531
https://openreview.net/forum?id=nchvKfvNeX0
Github: https://github.com/juliusberner/theory2practice
https://jberner.info/
I will revisit the problem of maximizing information divergence from a new perspective using logarithmic Voronoi polytopes. We will see that for linear models, the maximum is always achieved at the boundary of the probability simplex. For toric models, I will describe an algorithm that combines the combinatorics of the chamber complex with numerical algebraic geometry. I will pay special attention to reducible models and models of maximum likelihood degree one, with many colorful examples. This talk is based on joint work with Serkan Hoşten.
We develop an analytical framework to characterize the set of optimal ReLU neural networks by reformulating the non-convex training problem as a convex program. We show that the global optima of the convex parameterization are given by a polyhedral set and then extend this characterization to the optimal set of the non-convex training objective. Since all stationary points of the ReLU training problem can be represented as optima of sub-sampled convex programs, our work provides a general expression for all critical points of the non-convex objective. We then leverage our results to provide an optimal pruning algorithm for computing minimal networks, establish conditions for the regularization path of ReLU networks to be continuous, and develop sensitivity results for minimal ReLU networks.
Model reparametrization, which follows the change-of-variable rule of calculus (not to be confused with weight-space symmetry), is a popular way to improve the training of neural nets, e.g. in WeightNorm. But it can also be problematic since it can induce inconsistencies in, e.g., Hessian-based flatness measures, optimization trajectories, and modes of probability densities. This complicates downstream analyses: e.g. one cannot definitively relate flatness with generalization since arbitrary reparametrization changes their relationship. In this talk, I will present a study of the invariance of neural nets under reparametrization from the perspective of Riemannian geometry. From this point of view, invariance is an inherent property of any neural net if one explicitly represents the metric and uses the correct associated transformation rules. This is important since although the metric is always present, it is often implicitly assumed as identity, and thus dropped from the notation, then lost under reparametrization. I will discuss implications for measuring the flatness of minima, optimization, and for probability-density maximization. As a bonus, I will also give a teaser of our other recent work in exploiting the geometry of preconditioning matrices to develop an inverse-free, structured KFAC-like second-order optimization method for very large, modern neural nets like transformers. The resulting method is thus numerically stable in low precision and also memory efficient.
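To make the "metric plus correct transformation rules" point concrete: under a reparametrization θ = φ(η) with Jacobian J = ∂φ/∂η, the familiar objects transform as below, so any quantity built from the metric, such as the natural gradient, describes the same update in either coordinate system (a textbook statement, included only for orientation).

```latex
% Transformation rules under \theta = \varphi(\eta), with Jacobian J = \partial\varphi/\partial\eta.
\nabla_\eta L = J^{\top} \nabla_\theta L,
\qquad
G_\eta = J^{\top} G_\theta\, J,
\qquad
G_\eta^{-1} \nabla_\eta L = J^{-1}\, G_\theta^{-1} \nabla_\theta L ,
```

and the last identity is exactly how an update direction (a tangent vector) must transform; plain gradient descent instead treats the covector ∇L as an update direction, which is why it, and Hessian-based flatness, are not reparametrization invariant.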
This talk will discuss some nontrivial but often pleasant effects of large learning rates, which are commonly used in machine learning practice for improved empirical performance but defy traditional theoretical analyses. I will first quantify how large learning rates can help gradient descent on multiscale landscapes escape local minima via chaotic dynamics, which provides an alternative to the commonly known escape mechanism due to noise from stochastic gradients. I will then report how large learning rates provably bias toward flatter minimizers. Several related, perplexing phenomena have been empirically observed recently, including Edge of Stability, loss catapulting, and balancing. I will, for the first time, unify them and explain that they are all algorithmic implicit biases of large learning rates. These results are enabled by a first global convergence proof of gradient descent for certain nonconvex functions without Lipschitz gradient. This theory also provides an understanding of when Edge of Stability and other large-learning-rate implicit biases occur.
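A useful reference point for these phenomena is the classical stability threshold of gradient descent on a quadratic, which is where the Edge of Stability terminology comes from (standard facts, stated for orientation):

```latex
% Gradient descent x_{k+1} = x_k - \eta \nabla f(x_k) on a quadratic contracts along an
% eigendirection with Hessian eigenvalue \lambda iff |1 - \eta\lambda| < 1, i.e.
\eta \;<\; \frac{2}{\lambda_{\max}\!\bigl(\nabla^2 f\bigr)};
\qquad
\text{Edge of Stability: training hovers with } \;\lambda_{\max}\!\bigl(\nabla^2 f\bigr) \approx \frac{2}{\eta}.
```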
Designing effective positional encodings for graphs is key to building powerful graph transformers and enhancing the expressive power of message-passing graph neural networks. However, since graph-structured data lack a canonical ordering of nodes, the choice of positional encodings for graphs is often tricky. For example, Laplacian eigenmaps are used as positional encodings in many works. However, they face two fundamental challenges: (1) Non-uniqueness: there are many different eigendecompositions of the same Laplacian, and (2) Instability: small perturbations to the Laplacian could result in completely different eigenvectors, leading to unpredictable changes in the positional encoding. This instability is governed by the Davis-Kahan theorem and further negatively impacts model generalization. In this talk, we will introduce some ideas on building stable positional encodings and show their benefits for out-of-distribution generalization. The idea can be extended to some other types of node positional encodings. Finally, we evaluate the effectiveness of our method on molecular property prediction, link prediction, and out-of-distribution generalization tasks, finding improved generalization compared to existing positional encoding methods.
I will mainly talk about these two papers:
1. Equivariant and Stable Positional Encoding for More Powerful Graph Neural Networks, by Haorui Wang, Haoteng Yin, Muhan Zhang, Pan Li
2. On the Stability of Expressive Positional Encodings for Graphs, by Yinan Huang, William Lu, Joshua Robinson, Yu Yang, Muhan Zhang, Stefanie Jegelka, Pan Li
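To make the non-uniqueness issue concrete, here is a minimal sketch of Laplacian-eigenvector positional encodings for a toy graph; the sign flip at the end is one of the ambiguities (along with basis rotations inside repeated eigenvalues) that the stable encodings in the papers above are designed to handle.

```python
# Minimal Laplacian-eigenmap positional encodings, illustrating the sign ambiguity.
import numpy as np


def laplacian_pe(adj: np.ndarray, k: int) -> np.ndarray:
    """First k nontrivial eigenvectors of the (unnormalized) graph Laplacian."""
    lap = np.diag(adj.sum(axis=1)) - adj
    eigvals, eigvecs = np.linalg.eigh(lap)   # eigenvalues in ascending order
    return eigvecs[:, 1:k + 1]               # skip the constant eigenvector


# Toy graph: a 5-cycle.
A = np.zeros((5, 5))
for i in range(5):
    A[i, (i + 1) % 5] = A[(i + 1) % 5, i] = 1.0

pe = laplacian_pe(A, k=2)
# Negating any column (or rotating within a repeated eigenspace) yields an equally valid
# eigendecomposition, so a model consuming pe directly is not a well-defined function of the graph.
pe_flipped = pe * np.array([1.0, -1.0])
```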
Neural networks with random weights appear in a variety of machine learning applications, most prominently as the initialization of many deep learning algorithms and as a computationally cheap alternative to fully learned neural networks. The first goal of this talk is to enhance the theoretical understanding of random neural networks by addressing the following data separation problem: under which conditions can a random neural network make two classes (with positive distance) linearly separable? I will show that a sufficiently large two-layer ReLU-network with Gaussian weights and uniformly distributed biases can solve this problem with high probability. Building on the insights behind this result, I will next present a simple randomized algorithm to produce a small interpolating neural net for a given dataset with two classes. In both results, the size of the network is explicitly linked to geometric properties of the two classes and their mutual arrangement. This instance-specific viewpoint allows to overcome the curse of dimensionality and to prove memorization results beyond worst case lower bounds.
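The separation question above is easy to probe numerically: embed two classes with a random ReLU layer and check whether a linear classifier reaches zero training error. The sketch below uses arbitrary toy classes (two concentric circles) and is only an empirical illustration of the statement, not a proof.

```python
# Empirical check: do random ReLU features make two (toy) classes linearly separable?
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X, y = make_circles(n_samples=400, factor=0.5, noise=0.02, random_state=0)

# Random two-layer ReLU features: Gaussian weights and uniformly distributed biases.
width = 500
W = rng.standard_normal((X.shape[1], width))
b = rng.uniform(-1.0, 1.0, size=width)
features = np.maximum(X @ W + b, 0.0)

# A linear classifier on the random features; training accuracy 1.0 indicates separability.
clf = LinearSVC(C=1e4, max_iter=20000).fit(features, y)
print("training accuracy:", clf.score(features, y))
```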
In this talk, I will present a framework, inspired by random matrix theory, for analyzing the dynamics of stochastic optimization algorithms (e.g., stochastic gradient descent (SGD) and SGD with momentum (SGD+M)) when both the number of samples and the dimension are large. Using this new framework, we show that the dynamics of optimization algorithms on generalized linear models and multi-index problems with random data become deterministic in the large-sample, large-dimension limit. In particular, the limiting dynamics of the stochastic algorithms are governed by an ODE. In the least-squares setting, from this model, we identify a stability measure, the implicit conditioning ratio (ICR), which regulates the ability of SGD+M to accelerate. When the batch size exceeds this ICR, SGD+M converges linearly at a rate of O(1/√κ), matching optimal full-batch momentum (in particular, performing as well as full-batch momentum while using only a fraction of its batch size). For batch sizes smaller than the ICR, in contrast, SGD+M has rates that scale like a multiple of the single-batch SGD rate. We give explicit choices for the learning rate and momentum parameter, in terms of the Hessian spectrum, that achieve this performance. Finally, we show that this model matches performance on real datasets.
In many problems arising in data science and scientific engineering, one must reconstruct a signal from a few corrupted measurements, a class of problems known more broadly as inverse problems. Due to the ill-posedness of such problems, one requires a prior: a mathematical model of what makes some signals natural. Recently, there have been great empirical strides in deep generative models, a class of neural-network-based models that learn to sample signals from a distribution of interest. Compared to classical priors like sparsity, generative priors lack the same understanding of when and why they work well, especially in relation to notions such as statistical complexity and computational efficiency. Here, I will discuss some theory and applications of using such priors in inverse problems. I will first present a rigorous recovery guarantee for generative priors in the nonlinear inverse problem of phase retrieval, showing that generative models enable an efficient algorithm from generic measurements with optimal sample complexity. In the second part of the talk, I will address how one can exploit ideas from this approach to tackle problems where ground-truth data may be unavailable in practice, including problems arising from black hole imaging.
I will introduce the idea of phases and phase transitions of the Bayesian posterior in the setting of singular learning theory, and discuss how a simple auto-encoder model introduced by Anthropic in their research on neural network interpretability displays a rich set of phase transitions both in the posterior and over the course of training. I'll explain a research program we term "developmental" interpretability, which aims to use phase transitions as the basic primitive for understanding the internal structure of computation in neural networks.
Despite the widespread proliferation of neural networks, the mechanisms through which they operate so successfully are not well understood. In this talk, we will first explore empirical and theoretical investigations into neural network training and generalization and what they can tell us about why deep learning works. Then, we will examine a recent line of work on algorithm learning. While typical neural networks are designed for pattern matching tasks, we consider whether neural networks can learn algorithms that scale to problem instances orders of magnitude larger than those seen during training.
In this talk I’ll discuss recent work studying benign overfitting in two-layer ReLU networks trained using gradient descent and hinge loss on noisy data for binary classification. In particular, we consider linearly separable data for which a relatively small proportion of labels are corrupted and identify conditions on the margin of the clean data which give rise to three distinct training outcomes: benign overfitting, in which zero loss is achieved and with high probability test data is classified correctly; overfitting, in which zero loss is achieved but test data is misclassified with probability lower bounded by a constant; and non-overfitting, in which clean points, but not corrupt points, achieve zero loss and again with high probability test data is classified correctly. Our analysis provides a fine-grained description of the dynamics of neurons throughout training and reveals two distinct phases: in the first phase clean points achieve close to zero loss, in the second phase clean points oscillate on the boundary of zero loss while corrupt points either converge towards zero loss or are eventually zeroed by the network. We prove these results using a combinatorial approach that involves bounding the number of clean versus corrupt updates across these phases of training.
The convergence of scientific computing with statistics and machine learning is an exciting recent development. In this talk, I will present two lines of work that blur the line between statistical inference and numerical computation. The first part of the talk presents state-of-the-art algorithms for solving elliptic PDEs by interpreting them as Gaussian processes and exploiting their conditional independence properties. This approach allows the efficient learning of elliptic solution operators from solution pairs, with promising empirical results on fractional-order PDEs and empirical closure models of turbulent flows.
The second part of the talk discusses how to mitigate the formation of shock singularities in the barotropic Euler equations using an inviscid regularization. This work combines the seminal work of Vladimir Arnold on geometric hydrodynamics with ideas from interior point methods for positive definite programming and the information geometry of Amari and Chentsov.
Recent years have seen substantial interest in a first-principles theoretical understanding of the behavior of overparameterized models that interpolate noisy training data, motivated by their surprising empirical success. In this talk, I compare classification and regression tasks in the overparameterized linear model. On the one hand, we show that with sufficient overparameterization, solutions obtained by training on the squared loss (i.e., minimum-norm interpolation), typically used for regression, are identical to those produced by training on exponentially and polynomially tailed losses (e.g., the max-margin support vector machine), typically used for classification. On the other hand, we show that there exist regimes where these solutions are consistent when evaluated by the 0-1 test loss, but inconsistent when evaluated by the mean-squared-error test loss. Our results demonstrate that: a) different loss functions at the training (optimization) phase can yield similar solutions, and b) a significantly higher level of effective overparameterization admits good generalization in classification tasks as compared to regression tasks.
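A quick numerical check in the spirit of this equivalence (a hedged sketch of my own, with arbitrary sizes; the precise conditions are those stated in the talk): with heavy overparameterization, the minimum-ℓ2-norm interpolator of ±1 labels points in nearly the same direction as the hard-margin SVM solution.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, d = 50, 5000                                   # heavily overparameterized: d >> n
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)

# Minimum-norm interpolator of the labels (the "regression" solution).
w_ls = X.T @ np.linalg.solve(X @ X.T, y)          # equals the pseudoinverse solution X^+ y

# Approximate hard-margin SVM direction (the "classification" solution), via a huge C.
# Note: SVC also fits an intercept, which is negligible in this regime.
svm = SVC(kernel="linear", C=1e6).fit(X, y)
w_svm = svm.coef_.ravel()

cos = w_ls @ w_svm / (np.linalg.norm(w_ls) * np.linalg.norm(w_svm))
print("cosine similarity between the two directions:", cos)   # typically close to 1 when d >> n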
Deep learning has been wildly successful in practice, and most state-of-the-art artificial intelligence systems are based on neural networks. Lacking, however, is a rigorous mathematical theory that adequately explains the amazing performance of deep neural networks. In this talk, I present a new mathematical framework that provides the beginning of a deeper understanding of deep learning. This framework precisely characterizes the functional properties of trained neural networks. The key mathematical tools that support this framework include transform-domain sparse regularization, the Radon transform of computed tomography, and approximation theory. This framework explains the effect of weight decay regularization in neural network training, the importance of skip connections and low-rank weight matrices in network architectures, the role of sparsity in neural networks, and why neural networks can perform well in high-dimensional problems.
Training deep neural networks for classification often includes minimizing the training loss beyond the zero training error point. In this phase of training, a "neural collapse" (NC) behavior has been empirically observed: the variability of features (outputs of the penultimate layer) of within-class samples decreases and the mean features of different classes approach a certain tight frame structure. Recent papers have shown that minimizers with this structure emerge when optimizing a simplified "unconstrained features model" (UFM) with a regularized cross-entropy loss. In this talk, I will start with a review of the empirical findings on the NC phenomenon. Then, I will present some of our theoretical results (from our publications at ICML 2022 and ICML 2023). Namely, we show that the minimizers of a UFM exhibit (a somewhat more delicate) NC structure also with regularized MSE loss, and we analyze the gradient flow in this case, identifying the different dynamics of the features' within- and between-class covariance matrices. Then, we extend the UFM in two different ways to provide mathematical reasoning for the depthwise empirical NC behavior and the effect of regularization hyperparameters on the closeness to collapse. We support our theory with experiments in practical deep learning settings.
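For readers who want the measured quantities to be concrete, here is a small sketch (my own, not the speakers' code) of the standard within-class and between-class covariance computation used to quantify variability collapse, often reported as an NC1 statistic, given penultimate-layer features and labels.

import numpy as np

def nc1_variability(features, labels):
    # features: (N, d) penultimate-layer outputs; labels: (N,) integer class indices.
    classes = np.unique(labels)
    mu_global = features.mean(axis=0)
    d = features.shape[1]
    Sigma_W = np.zeros((d, d))                      # within-class covariance
    Sigma_B = np.zeros((d, d))                      # between-class covariance
    for c in classes:
        Fc = features[labels == c]
        mu_c = Fc.mean(axis=0)
        diff = Fc - mu_c
        Sigma_W += diff.T @ diff / len(features)
        m = (mu_c - mu_global)[:, None]
        Sigma_B += (m @ m.T) * len(Fc) / len(features)
    # One common NC1 statistic: trace(Sigma_W pinv(Sigma_B)) / number of classes.
    return np.trace(Sigma_W @ np.linalg.pinv(Sigma_B)) / len(classes)

# Toy usage: random features show no collapse; nearly one-hot features are collapsed.
rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=400)
feats_random = rng.normal(size=(400, 16))
feats_collapsed = 10.0 * np.eye(4)[labels] + 0.01 * rng.normal(size=(400, 4))
print("NC1 (random features)   :", nc1_variability(feats_random, labels))
print("NC1 (collapsed features):", nc1_variability(feats_collapsed, labels))

In the collapsed case the within-class variability is negligible relative to the spread of the class means, which is the regime the unconstrained features model is designed to analyze.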
In this talk, we will propose a new model and algorithms that go beyond worst-case rates in sequential prediction. In particular, we will focus on sequential prediction over a stochastic sequence with an adversary that is allowed to inject clean-label adversarial (or out-of-distribution) examples as and when they desire. Traditional algorithms designed to handle purely stochastic data tend to fail in the presence of such adversarial examples, often leading to erroneous predictions, whereas, assuming fully adversarial data leads to very pessimistic bounds that are often vacuous in practice. To mitigate this, we will introduce a new model of sequential prediction that sits between the purely stochastic and fully adversarial settings by allowing the learner to abstain from making a prediction at no cost on adversarial examples. Assuming access to the marginal distribution on the non-adversarial examples, we will present a learner whose error scales with the VC dimension (mirroring the stochastic setting) of the hypothesis class, as opposed to the Littlestone dimension which characterizes the fully adversarial setting. Furthermore, we will design learners for certain special hypothesis classes including VC dimension 1 classes, which work even in the absence of access to the marginal distribution. We will conclude with several open questions and plausible connections of our framework with existing models. This is based on joint work with Steve Hanneke, Shay Moran, and Abhishek Shetty.
Inspired by the remarkable success of deep neural networks, there has been significant interest in understanding the generalization performance of overparameterized models. Substantial efforts have been invested in characterizing how optimization algorithms impact generalization through their "preferred" solutions, a phenomenon commonly referred to as implicit regularization. For instance, it has been argued that gradient descent (GD) induces an implicit ℓ2-norm regularization in regression and classification problems. However, in prior literature, the implicit regularization of different algorithms is confined to either a specific geometry or a particular class of learning problems. To address this gap, we present a unified approach using mirror descent (MD), a notable generalization of GD, to control implicit regularization in both regression and classification settings. More specifically, we show that MD with the general class of homogeneous potential functions converges in direction to a generalized maximum-margin solution for linear classification problems, thereby answering a long-standing question in the classification setting. Furthermore, MD can be implemented efficiently, and we demonstrate through experiments that MD is a versatile method for producing learned models with different regularizers and different generalization performance.
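To make the mirror-descent mechanism concrete, here is a minimal sketch of my own (not the authors' code), using a homogeneous p-norm potential as one example of the general class discussed, applied to linear classification with logistic loss.

import numpy as np
from scipy.special import expit

def mirror_descent_pnorm(X, y, p=2.0, lr=0.1, iters=5000):
    # Mirror descent with the homogeneous potential psi(w) = (1/p) * sum_i |w_i|^p.
    # Mirror map: (grad psi)(w)_i = sign(w_i) |w_i|^(p-1); below we apply its inverse.
    q = 1.0 / (p - 1.0)
    z = np.zeros(X.shape[1])                      # dual (mirror) variable
    for _ in range(iters):
        w = np.sign(z) * np.abs(z) ** q           # primal iterate w = (grad psi)^{-1}(z)
        margins = y * (X @ w)
        grad = -(X * (y * expit(-margins))[:, None]).mean(axis=0)   # logistic-loss gradient
        z = z - lr * grad                         # gradient step taken in the dual space
    return np.sign(z) * np.abs(z) ** q

# Toy usage: on separable data with a sparse ground truth, smaller p concentrates the
# learned direction on fewer coordinates, illustrating how the potential shapes the
# implicit regularization; p = 2 recovers plain gradient descent.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20); w_true[:3] = 1.0
y = np.sign(X @ w_true)
for p in [1.1, 2.0, 3.0]:
    w = mirror_descent_pnorm(X, y, p=p)
    print(f"p={p}: share of |w| on the 3 relevant coordinates = "
          f"{np.abs(w[:3]).sum() / np.abs(w).sum():.2f}")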
How do we ensure that the machine learning algorithms in high-stakes applications, such as hiring, lending, admissions, etc., are fair, explainable, and lawful? Towards addressing this urgent question, this talk will provide some strategies that are deep-rooted in information theory, causality, and statistics. I will discuss a question that bridges the fields of fairness, explainability, and law: how do we check whether the disparity in a model is purely due to critical occupational necessities or not? We propose a systematic measure of the legally non-exempt disparity that brings together a body of work in information theory called Partial Information Decomposition and connects it with causality. I will also briefly talk about some of our other research interests in related topics, such as quantifying accuracy-fairness tradeoffs using Chernoff information, as well as robust counterfactual explanations with theoretical guarantees.
I will describe recent progress on providing rigorous convergence guarantees for score-based generative models (SGMs) such as denoising diffusion probabilistic models (DDPMs), which constitute the backbone of large-scale real-world generative models such as DALL⋅E 2. In the first part of the talk, I will show that such SGMs can efficiently sample from essentially any realistic data distribution, even ones which are highly non-log-concave. In the second part of the talk, I will show how to extend these guarantees to deterministic samplers based on discretizing the so-called probability flow ODE, which ultimately leads to better dependence on the dimension. All of these results assume access to an oracle for score estimation; time permitting, at the end I will briefly touch upon how to provably implement this oracle for interesting classes of distributions like Gaussian mixtures.
Based on the following works: https://arxiv.org/abs/2209.11215, https://arxiv.org/abs/2303.03384, https://arxiv.org/abs/2305.11798, https://arxiv.org/abs/2307.01178
In practice, encoding group invariances into models improves sample complexity. In this talk, we study this phenomenon from a theoretical perspective. In the first part, we provide minimax optimal rates for kernel ridge regression on compact manifolds, with a target function that is invariant to a Lie group action on the manifold. For a finite group, the gain effectively multiplies the number of samples by the group size. For groups of positive dimension, the gain manifests as a reduction in the manifold's effective dimension, in addition to a factor proportional to the volume of the quotient space. Our proof takes the viewpoint of differential geometry, in contrast to the more common strategy of using invariant polynomials.
In the second part, we apply our techniques to group-invariant probability distributions that appear in many data-generative models in machine learning, such as graphs, point clouds, and images, and we prove tight sample complexity bounds for Wasserstein distance estimation under invariances, as well as for the density estimation problem.
This talk is based on joint work with Stefanie Jegelka.
Recent theoretical work has shown that under certain conditions, massively overparameterized neural networks are equivalent to kernel regressors with a family of kernels called Neural Tangent Kernels (NTKs). My work in this subject aims to better understand the properties of the NTK for various network architectures and relate them to the inductive bias of real neural networks. In particular, I will argue that for input data distributed uniformly on the sphere, the NTK favors low-frequency predictions over high-frequency ones, potentially explaining why overparameterized networks can generalize even when they perfectly fit their training data. I will further discuss the behavior of the NTK when data is distributed nonuniformly and show that the NTK (with ReLU activation) is tightly related to the classical Laplace kernel, which has a simple closed form. Finally, I will discuss our analysis of the NTK for convolutional networks, which indicates that these networks are biased toward learning low-frequency target functions with any higher frequencies concentrated in local regions. Overall, our results suggest that much insight about neural networks can be obtained from the analysis of the NTK.
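As a concrete reference point, here is the standard closed form for the NTK of a two-layer bias-free ReLU network on unit-norm inputs, written with arc-cosine kernels, next to a Laplace kernel (my own sketch, up to normalization constants; the formal relationship in the talk concerns their spectra on the sphere rather than pointwise values).

import numpy as np

def relu_ntk(u):
    # NTK of a two-layer ReLU network (no bias) for unit-norm inputs with inner product u,
    # expressed with the arc-cosine kernels kappa_0 and kappa_1 (up to an overall constant).
    u = np.clip(u, -1.0, 1.0)
    kappa0 = (np.pi - np.arccos(u)) / np.pi
    kappa1 = (np.sqrt(1.0 - u ** 2) + u * (np.pi - np.arccos(u))) / np.pi
    return u * kappa0 + kappa1

def laplace_kernel(x, z, c=1.0):
    return np.exp(-c * np.linalg.norm(x - z))

# Toy comparison on random points of the unit sphere in R^10.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 10))
X /= np.linalg.norm(X, axis=1, keepdims=True)
for i in range(1, 5):
    u = X[0] @ X[i]
    print(f"u={u:+.2f}  ReLU NTK={relu_ntk(u):.3f}  Laplace={laplace_kernel(X[0], X[i]):.3f}")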
Algorithmic reasoning requires capabilities which are most naturally understood through recurrent models of computation, like the Turing machine. However, Transformer models, while lacking recurrence, are able to perform such reasoning using far fewer layers than the number of reasoning steps. This raises the question: what solutions are these shallow and non-recurrent models finding? In this talk, we will formalize reasoning in the setting of automata, and show that the computation of an automaton on an input sequence of length T can be replicated exactly by Transformers with o(T) layers, which we call "shortcuts". We provide two constructions, with O(log T) layers for all automata and O(1) layers for solvable automata. Empirically, our results from synthetic experiments show that shallow solutions can also be found in practice.
Understanding the training dynamics of neural networks is quite difficult in general due to the highly nonlinear nature of the parameterization. A breakthrough in the theory of deep learning was the finding that in the infinite-width limit the gradient descent dynamics are characterized by a fixed kernel, coined the "Neural Tangent Kernel" (NTK). In this limiting regime the network is biased to learn the eigenvectors/eigenfunctions of the NTK at rates corresponding to their eigenvalues, a phenomenon known as "spectral bias". Considerable work has been done comparing the training dynamics of finite-width networks to the idealized infinite-width dynamics. These works typically compare the dynamics of a finite-width network to the dynamics of an infinite-width network where both networks are optimized via the empirical risk. In this work we compare a finite-width network trained on the empirical risk to an infinite-width network trained on the population risk. Consequently, we are able to demonstrate that the finite-width network is biased towards learning the top eigenfunctions of the NTK over the entire input domain, as opposed to describing the dynamics merely on the training set. Furthermore, we demonstrate that this holds in a regime where the network width is on the same order as the number of training samples, in contrast with prior works that require the unrealistic assumption that the network width is polynomially large in the number of samples. In a separate line of analysis, we characterize the spectrum of the NTK by expressing the NTK as a power series. We demonstrate that the NTK has a small number of large outlier eigenvalues and that the number of such eigenvalues is largely inherited from the structure of the input data. As a result, we shed further insight into why the network places a preference on learning a small number of components more quickly. In total, our results help classify the properties networks are biased towards in a variety of settings, which we hope will lead to more interpretable artificial intelligence in the long term.
High-dimensional probability is a powerful tool for understanding phenomena that typically occur in deep learning models. In particular, in this talk, we discuss different concentration bounds that entail consequences about (i) the gradient descent optimization of a deep neural network, (ii) the adversarial robustness of its solution, and (iii) its privacy guarantees. The first main result implies that minimally overparameterized deep neural networks can be successfully optimized (to zero training loss) with gradient descent, and leverages tight lower bounds on the smallest eigenvalue of the neural tangent kernel (NTK). The second builds on the recently proposed universal law of robustness, provides sharper bounds for the random features and NTK models, and consequently addresses a conjecture posed by Bubeck, Li, and Nagaraj. Finally, we draw a connection between the generalization performance of a model and its privacy guarantees, measured in terms of its safety against a family of information recovery black-box attacks.
Studying the generalization abilities of linear models with real data is a central question in statistical learning. While a limited number of prior works do validate theoretical results with real data, these works have limitations due to technical assumptions. These assumptions include having a well-conditioned covariance matrix and having independent and identically distributed data. Additionally, prior works that do address distributional shifts usually make technical assumptions on the joint distribution of the train and test data, and do not test on real data. Previous work has also shown that double descent can occur in the over-parameterized regime, while it is generally believed that the standard bias-variance trade-off holds in the under-parameterized regime.
In an attempt to address these issues and better model real data, we look at data that is not i.i.d. but has a low-rank structure. Further, we address distributional shift by decoupling assumptions on the training and test distributions. We provide analytical formulas for the generalization error of the denoising problem that are asymptotically exact. These are used to derive theoretical results for linear regression, data augmentation, principal component regression, and transfer learning. We validate all of our theoretical results on real data and observe a low relative mean squared error, of around 1%, between the empirical risk and our estimated risk. Further, we present a simple example in this paradigm that provably exhibits double descent in the under-parameterized regime.
It was observed that large language models exhibit a power-law decay of cross entropy with respect to the number of parameters and training tokens. When extrapolated literally, this decay implies that the entropy rate of natural language is zero. To better understand this phenomenon, or artifact, we construct a simple stationary stochastic process and a memory-based predictor for it that exhibit a power-law decay of cross entropy with a vanishing entropy rate. Our example is based on previously discussed Santa Fe processes, which decompose a random text into a process of narration and time-independent knowledge. Previous discussions assumed that narration is a memoryless source with Zipf's distribution. In this talk, we propose a model of narration that has a vanishing entropy rate and uses a randomly chosen deterministic sequence called a multiperiodic sequence. Under a suitable parameterization, multiperiodic sequences exhibit asymptotic relative frequencies given by Zipf's law. Remaining agnostic about the value of the entropy rate of natural language, we discuss the relevance of similar constructions for language modeling.
Based on: https://arxiv.org/abs/2302.09049
Pruning neural networks at initialization (PaI) has received an upsurge of interest due to its end-to-end saving potential: PaI finds sparse subnetworks at initialization that can achieve performance comparable to the full networks. These methods can surpass the trivial baseline of random pruning but suffer from a significant performance gap compared to post-training pruning. Previous approaches rely firmly on weights, gradients, and sanity checks as primary signals when conducting PaI analysis. To better understand the underlying mechanism of PaI, we propose to interpret it through the lens of the Ramanujan graph, a class of expander graphs that are sparse yet highly connected. It is often believed that there should be a strong correlation between Ramanujan graphs and PaI, since both are about finding sparse and well-connected networks. However, the finer-grained link relating highly sparse and connected networks to their relative performance (i.e., the ranking of different sparse structures at the same global sparsity) is still missing. We observe that not only does the Ramanujan property of sparse networks show no significant relationship to PaI's relative performance, but maximizing it can also lead to the formation of pseudo-random graphs with no structural meaning. We trace the underlying cause to the Ramanujan graph's strong assumption on the upper bound of the largest nontrivial eigenvalue (\mu) of layers belonging to highly sparse networks. We hence propose the Iterative Mean Difference of Bound (IMDB) as a means to relax the \mu upper bound. Likewise, we also show there exists a lower bound for \mu, which we call the Normalized Random Coefficient (NaRC), that gives an accurate assessment of when a sparse but highly connected structure degenerates into naive randomness. Finally, we systematically analyze the behavior of various PaI methods and demonstrate the utility of our proposed metrics in characterizing PaI performance. We show that subnetworks that better preserve the IMDB property correlate with higher performance, while NaRC provides a possible means to locate the region where highly connected, highly sparse, and non-trivial Ramanujan expanders exist.
Flatness of the loss curve is conjectured to be connected to the generalization ability of machine learning models, in particular neural networks. While empirical evidence consistently shows a strong correlation between flatness measures and generalization, the theoretical explanation for this connection remains an open problem. Our recent paper, "Relative Flatness and Generalization", proposes an explanation for this relationship by linking it to a model's robustness under feature perturbations and a measure of how representative the training data is of its underlying distribution. We establish a necessary condition for the connection between flatness and generalization to hold.
In this talk, I will provide an overview of our paper's results, focusing on the motivations, insights, and limitations of our approach.
In this talk I will present various insights into the double descent curve obtained by considering a solvable model for deep learning: the random feature model. First, I will present a fine-grained bias-variance decomposition and show how the double descent curve can be reconciled with the traditional bias-variance tradeoff. Then, I will show that two different kinds of overfitting, which are often conflated, can give rise to a "double descent" curve, and can actually occur simultaneously, leading to a triple descent curve. Finally, I will extend some of these findings to classification tasks on structured data, showing the impact of the loss function and the role of low-dimensional structures.
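A small simulation in the spirit of this talk (my own hedged sketch, not the speaker's code; exact numbers depend on the noise level and seed): ridgeless regression on random ReLU features typically shows a spike in test error near the interpolation threshold, where the number of features matches the number of training samples.

import numpy as np

rng = np.random.default_rng(0)
n, d, n_test = 100, 20, 2000
w_true = rng.normal(size=d) / np.sqrt(d)

def make_data(m):
    X = rng.normal(size=(m, d))
    return X, X @ w_true + 0.1 * rng.normal(size=m)

X_tr, y_tr = make_data(n)
X_te, y_te = make_data(n_test)

def test_error(num_features, trials=5):
    # Random ReLU features fitted by ridgeless (pseudoinverse) least squares.
    errs = []
    for _ in range(trials):
        W = rng.normal(size=(d, num_features)) / np.sqrt(d)
        Phi_tr = np.maximum(X_tr @ W, 0.0)
        Phi_te = np.maximum(X_te @ W, 0.0)
        theta = np.linalg.pinv(Phi_tr) @ y_tr        # minimum-norm least-squares solution
        errs.append(np.mean((Phi_te @ theta - y_te) ** 2))
    return np.mean(errs)

for m in [20, 50, 90, 100, 110, 200, 800]:
    print(f"features={m:4d}  test MSE={test_error(m):.3f}")   # expect a peak near m = n = 100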
The ability to make sequential decisions under uncertainty is a key component of intelligence. Despite impressive breakthroughs in deep learning in the last decade, we find that scalable and generalizable decision making has so far been elusive for current artificial intelligence (AI) systems. In this talk, I will present a new framework for sequential decision making that is derived from modern generative models for language and perception. We will instantiate our framework in 3 different paradigms for sequential decision making: offline reinforcement learning (RL), online RL, and black-box optimization, and highlight the simplicity and effectiveness of this unifying framework on a range of challenging high-dimensional benchmarks for sequential decision making.
In this talk, I will present the implicit bias of gradient flow on linear equivariant steerable networks in group-invariant binary classification. Our findings reveal that the parameterized predictor converges in direction to the unique group-invariant classifier with a maximum margin defined by the input group action. Under a unitary assumption on the input representation, we establish the equivalence between steerable networks and data augmentation. Furthermore, we demonstrate the improved margin and generalization bound of steerable networks over their non-invariant counterparts. This talk is based on joint work with Ziyu Chen at UMass Amherst.
In this talk, I will discuss our research on understanding the infinite-width limit of neural networks. In this limit, neural networks correspond to Neural Network Gaussian Processes (NNGPs) and Neural Tangent Kernels (NTKs). I will first describe our empirical study exploring the relationship between wide neural networks and neural kernel methods. Our study resolves a variety of open questions related to infinitely wide neural networks and opens up new interesting questions. In the second half of the talk, I will discuss our recent work on scaling up infinite-width neural kernel methods to millions of data points. There are unique challenges in scaling up neural kernel methods, and I will talk about our attempts to overcome them. If there is time, I will discuss some applications of the infinite-width limit of neural networks, such as dataset distillation, neural architecture search, uncertainty quantification, and neural scaling laws.
Score-based generative models and diffusion probabilistic models have exhibited exceptional performance in various applications. A natural question that arises is whether the distribution generated by the model is closely aligned with the given data distribution. In this talk, we will explore an upper bound of the Wasserstein distance between these two distributions. Based on the theory of optimal transport, we guarantee that the framework can approximate data distributions in the space of probability measures equipped with the Wasserstein distance.
This talk is based on joint work with Ying Fan and Kangwook Lee (UW-Madison).
Stochastic finite-sum optimization problems are ubiquitous in many areas such as machine learning, and stochastic optimization algorithms to solve these problems are actively studied in the literature. However, there is a major gap between practice and theory: practical algorithms shuffle and iterate through component indices, while most theoretical analyses of these algorithms assume that the indices are sampled uniformly in an i.i.d. manner. In this talk, I will discuss recent research efforts to close this theory-practice gap. We will discuss recent developments in the convergence analysis of shuffling-based optimization methods. We will first consider minimization algorithms, mainly focusing on stochastic gradient descent (SGD) with shuffling; we will then briefly talk about some new progress on minimax optimization methods.
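To make the theory-practice gap concrete, here is a minimal sketch of my own (not from the talk) that sets up the two samplers side by side on a least-squares problem: with-replacement sampling, which most classical analyses assume, versus random reshuffling, which practical implementations use.

import numpy as np

rng = np.random.default_rng(0)
n, d = 512, 20
A = rng.normal(size=(n, d))
x_star = rng.normal(size=d)
b = A @ x_star + 0.1 * rng.normal(size=n)

def run_sgd(sampler, epochs=50, lr=0.01):
    x = np.zeros(d)
    for _ in range(epochs):
        for i in sampler():
            g = A[i] * (A[i] @ x - b[i])          # per-sample gradient of 0.5 * (a_i . x - b_i)^2
            x = x - lr * g
    return np.mean((A @ x - b) ** 2)

with_replacement = lambda: rng.integers(0, n, size=n)   # i.i.d. sampling (typical in theory)
reshuffling = lambda: rng.permutation(n)                # fresh permutation each epoch (practice)
print("i.i.d. with replacement:", run_sgd(with_replacement))
print("random reshuffling     :", run_sgd(reshuffling))

On many problems random reshuffling ends up at a comparable or slightly lower error for the same step size; the convergence analyses discussed in the talk aim to explain and quantify this kind of behavior rigorously.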
Transport maps characterize probability distributions by coupling random variables using a deterministic transformation. While bijective transport maps enable sampling and density evaluations of the transformed random variable, learning the parameters of such transformations in high dimensions is challenging given few samples from an unknown target distribution. Moreover, structural choices for these transformations can have a significant impact on their performance and on the optimization procedure for finding these maps. In this talk, we will present a framework for representing and learning monotone triangular maps via invertible transformations of smooth functions, and will first demonstrate that the associated optimization problem has no spurious local minima, i.e., all local minima are global minima. Second, we propose a sample-efficient adaptive algorithm that constructs a sparse approximation of the map. We demonstrate how this framework can be applied to joint and conditional density estimation, likelihood-free inference, and structure learning of directed graphical models, with stable generalization performance across a range of sample sizes.
First-order optimization algorithms are widely used today. Two standard building blocks in these algorithms are proximal operators (proximals) and gradients. Although gradients can be computed for a wide array of functions, explicit proximal formulas are only known for limited classes of functions. We provide an algorithm, HJ-Prox, for accurately approximating such proximals. This is derived from a collection of relations between proximals, Moreau envelopes, Hamilton-Jacobi (HJ) equations, heat equations, and importance sampling. In particular, HJ-Prox smoothly approximates the Moreau envelope and its gradient. The smoothness can be adjusted to act as a denoiser. Our approach applies even when functions are only accessible by (possibly noisy) blackbox samples. Our approach can also be embedded into a zero-order algorithm with guaranteed convergence to global minima, assuming continuity of the objective function; this is done by leveraging connections between the gradient of the Moreau envelope and the proximal operator. We show HJ-Prox is effective numerically via several examples.
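Here is a hedged, minimal sketch of this style of Monte Carlo proximal approximation (my own rendering of the softmax-weighted, Moreau-envelope-smoothing idea, not the authors' released code), sanity-checked against the exact soft-thresholding proximal of f(y) = |y|.

import numpy as np

def hj_style_prox(f, x, t=1.0, delta=0.1, num_samples=200000, rng=None):
    # Approximate prox_{t f}(x) by smoothing the Moreau envelope: sample y ~ N(x, t*delta),
    # weight by exp(-f(y)/delta), and return the weighted mean. Smaller delta sharpens the
    # approximation but makes the sampling harder (a bias/variance trade-off).
    if rng is None:
        rng = np.random.default_rng(0)
    y = x + np.sqrt(t * delta) * rng.normal(size=num_samples)
    logw = -f(y) / delta
    w = np.exp(logw - logw.max())                  # numerically stabilized softmax weights
    return np.sum(w * y) / np.sum(w)

f = np.abs                                          # exact prox of t*|.| is soft-thresholding
t = 0.7
for x in [-2.0, 0.3, 1.5]:
    exact = np.sign(x) * max(abs(x) - t, 0.0)
    approx = hj_style_prox(f, x, t=t)
    print(f"x={x:+.1f}  exact prox={exact:+.4f}  sampled estimate={approx:+.4f}")

The estimates agree with the exact proximal up to a bias controlled by the smoothing parameter, and the same recipe only requires (possibly noisy) blackbox evaluations of f, which is the regime emphasized in the talk.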
Policy gradient algorithms are amongst the most widely used reinforcement learning methods and have driven a lot of recent empirical success in applications of reinforcement learning, especially with deep-neural-network-based policies. These methods apply to complex, poorly understood control problems by performing stochastic gradient descent over a parameterized class of policies. However, even for simple control problems solvable by classical dynamic programming techniques, policy gradient algorithms face non-convex optimization problems and are widely understood to converge only to a stationary point. We make fundamental progress in developing a theory for policy gradient methods, providing insights into when and why policy gradients are guaranteed to converge to globally optimal policies. In particular, we identify structural properties of the underlying Markov decision process (MDP) which ensure that, despite non-convexity, the policy gradient objective function has no suboptimal stationary points. We also show that the policy gradient objective satisfies a Polyak-Łojasiewicz (gradient dominance) condition when some of these conditions are strengthened, yielding convergence rates. When some of these conditions are relaxed, we provide bounds on the optimality gap of any stationary point. Our results leverage a key connection to policy iteration, a classic and well-studied dynamic programming algorithm, via a policy gradient theorem.
The parameter space for any fixed architecture of neural networks serves as a proxy during training for the associated class of functions - but how faithful is this representation? For any fixed feedforward ReLU network architecture with at least one hidden layer, it is well-known that many different parameter settings can determine the same function. It is less well-known that the degree of this redundancy is inhomogeneous across parameter space. This inhomogeneity should impact the dynamics of training via gradient descent, especially when compared with recent work suggesting that SGD favors flat minima of the loss landscape.
In this talk, I will carefully define the notion of the local functional dimension of a feedforward ReLU network function, discuss the relationship between local functional dimension of a parameter and the geometry of the underlying decomposition of the domain into linear regions, and present some preliminary experimental results suggesting that the probability distribution of the functional dimension at initialization is both interesting and architecture-dependent. Some of this work is joint with Kathryn Lindsey, Rob Meyerhoff, and Chenxi Wu, and some is joint with Kathryn Lindsey and David Rolnick.
We revisit the challenging problem of training Gaussian-Bernoulli restricted Boltzmann machines (GRBMs), introducing two innovations. We propose a novel Gibbs-Langevin sampling algorithm that outperforms existing methods like Gibbs sampling. We also modify the contrastive divergence (CD) algorithm in two ways: 1) adding variance-dependent initial step sizes for negative sampling; 2) drawing initial negative samples from Gaussian noise. We show that this modified CD, along with gradient clipping, is enough to robustly train GRBMs with large learning rates, removing the need for various tricks in the literature. Moreover, it enables GRBMs to generate samples starting from noise, thus allowing direct comparisons with deep generative models and improving evaluation protocols in the RBM literature. Experiments on Gaussian mixtures, MNIST, FashionMNIST, and CelebA show that GRBMs can generate good samples despite their single-hidden-layer architecture. Our code is released at https://github.com/lrjconan/GRBM.
The recipe behind the success of deep learning has been the combination of neural networks and gradient-based optimization. Understanding the behavior of gradient descent, however, and particularly its instability, has lagged behind its empirical success. To add to the theoretical tools available to study gradient descent, we propose the principal flow (PF), a continuous-time flow that approximates gradient descent dynamics. To our knowledge, the PF is the only continuous flow that captures the divergent and oscillatory behaviors of gradient descent, including escaping local minima and saddle points. Through its dependence on the eigendecomposition of the Hessian, the PF sheds light on the recently observed edge of stability phenomena in deep learning. Using our new understanding of instability, we propose a learning rate adaptation method which enables us to control the trade-off between training stability and test set evaluation performance.
Conventionally, the stability of stochastic gradient descent (SGD) is understood through a linear stability analysis, where the mean and variance of the parameter or the gradients are examined to determine the stability of SGD close to a stationary point. In this seminar, we discuss the limitations of linear stability theories and motivate a new notion of stability, which we call the probabilistic stability. We first explain why this notion of stability is especially suitable for understanding SGD at a large learning rate and a small batch size in toy problems. Then, with this new notion of stability, we study the implicit bias of SGD and show that SGD at a large learning rate converges to low-rank saddles in matrix factorization problems.
The talk is mainly based on the following two works:
[1] Liu Ziyin, Botao Li, James B. Simon, Masahito Ueda. SGD with a Constant Large Learning Rate Can Converge to Local Maxima. ICLR 2022.
[2] The Probabilistic Stability of SGD. (tentative title, in preparation)
Ordinary differential equations (ODEs) are widely used to model biological and physical processes in science. In this talk, I will discuss a new reproducing-kernel-based approach for estimation and inference of ODEs given noisy observations. We do not assume the functional forms in the ODE to be known or restrict them to be linear or additive, and we allow pairwise interactions. We construct confidence intervals for the estimated signal trajectories, and establish the estimation optimality and selection consistency of kernel ODE under both the low-dimensional and high-dimensional settings, where the number of unknown functionals can be smaller or larger than the sample size. Our proposal builds upon the smoothing spline analysis of variance (SS-ANOVA) framework, but tackles several important problems that are not yet fully addressed, and thus extends the scope of existing SS-ANOVA as well. Our proposal is also advantageous, in terms of statistical inference with noisy observations, compared to existing physics-informed neural networks and sparsity-based methods.
Existing analyses of neural network training often operate under the unrealistic assumption of an extremely small learning rate. This lies in stark contrast to practical wisdom and empirical studies, such as the work of J. Cohen et al. (ICLR 2021), which exhibit startling new phenomena (the "edge of stability" or "unstable convergence") and potential benefits for generalization in the large learning rate regime. Despite a flurry of recent works on this topic, however, the latter effect is still poorly understood. In this work, we take a step towards understanding genuinely non-convex training dynamics with large learning rates by performing a detailed analysis of gradient descent for simplified models of two-layer neural networks. For these models, we provably establish the edge of stability phenomenon and discover a sharp phase transition for the step size below which the neural network fails to learn "threshold-like" neurons (i.e., neurons with a non-zero first-layer bias). This elucidates one possible mechanism by which the edge of stability can in fact lead to better generalization, as threshold neurons are basic building blocks with useful inductive bias for many tasks.
The MBO scheme is a highly performant method for data clustering. Given some data, one constructs the similarity graph associated with the data points. The goal is to split the data into meaningful clusters. The algorithm produces the clusters by alternating between diffusion on the graph and pointwise thresholding. In this talk I will present the first theoretical studies of the scheme in the large-data limit. We will see how the final state of the algorithm is asymptotically related to minimal surfaces in the data manifold, and how the dynamics of the scheme are asymptotically related to the trajectory of steepest descent for surfaces, namely mean curvature flow. The tools employed are variational methods and viscosity-solution techniques. Based on joint work with Tim Laux (U Bonn).
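To fix ideas, here is a minimal sketch of a graph MBO iteration on toy data (my own simplified version with ad hoc parameters, including a mean-subtraction step to keep the two-class split balanced; the talk concerns the rigorous large-data limit of such schemes): alternate a short heat diffusion on the similarity graph with pointwise thresholding.

import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)

# Toy data: two 2-D Gaussian blobs.
n = 60
X = np.vstack([rng.normal(loc=(-2.5, 0.0), scale=0.5, size=(n // 2, 2)),
               rng.normal(loc=(+2.5, 0.0), scale=0.5, size=(n // 2, 2))])

# Similarity graph and its (unnormalized) Laplacian.
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-D2 / 0.5); np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W

# MBO iterations: diffuse the label indicator, recenter, then threshold at 0.
u = rng.choice([-1.0, 1.0], size=n)                 # random initial partition
heat = expm(-0.5 * L)                               # diffusion operator exp(-tau * L), tau = 0.5
for _ in range(10):
    v = heat @ u                                    # diffusion on the graph
    v = v - v.mean()                                # remove the constant mode (balanced split)
    u = np.sign(v)                                  # pointwise thresholding

true = np.array([-1.0] * (n // 2) + [1.0] * (n // 2))
acc = max(np.mean(u == true), np.mean(u == -true))  # clustering is defined up to label swap
print("clustering accuracy:", acc)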
Modern machine learning methods, in particular deep learning approaches, have enjoyed unparalleled success in a variety of challenging application fields like image recognition, medical image reconstruction, and natural language processing. While a vast majority of previous research in machine learning mainly focused on constructing and understanding models with high predictive power, consensus has emerged that other properties like stability and robustness of models are of equal importance and in many applications essential. This has motivated researchers to investigate the problem of adversarial training (or how to make models robust to adversarial attacks), but despite the development of several computational strategies for adversarial training and some theoretical development in the broader distributionally robust optimization literature, there are still several theoretical questions about it that remain relatively unexplored. In this talk, I will take an analytical perspective on the adversarial robustness problem and explore three questions: 1) What is the connection between adversarial robustness and inverse problems? 2) Can we use analytical tools to find lower bounds for adversarial robustness problems? 3) How do we use modern tools from analysis and geometry to solve adversarial robustness problems? At its heart, this talk is an invitation to view adversarial machine learning through the lens of mathematical analysis, showcasing a variety of rich connections with perimeter minimization problems, optimal transport, mean field PDEs of interacting particle systems, and min-max games in spaces of measures.
Empirical observation of high dimensional phenomena, such as the double descent behavior, has attracted a lot of interest in understanding classical techniques such as kernel methods, and their implications to explain generalization properties of neural networks that operate close to kernel regime. Many recent works analyze such models in a certain high-dimensional regime where the covariates are generated by independent sub-Gaussian random variables transformed through a covariance matrix and the number of samples and the number of covariates grow at a fixed ratio (i.e. proportional asymptotics). In this work we show that for a large class of kernels, including the neural tangent kernel of fully connected networks, kernel methods can only perform as well as linear models in this regime.
More surprisingly, when the data is generated by a Gaussian process model where the relationship between input and the response could be very nonlinear, we show that linear models are in fact optimal, i.e. linear models achieve the minimum risk among all models, linear or nonlinear.
These results suggest that more complex models for the data other than independent features are needed for high-dimensional analysis.
Large-scale optimization is at the forefront of modern data science, scientific computing, and applied mathematics, with areas of interest including high-dimensional PDEs, inverse problems, and machine learning. First-order methods are workhorses for large-scale optimization due to their modest computational cost and simplicity of implementation. Nevertheless, these methods are often agnostic to the structural properties of the problem under consideration and suffer from slow convergence, becoming trapped in bad local minima, etc. Natural gradient descent is an acceleration technique in optimization that takes advantage of the problem's geometric structure and preconditions the objective function's gradient by a suitable "natural" metric. Hence, parameter update directions correspond to steepest descent on a corresponding "natural" manifold instead of the Euclidean parameter space, rendering a parametrization-invariant descent direction on that manifold. Despite its success in statistical inference and machine learning, the natural gradient descent method is far from a mainstream computational technique due to the computational complexity of calculating and inverting the preconditioning matrix. This work aims at a unified computational framework that streamlines the computation of general natural gradient flows via the systematic application of efficient tools from numerical linear algebra. We obtain efficient and robust numerical methods for natural gradient flows without directly calculating, storing, or inverting the dense preconditioning matrix. We treat Euclidean, Wasserstein, Sobolev, and Fisher–Rao natural gradients in a single framework for a general loss function.
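A small sketch of the matrix-free flavor of this idea (my own toy example using the Fisher metric of logistic regression, not the authors' general framework): precondition the gradient by solving F v = g with conjugate gradients, using only Fisher-vector products, so the dense metric is never formed, stored, or inverted.

import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
n, d = 500, 50
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d) / np.sqrt(d)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
y = (rng.random(n) < sigmoid(X @ w_true)).astype(float)

w = np.zeros(d)
for _ in range(20):
    p = sigmoid(X @ w)
    grad = X.T @ (p - y) / n                                   # gradient of the logistic loss
    s = p * (1.0 - p)
    fisher_vec = lambda v: X.T @ (s * (X @ v)) / n + 1e-6 * v  # F v via two mat-vecs (+ damping)
    F = LinearOperator((d, d), matvec=fisher_vec)
    v, _ = cg(F, grad, maxiter=50)                             # natural direction F^{-1} grad, matrix-free
    w = w - v                                                  # natural gradient step (unit step size)

acc = np.mean((sigmoid(X @ w) > 0.5) == (y > 0.5))
print("training accuracy after natural gradient steps:", acc)

For logistic regression the Fisher information coincides with the Hessian of the loss, so this step is Newton-like; for general losses and general metrics (Wasserstein, Sobolev, Fisher-Rao) the point of the framework in the talk is to make the corresponding metric-vector products, and hence this kind of iteration, efficient.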
Neural networks are powerful feature extractors - but which features do they extract from their data? And how does the structure of the training data shape the representations they learn? We discuss two recent works in this direction. We first develop a simple model for images and show that a neural network trained on these images can learn a convolution from scratch [1]. This pattern-formation process is driven by a combination of translation-invariance of the "images" and the non-Gaussian, higher-order statistics of the inputs. Second, we conjecture a "distributional simplicity bias" whereby neural networks learn increasingly complex distributions of their inputs during training. We present analytical and experimental evidence for this conjecture, going from a simple perceptron up to deep ResNets [2].
References:
[1] Ingrosso & Goldt, PNAS 119 (40), e2201854119, arXiv:2202.00565.
[2] Refinetti, Ingrosso & Goldt, in preparation.
We study the optimization landscape of deep linear neural networks with the square loss. It is known that, under weak assumptions, there are no spurious local minima and no local maxima. However, the existence and diversity of non-strict saddle points, which can play a role in first-order algorithms' dynamics, have only been lightly studied. We go a step further with a full analysis of the optimization landscape at order 2. We characterize, among all critical points, which are global minimizers, strict saddle points, and non-strict saddle points. We enumerate all the associated critical values. The characterization involves conditions on the ranks of partial matrix products, and sheds some light on global convergence or implicit regularization that have been proved or observed when optimizing linear neural networks. In passing, we provide an explicit parameterization of the set of all global minimizers and exhibit large sets of strict and non-strict saddle points.
We will study the dynamics of Gradient Flow when training a depth 2 ReLU network on univariate data in a binary classification setting. Assuming our data is labelled by a teacher network with width , we will show a convergence guarantee that holds if the learned student network is sufficiently over-parameterized. Moreover, we will show that the implicit bias in our setting is such that the student network converges to a network with at most linear regions, which results in an intuitive generalization bound. Overall, this provides an end-to-end learnability result in this setting, demonstrating the interplay between optimization and generalization: Over-parameterization facilitates optimization, while the implicit bias prevents overfitting.
Understanding the fundamental principles behind the massive success of neural networks is one of the most important open questions in deep learning. However, due to the highly complex nature of the problem, progress has been relatively slow. In this note, through the lens of infinite-width networks, a.k.a. neural kernels, we present one such principle resulting from hierarchical localities. It is well-known that the eigenstructure of infinite-width multilayer perceptrons (MLPs) depends solely on the concept frequency, which measures the order of interactions. We show that the topologies from deep convolutional networks (CNNs) restructure the associated eigenspaces into finer subspaces. In addition to frequency, the new structure also depends on the concept space, which measures the spatial distance among nonlinear interaction terms. The resulting fine-grained eigenstructure dramatically improves the network's learnability, empowering them to simultaneously model a much richer class of interactions, including Long-Range-Low-Frequency interactions, Short-Range-High-Frequency interactions, and various interpolations and extrapolations in-between. Additionally, model scaling can improve the resolutions of interpolations and extrapolations and, therefore, the network's learnability. Finally, we prove a sharp characterization of the generalization error for infinite-width CNNs of any depth in the high-dimensional setting. Two corollaries follow: (1) infinite-width deep CNNs can break the curse of dimensionality without losing their expressivity, and (2) scaling improves performance in both the finite and infinite data regimes.
Given one optimization problem, consider pairing it with another by smoothly parametrizing the domain. This is done either for practical purposes (e.g., to use smooth optimization algorithms with good guarantees) or for theoretical purposes (e.g., to reveal that the landscape satisfies a strict saddle property). In both cases, the central question is: how do the landscapes of the two problems relate? Surprisingly, the relation is often determined by the parametrization itself; it is almost entirely independent of the cost function. In this talk, I will present a new geometric framework for studying parametrizations according to their effect on landscapes. Applications include: optimization over low-rank matrices and tensors by optimizing over a factorization; the Burer-Monteiro approach to semidefinite programs; training neural networks by optimizing over their weights and biases; and quotienting out symmetries. Joint with Eitan Levin (Caltech) and Nicolas Boumal (EPFL).
Modern neural networks are usually over-parameterized—the number of parameters exceeds the number of training data. In this case, the loss functions tend to have many (or even infinitely many) global minima, which imposes an additional challenge of minima selection on optimization algorithms besides convergence. Specifically, when training a neural network, the algorithm not only has to find a global minimum, but also needs to select minima with good generalization among many bad ones. In this talk, I will share a series of works studying the mechanisms that facilitate global minima selection by optimization algorithms. First, with a linear stability theory, we show that stochastic gradient descent (SGD) favors flat and uniform global minima. Then, we build a theoretical connection between flatness and generalization performance based on a special structure of neural networks. Next, we study the global minima selection dynamics—the process by which an optimizer leaves bad minima for good ones—in two settings. For a manifold of minima around which the loss function grows quadratically, we derive effective exploration dynamics on the manifold for SGD and Adam, using a quasistatic approach. For a manifold of minima around which the loss function grows subquadratically, we study the behavior and effective dynamics of GD, which also explains the edge of stability phenomenon.
Dynamic processes have been modeled successfully for hundreds of years, often using ordinary or partial differential equations. Using data-driven methods, these processes can now also be inferred directly from measurements. In this talk, I will provide an overview of our work in this direction. I will discuss learning differential equations on reduced spaces, how to utilize numerical integration schemes to train neural networks for stochastic dynamics, and close with an alternative view on system identification with the Koopman operator framework.
Deep learning comes with excessive demands for data. In this talk, I will present my recent work on showing that not all data is necessary for training an accurate predictor. In particular, one can drop “easy-to-learn” examples, and do just as well as learning on all of the data. Given this disparate “importance” of training data on generalization, I will present empirical analysis of the loss landscape derived from different subsets of the training examples. I will then look into how the training dynamics are influenced by easy versus hard data.
Wasserstein GANs (WGANs) are based on the idea of minimising the Wasserstein distance between a real and a generated distribution. In this talk, we provide an in-depth mathematical analysis of differences between the theoretical setup and the reality of training WGANs. We gather both theoretical and empirical evidence that the WGAN loss is not a meaningful approximation of the Wasserstein distance. Moreover, we argue that the Wasserstein distance is not even a desirable loss function for deep generative models, and conclude that the success of WGANs can be attributed to a failure to approximate the Wasserstein distance.
This talk will discuss two aspects of optimization geared towards improving generalization in deep learning: i) the regularization effect of SGD, and ii) model averaging. The first part of the talk will describe a mechanism that captures the implicit *Regularization Effect of SGD*. We show that it depends on the Fisher information matrix and replicate SGD's implicit regularization effect with an explicit form. We will then delve into OOD generalization, where we demonstrate that the difference in performance of deep models on in-domain validation sets and distribution-shifted test sets varies in a chaotic manner during training with ERM. This leads to poor model selection when using in-domain validation data for early stopping, and makes the trained model unreliable. We discuss this problem and propose *Model Averaging* as a mitigation, leading to state-of-the-art performance on the DomainBed benchmark. I will conclude with a preview of my future research directions in the broad area of OOD generalization.
Bio: I am currently a senior research scientist at Salesforce AI Research. Prior to this, I was a postdoc at Mila with Prof. Yoshua Bengio. I received my PhD and M.S. degrees in Computer Science and Engineering from the University at Buffalo, and a B.S. degree in Electrical Engineering from IIT-BHU, India. I am interested in representation learning analysis and algorithms for supervised/self-supervised learning, generative modeling, and, more recently, out-of-distribution generalization. My interests also lie at the intersection of optimization and generalization in deep learning, specifically aspects of deep learning optimization that improve generalization. I have recently also worked on time series analytics, in which I designed deep learning algorithms for forecasting and anomaly detection. I am currently developing a toolbox for causal discovery for tabular and time series data, as well as conducting research in this area.
Many applications involve non-Euclidean data, such as graphs, strings, or matrices. Exploiting such geometric structure in optimization can deliver algorithms that are computationally superior to standard nonlinear programming approaches. This observation has resulted in an increasing interest in Riemannian methods in the optimization and machine learning community. In this talk, we consider the problem of optimizing a function on a (Riemannian) manifold subject to convex constraints. Several classical problems that arise in Machine Learning applications can be phrased as constrained optimization on manifolds. This includes matrix-valued tasks, such as barycenter problems, as well as the computation of Brascamp-Lieb constants. The latter is of central importance in many areas of mathematics and computer science through connections to maximum likelihood estimators in Gaussian models, Tyler’s M-estimator of scatter matrices and Operator Scaling. We introduce Riemannian Frank-Wolfe methods, a class of first-order methods for solving constrained optimization problems on manifolds and present a global, non-asymptotic convergence analysis. We further discuss a class of CCCP-style algorithms for Riemannian “difference of convex” functions and explore connections to constrained optimization. For both methods we discuss applications to the two problems described above. Based on joint work with Suvrit Sra.
We introduce and explore the problem of Selectively Forgetting a particular subset of the data used for training a deep neural network. We propose an initial method for "scrubbing" the weights clean by minimizing the Forgetting Lagrangian, such that any probing function of the weights is indistinguishable from the same function applied to the weights of a network trained without the data to be forgotten. This method is improved upon by using a deterministic update of the linearized version of the model inspired by NTKs, which enables us to bound the extracted information in a black-box setting. This inspired us to propose Linear Quadratic Fine-tuning (LQF), which is the first method for linearizing a pre-trained model that achieves comparable performance to non-linear fine-tuning. LQF allows us to exploit the strength of deep neural networks while enjoying the theoretical properties of convex optimization. Using LQF, we introduce a novel notion of forgetting in the Mixed-Privacy setting, which enables us to perform forgetting for large-scale vision tasks while providing theoretical guarantees. We further extend this mixed-privacy setting to Differential Privacy (DP) and introduce AdaMix, an adaptive DP algorithm which exploits few-shot public data to improve the privacy trade-off for practical vision tasks.
In this talk I am going to discuss two problems that Message Passing Neural Networks (MPNNs) have been shown to struggle with. The first one -- known as over-squashing -- is unavoidable in the MPNN class and concerns the input graph topology: it relates to how information propagates in a graph. We show that discrete curvature quantities (old and new) can help us understand where messages are being lost, and we can provably characterize the over-squashing phenomenon in terms of curvature. The second problem consists in analysing GNNs as multi-particle dynamics through the lens of gradient flows of an energy. We investigate what happens when, instead of learning the MPNN equations, we learn an energy and then let the equations follow the gradient flow of this energy. This allows us to further understand the role of the channel-mixing matrix that is ubiquitous in standard graph convolutional models as a bilinear potential inducing both attraction and repulsion along edges via its positive and negative eigenvalues, respectively.
Understanding how deep neural networks generalize remains notoriously challenging in theory. This talk will motivate and examine a simpler question, namely the generalization of high-dimensional linear regression models, using several high-performing, real-world neural-network-induced settings as an empirical testbed (e.g., the empirical NTK of a pretrained ResNet applied to CIFAR-100). We find that, perhaps surprisingly, even in these linear settings, most existing theoretical analyses for linear/kernel regression fail to qualitatively capture the empirical generalization phenomena. On the other hand, a random matrix theory hypothesis gives rise to an estimator that accurately predicts generalization. Based on https://arxiv.org/abs/2203.06176 (ICML 2022). Bio: Wei Hu is a postdoc at UC Berkeley and an incoming assistant professor at the University of Michigan. He obtained his Ph.D. from Princeton University advised by Sanjeev Arora, and his bachelor's degree from Tsinghua University. He is interested in the theoretical and scientific foundations of modern machine learning, with a particular focus on deep learning.
Kernel ridge regression (KRR) has recently attracted renewed interest due to its potential for explaining transient effects, such as double descent, that emerge during neural network training. In this work, we study how the alignment between the target function and the kernel affects the performance of KRR. We focus on truncated KRR (TKRR), which utilizes an additional parameter that controls the spectral truncation of the kernel matrix. We show that for polynomial alignment, there is an over-aligned regime in which TKRR can achieve a faster rate than what is achievable by full KRR. The rate of TKRR can improve all the way to the parametric rate, while that of full KRR is capped at a sub-optimal value. This shows that target alignment can be better leveraged by utilizing spectral truncation in kernel methods. We also consider the bandlimited alignment setting and show that the regularization surface of TKRR can exhibit transient effects, including multiple descent and non-monotonic behavior. Our results show that there is a strong and quantifiable relation between the shape of the alignment spectrum and the generalization performance of kernel methods, both in terms of rates and in finite samples.
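For readers who want to see one natural form of spectral truncation, here is a minimal numpy sketch of a truncated KRR fit, using an RBF kernel, a synthetic 1-D target, and illustrative values for the ridge parameter and truncation level; this is a plausible instantiation of the idea, not the exact estimator analyzed in the talk.

```python
import numpy as np

def rbf_kernel(A, B, gamma=10.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(200)

K = rbf_kernel(X, X)
lam, r = 1e-3, 20                               # ridge parameter and truncation level (illustrative)
evals, evecs = np.linalg.eigh(K)
evals, evecs = evals[::-1], evecs[:, ::-1]      # sort eigenpairs in decreasing order
Kr = evecs[:, :r] @ np.diag(evals[:r]) @ evecs[:, :r].T   # rank-r truncation of the kernel matrix
alpha = np.linalg.solve(Kr + len(y) * lam * np.eye(len(y)), y)
y_hat = Kr @ alpha                              # in-sample truncated-KRR prediction
print(float(np.mean((y_hat - y) ** 2)))
```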
Over-parameterized machine learning models lead to excellent performance in a variety of applications. In this talk we consider the effect of over-parameterization on stochastic optimization algorithms. We discuss how over-parameterization allows us to use a constant step size within stochastic gradient methods, and that this leads to a faster convergence rate. We also present algorithms with provably-faster convergence rates in the over-parameterized setting. Finally, we discuss how over-parameterization allows us to update the learning rate during the training procedure which leads to improved performance over a variety of previous approaches.
While over-parameterization is widely believed to be crucial for the success of optimization for the neural networks, most existing theories on over-parameterization do not fully explain the reason -- they either work in the Neural Tangent Kernel regime where neurons don't move much, or require an enormous number of neurons. In this talk I will describe our recent works towards understanding training dynamics that go beyond kernel regimes with only polynomially many neurons (mildly overparametrized). In particular, we first give a local convergence result for mildly overparametrized two-layer networks. We then analyze the global training dynamics for a related overparametrized tensor model. For both works, we rely on a key intuition that neurons in overparametrized models work in groups and it's important to understand the behavior of an average neuron in the group. Based on two works: https://arxiv.org/abs/2102.02410 and https://arxiv.org/abs/2106.06573
In this talk we discuss almost sure convergence of Stochastic Gradient Descent in discrete and continuous time for a given twice continuously-differentiable target function F. In a first step, we give assumptions on the step sizes and perturbation size to ensure convergence of the target value F and the gradient of F, assuming that grad F is locally Hölder-continuous. This result entails convergence of the iterates themselves in the case where F does not possess a continuum of critical points. In a general non-convex setting with F possibly containing a rich set of critical points, convergence of the process itself is sometimes taken for granted, but it is actually a non-trivial issue, as there are solutions to the gradient flow ODE for smooth target functions that stay in a compact set but do not converge. Using the Lojasiewicz inequality, we give sharp bounds on the step sizes and the size of the perturbation in order to guarantee convergence of the SGD scheme for analytic target functions. Also, we derive the convergence rate of the function value under the assumption that F satisfies a particular Lojasiewicz inequality with exponent in [1/2,1). Finally, we compare the discrete and continuous time results and discuss optimality of the assumptions. This is joint work with Steffen Dereich (WWU Münster).
There is an increasing realization that algorithmic inductive biases are central in preventing overfitting; empirically, we often see that machine learning models trained by stochastic gradient descent (SGD) generalize well, even in overparameterized settings and under little to no explicit regularization. This work considers this issue in arguably the most basic setting: averaged or last-iterate SGD for linear regression in the overparameterized regime. Our main result provides a sharp excess risk bound, stated in terms of the full eigenspectrum of the data covariance matrix, that reveals a bias-variance decomposition characterizing when generalization is possible: (i) the variance bound is characterized in terms of an effective dimension (specific to SGD), and (ii) the bias bound is characterized in terms of the optimal model parameter and how it aligns with the data covariance matrix. Moreover, for SGD with iterate averaging, we prove a matching lower bound (up to constant factors); for last-iterate SGD (with decaying stepsizes), we provide a matching lower bound (up to constant factors) when the signal-to-noise ratio is bounded. The above results have the following additional implications: (1) We compare the implicit regularization afforded by SGD with the explicit regularization of ridge regression. We find that, up to logarithmic factors, the generalization performance of SGD is always no worse than that of ridge regression in a wide range of overparameterized problems, and, in fact, could be much better for some problem instances. (2) We study the effects of transfer learning with pretraining and finetuning in the presence of covariate shift. We find that, for a large class of linear regression instances, transfer learning with O(N^2) source data (and scarce or no target data) is as effective as supervised learning with N target data. References: https://arxiv.org/abs/2103.12692, https://arxiv.org/abs/2108.04552, https://arxiv.org/abs/2110.06198
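To make the setting concrete, here is a toy numpy run of tail-averaged SGD for overparameterized linear regression (fewer samples than dimensions, decaying covariance spectrum). The data model, step size, and averaging window are illustrative choices, not those from the paper's analysis.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 500
X = rng.standard_normal((n, d)) * (np.arange(1, d + 1) ** -0.5)  # decaying covariance spectrum
w_star = rng.standard_normal(d) / np.sqrt(d)
y = X @ w_star + 0.1 * rng.standard_normal(n)

w = np.zeros(d)
w_avg, count = np.zeros(d), 0
step = 0.1
for t in range(20 * n):
    i = rng.integers(n)
    grad = (X[i] @ w - y[i]) * X[i]               # stochastic gradient of the squared loss
    w -= step * grad
    if t >= 10 * n:                               # average only the tail of the trajectory
        count += 1
        w_avg += (w - w_avg) / count

print(float(np.mean((X @ w_avg - y) ** 2)))       # training risk of the averaged iterate
```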
Models built with deep neural networks (DNNs) can handle complicated real-world data extremely well, without suffering from the curse of dimensionality or from the difficulties of non-convex optimization. To contribute to the theoretical understanding of deep learning, we investigate the nonparametric aspects of DNNs by addressing the following questions: (i) what kind of data can be best learned by deep neural networks? (ii) can deep neural networks achieve statistical optimality? (iii) is there any algorithmic guarantee for obtaining such optimal neural networks? Our theoretical analysis applies to the two most fundamental setups in practice: regression and classification.
Designing deep learning models for practical applications is an art that often involves an expensive search over candidate architectures. We develop novel frameworks to facilitate the process of designing deep learning models via three principled approaches: optimization, numerical analysis, and statistical modeling. First, using optimization as a tool, we develop new families of momentum-based deep learning models, including deep neural networks and neural ordinary differential equations, that take advantage of momentum methods to improve convergence speed and the ability to capture long-range dependencies. Second, we employ numerical methods, such as diffusion with a source term, to develop new graph neural networks with better accuracy and efficiency. Finally, we develop new generative models that help explain the attention mechanism in transformers and suggest new directions to improve the performance of these models.
Modern machine learning models, particularly those used in deep networks, are characterized by massive numbers of parameters trained on large data sets. While these large-scale models have had tremendous practical successes, developing theoretical methods that can rigorously explain when and why these models work has been a major issue in the field. This task is made even harder by the non-convexity of the underlying learning problems. In this talk, we shed light on the theoretical understanding of the asymptotics of learning for two popular neural network models, namely, Generalized Linear Models (GLMs) and Recurrent Neural Networks (RNNs). First, we investigate the generalizability of single-layer neural networks (i.e., GLMs) over previously unseen data. We provide a general framework to characterize the asymptotic generalization error for GLMs with arbitrary non-linearities, making it applicable to regression as well as classification problems. This framework enables analyzing the effect of (i) over-parameterization and non-linearity during modeling; (ii) choices of the loss function, initialization, and regularizer during learning; and (iii) mismatch between training and test distributions. We also rigorously and analytically explain the double descent phenomenon in generalized linear models. Secondly, we explore the learning dynamics of recurrent neural networks under gradient descent. We focus on a subclass of RNNs with linear activations and provide precise reasoning for the common knowledge suggesting that RNNs do not perform well on tasks requiring long-term memory. Using recently-developed kernel regime analysis, our main result shows that linear RNNs learned from random initializations are functionally equivalent to a certain weighted 1D-convolutional network. Importantly, the weightings in the equivalent model cause an implicit bias towards elements with smaller time lags in the convolution and hence shorter memory. We show that the degree of this bias depends on the variance of the transition matrix at initialization. Our theories are validated with both synthetic and real data experiments.
We examine two data science questions motivated by two biological problems: (1) modeling time-varying microbial communities and (2) modeling shapes and surfaces for evolutionary applications. Question 1: Under what conditions are Bayesian methods for inference in deterministic dynamical systems with observation noise consistent? We will highlight the relation between Bayesian inference and a classical idea in dynamical systems called the thermodynamic formalism. We will provide a partial answer to Question 1 based on a variational characterization of a partition function. Question 2: Given a set of shapes, the synchronization problem is to consistently register or align the shapes. We develop a geometric framework that characterizes the synchronization problem, using classic tools from the theory of fibre bundles in the geometric characterization. We then generalize the graph-theoretic notion of a Laplacian to quantify obstructions to synchronization; specifically, we develop a twisted Hodge theory. We then state an algorithm that learns group actions and demonstrate its efficacy by clustering the molars of primates based on their shape. We will see that the clusters correspond to eating habits.
Modern neural networks often have great expressive power and can be trained to overfit the training data, while still achieving a good test performance. This phenomenon is referred to as “benign overfitting”. Recently, there emerged a line of works studying “benign overfitting” from the theoretical perspective. However, they are limited to linear models or kernel/random feature models, and there is still a lack of theoretical understanding about when and how benign overfitting occurs in neural networks. In this talk, I will give an answer to the above question. We precisely characterize the conditions under which benign overfitting can occur in training two-layer convolutional neural networks. We show that when the signal-to-noise ratio satisfies a certain condition, a two-layer CNN trained by gradient descent can achieve arbitrarily small training and test loss. On the other hand, when this condition does not hold, overfitting becomes harmful and the obtained CNN can only achieve a constant level of test loss. These together demonstrate a sharp phase transition between benign overfitting and harmful overfitting, driven by the signal-to-noise ratio. This talk is based on joint work with Yuan Cao, Mikhail Belkin, and Quanquan Gu.
The mysterious ability of neural networks to generalize is believed to stem from an implicit regularization — a tendency of gradient-based optimization to fit training data with predictors of low “complexity.” Despite vast efforts, a satisfying formalization of this intuition is lacking. In this talk I will present a series of works theoretically analyzing the implicit regularization in matrix and tensor factorizations, known to be equivalent to certain linear and non-linear neural networks, respectively. Through dynamical characterizations I will establish an implicit regularization towards low rank (for corresponding notions of rank), different from any type of norm minimization, in contrast to prior beliefs. I will then discuss implications of this finding to both theory (possible explanation for generalization over natural data) and practice (compression of neural network layers, novel regularization schemes). Overall, our results highlight the potential of ranks to explain and improve generalization in deep learning. Works covered in this talk were done in collaboration with Asaf Maman and Nadav Cohen.
In this talk, I will present some implications of the permutation-symmetry in neural networks on the shape of the loss landscape. I will first introduce a type of saddle point, so-called symmetry-induced saddles, emerging from a particular arrangement of neurons in deep neural networks. Then we will describe the precise geometry of the global minima manifold of overparameterized networks in a teacher-student setting. Counting the possible arrangements of neuron groups inside a neural network, we will give the numbers of symmetry-induced saddle manifolds and the components of the global minima manifold in terms of the student and the teacher widths. Our analysis shows that overparameterization gradually smoothens the landscape due to a faster scaling of the global minima manifold components than the symmetry-induced saddle manifolds. Yet the landscape exhibits roughness in mildly overparameterized networks; we empirically show that gradient-based training finds a zero-loss solution only for a fraction of initializations in this regime.
In the era of big data, the training data for machine learning models is commonly collected from multiple sources. Some of these sources might be unreliable (noisy, corrupted, or even manipulated). Can learning algorithms overcome this and still learn classifiers of optimal accuracy and, ideally, fairness? In my talk, I highlight recent results from our group that establish situations in which this is possible or impossible.
In this talk, I will discuss a remarkable phenomenon of wide neural networks: transition to linearity. The phenomenon can be described as follows: as the network width increases to infinity, the neural network becomes a linear model with respect to its parameters in any O(1)-neighborhood of the random initialization. This phenomenon underlies the widely known constancy of the neural tangent kernel of certain infinitely wide neural networks. Aiming to give a more intuitive understanding, I will describe, from a geometric point of view, how this somewhat unexpected phenomenon, as well as the constancy of the NTK, arises as a natural consequence of assembling a large number of submodels.
Reinforcement learning (RL) is notoriously sample inefficient, prompting a large body of prior work to study how unsupervised pretraining can improve sample efficiency when solving downstream RL tasks. One approach is unsupervised skill discovery, a class of algorithms that learn a set of policies without access to a reward function. While prior work has shown that these methods learn skills that can accelerate downstream RL tasks, it remains unclear whether these methods always learn useful skills. What does it even mean for a skill (or set of skills) to be provably useful? In this talk, I'll share some recent work that provides some of the first answers to this question: existing methods are optimal in one sense but very suboptimal in a different sense. These results have implications for how users might want to use skill learning algorithms, provide some (surprisingly simple!) tools for analysing skill learning methods, and suggest exciting opportunities for designing new, provably-useful skill learning algorithms.
Most practical algorithms for supervised machine learning boil down to optimizing the average performance over a training dataset. However, it is increasingly recognized that although the optimization objective is the same, the manner in which it is optimized plays a decisive role in the properties of the resulting predictor. For example, when training large neural networks, there are generally many weight combinations that will perfectly fit the training data. However, gradient-based training methods somehow tend to reach those which, on the one hand, do not overfit, and on the other hand, are very brittle to adversarially crafted examples. Why the dynamics of these methods lead to such "implicit biases" is still far from being fully understood. In this talk, I'll describe several recent theoretical results related to this question, in the context of benign overfitting and adversarial examples. Based on joint work with Gal Vardi and Gilad Yehudai.
We study the convergence properties of gradient descent for training deep linear neural networks, i.e., deep matrix factorizations, by extending a previous analysis for the related gradient flow. We show that, under suitable conditions on the step sizes, gradient descent converges to a critical point of the loss function, i.e., the square loss in this work. Furthermore, we demonstrate that for almost all initializations gradient descent converges to a global minimum in the case of two layers. In the case of three or more layers, we show that gradient descent converges to a global minimum on the manifold of matrices of some fixed rank, where the rank cannot be determined a priori.
Since the proposal of the graph neural network (GNN) by Gori et al. (2005) and Scarselli et al. (2008), one of the major problems in training GNNs has been their struggle to propagate information between distant nodes in the graph. We propose a new explanation for this problem: GNNs are susceptible to a bottleneck when aggregating messages across a long path. This bottleneck causes the over-squashing of exponentially growing information into fixed-size vectors. As a result, GNNs fail to propagate messages originating from distant nodes and perform poorly when the prediction task depends on long-range interaction. Initially, we hypothesized that thanks to their attention mechanism, Graph Attention Networks (GAT) are more resistant to over-squashing than GNNs that absorb incoming edges equally, such as GCN and GIN. However, we found that GAT computes a very limited kind of attention: the ranking of the attention scores is unconditioned on the query node. We formally define this restricted kind of attention as static attention and distinguish it from a strictly more expressive dynamic attention. Because GATs use a static attention mechanism, there are simple graph problems that GAT cannot express. To remove this limitation, we introduce a simple fix by modifying the order of operations and propose GATv2, which is now available as part of the PyTorch Geometric library, the Deep Graph Library, and the TensorFlow GNN library. The talk will focus on the following papers: On the Bottleneck of Graph Neural Networks and its Practical Implications (ICLR 2021): https://arxiv.org/pdf/2006.05205, and How Attentive are Graph Attention Networks? (ICLR 2022): https://arxiv.org/pdf/2105.14491
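To make the static vs. dynamic attention distinction concrete, here is a minimal numpy sketch of the two single-head scoring functions for one node pair, following the forms described in the GAT and GATv2 papers; the dimensions and random parameters are illustrative only, and this is not the library implementation.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

rng = np.random.default_rng(0)
d_in, d_out = 8, 16
W1 = rng.standard_normal((d_out, d_in))       # GAT weight matrix
a1 = rng.standard_normal(2 * d_out)           # GAT attention vector
W2 = rng.standard_normal((d_out, 2 * d_in))   # GATv2 weight matrix
a2 = rng.standard_normal(d_out)               # GATv2 attention vector

def gat_score(h_i, h_j):
    # GAT: LeakyReLU( a^T [W h_i || W h_j] ).
    # The nonlinearity comes last, so the ranking of scores over neighbors j
    # does not depend on the query node i ("static" attention).
    return leaky_relu(a1 @ np.concatenate([W1 @ h_i, W1 @ h_j]))

def gatv2_score(h_i, h_j):
    # GATv2: a^T LeakyReLU( W [h_i || h_j] ).
    # Applying the nonlinearity before the final linear map lets the ranking
    # of neighbors depend on the query node ("dynamic" attention).
    return a2 @ leaky_relu(W2 @ np.concatenate([h_i, h_j]))

h_i, h_j = rng.standard_normal(d_in), rng.standard_normal(d_in)
print(gat_score(h_i, h_j), gatv2_score(h_i, h_j))
```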
Since the celebrated works of Russo and Zou (2016, 2019) and Xu and Raginsky (2017), it has been well known that the generalization error of supervised learning algorithms can be bounded in terms of the mutual information between their input and the output, given that the loss of any fixed hypothesis has a subgaussian tail. In this work, we generalize this result beyond the standard choice of Shannon's mutual information to measure the dependence between the input and the output. Our main result shows that it is indeed possible to replace the mutual information by any strongly convex function of the joint input-output distribution, with the subgaussianity condition on the losses replaced by a bound on an appropriately chosen norm capturing the geometry of the dependence measure. This allows us to derive a range of generalization bounds that are either entirely new or strengthen previously known ones. Examples include bounds stated in terms of p-norm divergences and the Wasserstein-2 distance, which are respectively applicable for heavy-tailed loss distributions and highly smooth loss functions. Our analysis is entirely based on elementary tools from convex analysis by tracking the growth of a potential function associated with the dependence measure and the loss function.
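For reference, the classical bound being generalized here (Xu and Raginsky, 2017) can be stated as follows, with W the algorithm output, S the n-sample training set, L_mu the population risk, L_S the empirical risk, and the loss of every fixed hypothesis assumed sigma-subgaussian:

```latex
\bigl|\,\mathbb{E}\bigl[L_\mu(W) - L_S(W)\bigr]\bigr|
  \;\le\; \sqrt{\frac{2\sigma^2}{n}\, I(W;S)} .
```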
Feedforward neural networks explicitly describe a series of computations to be done given an input data point b. Implicit networks, on the other hand, describe a condition which must be met, e.g. a fixed point equation depending on b. Recent work has shown implicit networks can match the performance of feedforward networks on common machine learning tasks such as image recognition, while training using a fraction of the memory. In this talk we discuss recent advances in training implicit neural networks, as well as novel applications to problems in which the target output can naturally be thought of as a fixed point (e.g. game theory).
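A minimal sketch of the implicit-network idea described above: instead of computing layers feedforward, we solve the fixed-point condition z = tanh(W z + U b) by simple iteration, with W rescaled so that the map is a contraction. All sizes and parameters are illustrative, and real implicit networks use more sophisticated solvers and implicit differentiation for training.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 32, 16
W = rng.standard_normal((d, d))
W *= 0.9 / np.linalg.norm(W, 2)             # spectral rescaling => contraction mapping
U = rng.standard_normal((d, m))
b = rng.standard_normal(m)                  # the input data point

z = np.zeros(d)
for _ in range(100):                        # fixed-point iteration
    z_new = np.tanh(W @ z + U @ b)
    if np.linalg.norm(z_new - z) < 1e-8:
        break
    z = z_new

print(np.linalg.norm(np.tanh(W @ z + U @ b) - z))   # residual of the fixed point
```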
Understanding the properties of neural networks trained via gradient descent is at the heart of the theory of deep learning. In this talk, I will discuss two approaches to study the behavior of gradient descent methods. The first one takes a mean-field view and it relates the dynamics of stochastic gradient descent (SGD) to a certain Wasserstein gradient flow in probability space. I will show how this idea allows to study the connectivity, convergence and implicit bias of the solutions found by SGD. The second approach consists in the analysis of the Neural Tangent Kernel. I will present tight bounds on its smallest eigenvalue and show their implications on memorization and optimization in deep networks. Based on joint work with Adel Javanmard, Vyacheslav Kungurtsev, Andrea Montanari, Guido Montufar, Quynh Nguyen, and Alexander Shevchenko.
Deep reinforcement learning has enabled machines to solve complex control problems directly from high-dimensional camera inputs. However, these systems rely on carefully designed reward functions for specific tasks. On the other hand, humans learn about the world and perform complex behaviors without any external reward signal. We categorize the space of possible objective functions for embodied agents. We show a spectrum that reaches from narrow to general objectives. While the narrow objectives correspond to domain-specific rewards as typically used in reinforcement learning today, the general objectives correspond to information maximization through world models. This explains unsupervised learning, perception, exploration, skill discovery, and control from a single principle. Our findings suggest designing powerful world models as a path toward building highly adaptive agents that seek out large niches in their environments, rendering task rewards optional.
Training of neural networks is hard to describe theoretically due to the complicated non-linear dependence of network predictions on the parameters. However, the situation greatly simplifies in the limit of infinite network width, where the problem becomes quadratic, with the matrix given by the Neural Tangent Kernel (NTK). Such problems are more amenable to theoretical analysis and are mostly described by the spectral properties of the linear operator and the target function. In the first part of the talk, we will show that in certain scenarios the spectrum of the NTK and the eigendecomposition of the target function are asymptotically described by power laws, with simple explicit expressions for their exponents. In the second part of the talk, we will turn to general quadratic problems with power-law spectrum and give tight bounds for the convergence speed of various gradient descent algorithms: vanilla Gradient Descent (GD), the Heavy Ball (HB) method, GD and HB with predefined schedules, Steepest Descent, and Conjugate Gradients. The talk is based on joint work with Dmitry Yarotsky (arXiv:2105.00507 and arXiv:2202.00992).
We characterize the power-law asymptotics of learning curves for Gaussian process regression (GPR) under the assumption that the eigenspectrum of the prior and the eigenexpansion coefficients of the target function follow a power law. Under similar assumptions, we leverage the equivalence between GPR and kernel ridge regression (KRR) to derive the generalization error of KRR. Infinitely wide neural networks can be related to GPR with respect to the neural network GP kernel and the neural tangent kernel, which in several cases is known to have a power-law spectrum. Hence our methods can be applied to study the generalization error of infinitely wide neural networks. We present toy experiments demonstrating the theory. This is joint work with Pradeep Banerjee and Guido Montufar.
Inspired by the proposal of tangent kernels of neural networks (NNs), a recent line of research aims to design kernels with better generalization performance on standard datasets. Indeed, a few recent works showed that certain kernel machines perform as well as NNs on certain datasets, despite the separations between them implied by theoretical results in specific cases. Furthermore, it was shown that the induced kernels of convolutional neural networks perform much better than any former handcrafted kernels. These empirical results pose a theoretical challenge to understanding the performance gaps between kernel machines and NNs in different scenarios.
In this talk, we show that the structure of the data plays an essential role in inducing these performance gaps. We consider a few natural data structures and study their effects on the performance of these learning methods. Based on a fine-grained high-dimensional asymptotics framework for analyzing random features models and kernel machines, we show the following: 1) If the feature vectors are nearly isotropic, kernel methods suffer from the curse of dimensionality, while NNs can overcome it by learning the best low-dimensional representation; 2) If the feature vectors display the same low-dimensional structure as the target function (the spiked covariates model), this curse of dimensionality becomes milder, and the performance gap between kernel methods and NNs becomes smaller; 3) On datasets that display some invariance structure (e.g., image datasets), there is a quantitative performance gain from using invariant kernels (e.g., convolutional kernels) over inner product kernels. Beyond explaining the performance gaps, these theoretical results can further provide some intuition towards designing kernel methods with better performance.
Multi-armed bandit algorithms minimize the experimentation cost required to converge on optimal behavior. They do so by rapidly adapting experimentation effort away from poorly performing actions as feedback is observed. But this desirable feature makes them sensitive to confounding, which is the primary concern underlying classical randomized controlled trials. We highlight, for instance, that popular bandit algorithms cannot address the problem of identifying the best action when day-of-week effects may confound inferences. In response, this paper proposes deconfounded Thompson sampling, which makes simple, but critical, modifications to the way Thompson sampling is usually applied. Theoretical guarantees suggest the algorithm strikes a delicate balance between adaptivity and robustness to confounding. It attains asymptotic lower bounds on the number of samples required to confidently identify the best action, suggesting optimal adaptivity, but also satisfies strong performance guarantees in the presence of day-of-week effects and delayed observations, suggesting unusual robustness. At the core of the paper is a new, very general model of contextual bandit experiments in which issues of delayed learning and distribution shift arise organically.
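For readers less familiar with the baseline being modified, here is standard Bernoulli Thompson sampling in a few lines of numpy. This is the vanilla algorithm, not the deconfounded variant proposed in the paper, and the arm means are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])
k = len(true_means)
alpha, beta = np.ones(k), np.ones(k)            # Beta(1, 1) priors over arm means

for t in range(2000):
    theta = rng.beta(alpha, beta)               # sample a plausible mean for each arm
    arm = int(np.argmax(theta))                 # play the arm that looks best under the sample
    reward = rng.random() < true_means[arm]
    alpha[arm] += reward                        # conjugate posterior update
    beta[arm] += 1 - reward

print(alpha / (alpha + beta))                   # posterior mean estimates per arm
```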
Deep learning is nowadays used in a wide range of highly complex applications such as natural language processing or even scientific applications. Its first major breakthrough, however, was achieved by shattering the state of the art in image classification. We revisit the problem of classification by deep neural networks and attempt to answer why deep networks are remarkably effective in this regime. We will interpret the learning of classifiers as finding piecewise constant functions from labelled samples. We then precisely link the hardness of the learning problem to the complexity of the regions. Concretely, we will establish fundamental lower bounds on the learnability of certain regions. Finally, we will show that in many cases these optimal bounds can be achieved by deep-neural-network-based learning. In quite realistic settings, we will observe that deep neural networks can learn high-dimensional classifiers without a strong dependence of the learning rates on the dimension.
Adaptive gradient methods such as Adam have gained increasing popularity in deep learning optimization. However, it has been observed that, compared with (stochastic) gradient descent, Adam can converge to a different solution with a significantly worse test error in many deep learning applications such as image classification, even with fine-tuned regularization. In this talk, I will discuss some recent results on the convergence and generalization of Adam in training neural networks, and give a theoretical explanation for the difference between Adam and SGD. I will also present a new deep learning optimizer called the partially adaptive momentum estimation method, which achieves faster convergence rates and smaller test errors than Adam and SGD with momentum on various deep learning tasks.
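As a sketch of what "partially adaptive" means here, the update below interpolates between SGD with momentum and an Adam-style update through an exponent p on the second-moment estimate. The hyperparameters and the toy quadratic are illustrative choices, not the exact algorithm or settings from the talk.

```python
import numpy as np

def padam_step(theta, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, p=0.125, eps=1e-8):
    # Partially adaptive momentum update: p = 0 recovers momentum SGD,
    # p = 1/2 recovers an Adam-like update.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (v_hat ** p + eps)
    return theta, m, v

# Toy usage: minimize f(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta = np.ones(5)
m, v = np.zeros(5), np.zeros(5)
for t in range(1, 201):
    theta, m, v = padam_step(theta, theta, m, v, t)
print(theta)
```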
Bio:
Yuan Cao is an assistant professor in the Department of Statistics and Actuarial Science and the Department of Mathematics at the University of Hong Kong. Before joining HKU, he was a postdoctoral scholar at UCLA working with Professor Quanquan Gu. He received his B.S. from Fudan University and Ph.D. from Princeton University. Yuan’s research interests include the theory of deep learning, non-convex optimization, and high-dimensional statistics. He has published research papers in top machine learning journals and conferences (NeurIPS, ICML, AAAI, IJCAI, etc.), including a spotlight presentation at NeurIPS 2019 and a long talk at ICML 2021.
I will demonstrate that a neural network (NN) learns training data as simply as it can, resembling an implicit Occam's razor, from the following two viewpoints. First, the NN output often follows a frequency principle, i.e., the network learns the data from low to high frequency. Second, we prove an embedding principle: the loss landscape of an NN "contains" all the critical points of all narrower NNs. The embedding principle provides a basis for the condensation phenomenon, i.e., the NN weights condense on isolated directions when initialized small, which means the effective NN size is much smaller than its actual size, i.e., a simple representation of the training data.
Bio:
Zhi-Qin John Xu is an associate professor at Shanghai Jiao Tong University (SJTU). Zhi-Qin obtained a B.S. in Physics (2012) and a Ph.D. in Mathematics (2016) from SJTU. Before joining SJTU, Zhi-Qin worked as a postdoc at NYUAD and the Courant Institute from 2016 to 2019.
This talk will discuss certain peculiarities of training neural networks with batch normalization and weight decay, which has become a common practice in recent years. It turns out that their combined use may result in a surprising periodic behavior of optimization dynamics: the training process regularly exhibits destabilizations that, however, do not lead to complete divergence but cause a new period of training. We will delve deeper into the mechanism underlying the discovered periodic behavior, both empirically and theoretically, and analyze the conditions in which it occurs in practice. We will also demonstrate that periodic behavior can be regarded as a generalization of two previously opposing perspectives on training with batch normalization and weight decay, namely the equilibrium presumption and the instability presumption.
We propose a new geometric method for measuring the quality of representations obtained from deep learning. Our approach, called "Random Polytope Descriptor", provides an efficient description of data points based on the construction of random convex polytopes. We demonstrate the use of our technique by qualitatively comparing the behavior of classic and regularized autoencoders. This reveals that applying regularization to autoencoder networks may decrease the out-of-distribution detection performance in latent space. While our technique is similar in spirit to k-means clustering, we achieve significantly better false positive/negative balance in clustering tasks on autoencoded datasets. Joint work with Marek Kaluba and Lukas Ruff.
Deep learning relies on Artificial Neural Networks (ANNs) with deep architectures, machine learning models that have reached unparalleled performance in many domains, such as machine translation, autonomous vehicles, computer vision, text generation, and speech understanding. However, this impressive performance typically requires large datasets and massive ANN models. Gathering the data and training the models can take a long time and incur prohibitive costs. Significant research efforts are being invested in improving ANN training efficiency, i.e., the time, data, and resources required to train these models. For example, this is done by changing the model (e.g., architecture, numerical precision) or the training algorithm (e.g., parallelization). However, such modifications often cause an unexplained degradation in the generalization performance of the ANN on unseen data. Recent findings suggest that this degradation is caused by changes to the hidden algorithmic bias of the training algorithm and model. This bias determines which solution is selected from all the solutions that fit the data. I will discuss my work on understanding and controlling such algorithmic bias.
Bio:
Daniel is an associate professor in the Department of Electrical Engineering at the Technion, working in the areas of machine learning and theoretical neuroscience. He did his post-doc (as a Gruss Lipper fellow) working with Prof. Liam Paninski in the Department of Statistics and the Center for Theoretical Neuroscience at Columbia University. He is interested in all aspects of neural networks and deep learning. His recent works focus on quantization, resource efficiency, and implicit bias in neural networks. He is the recipient of the Gruss Lipper fellowship, the Goldberg Award, and Intel's Rising Star Faculty Award.
We contribute to a better understanding of the class of functions that is represented by a neural network with ReLU activations and a given architecture. Using techniques from mixed-integer optimization, polyhedral theory, and tropical geometry, we provide a mathematical counterbalance to the universal approximation theorems which suggest that a single hidden layer is sufficient for learning tasks. In particular, we investigate whether the class of exactly representable functions strictly increases by adding more layers (with no restrictions on size). We also present upper bounds on the sizes of neural networks required for exact function representation. This is joint work with Amitabh Basu, Marco Di Summa, and Martin Skutella.
Owing to the rapid development of big data and computing technology, deep learning methods have shown successful results in various application fields. From the viewpoint of theoretical studies, however, deep neural networks are still trained using the error backpropagation method, which was originally used for training conventional multilayer perceptrons, and its undesirable learning behaviors, such as plateaus, are inherited. Moreover, as the complexity of the model increases, it is likely that such undesirable phenomena appear in various forms, making it difficult to fully understand their properties. In this talk, I introduce an approach to understanding the strange learning behavior of neural networks, focusing on the complex singular structure of the neuromanifold. In order to investigate the effect of singularities on learning dynamics, we define three different types of learning scenarios according to the positional relationship between optimal and singular points. In these scenarios, we trace the evolution of the generalization error during learning using statistical mechanical methods, and show that the three learning scenarios have different dynamical properties. In particular, in the near-singular scenario with an over-parameterized model, we reveal the quasi-plateau phenomenon, a type of slow dynamics different from the well-known plateau. Through further analysis of the average learning equations around the singular points, we show that there are two different types of slow manifolds, associated with plateaus and quasi-plateaus, respectively. Additionally, we show that natural gradient learning, which is derived by considering the geometric structure of the parameter space, can alleviate the slow convergence caused by plateaus and quasi-plateaus.
Despite their outstanding performance, deep neural networks (DNNs) are susceptible to adversarial examples, imperceptibly perturbed examples causing misclassification. Similarly, though less studied, DNNs are also fragile to perturbations of their weights. This talk highlights my recent research on both input and weight robustness and investigates how the two problems are related. On the subject of adversarial examples, I discuss a confidence-calibrated version of adversarial training that allows obtaining robustness beyond the adversarial perturbations seen during training. Next, regarding weight robustness, I address robustness against random bit errors in the (quantized) weights, which plays an important role in improving the energy efficiency of DNN accelerators. Surprisingly, improved weight robustness can also be beneficial in terms of robustness against adversarial examples. Specifically, weight robustness can be thought of as flatness in the loss landscape with respect to perturbations of the weights. Using an intuitive flatness measure for adversarially trained DNNs, I demonstrate that flatness in the weight loss landscape improves adversarial robustness and helps to avoid robust overfitting.
Bio:
David Stutz is a final-year PhD student at the Max Planck Institute for Informatics supervised by Prof. Bernt Schiele and co-supervised by Prof. Matthias Hein from the University of Tübingen. He obtained his bachelor and master degrees in computer science from RWTH Aachen University. During his studies, he completed an exchange program with the Georgia Institute of Technology as well as several internships at Microsoft, Fyusion and Hyundai MOBIS, among others. He wrote his master thesis at the Max Planck Institute for Intelligent Systems supervised by Prof. Andreas Geiger. His PhD research focuses on obtaining robust deep neural networks, considering adversarial examples, corrupted examples or out-of-distribution examples. In a collaboration with IBM Research, subsequent work improves robustness against bit errors in (quantized) weights to enable energy-efficient and secure accelerators. This work was awarded an outstanding paper award at the CVPR CV-AML Workshop 2021. More recently, during an internship at DeepMind, he used conformal prediction for uncertainty estimation in medical diagnosis. He received several awards and scholarships including the Qualcomm Innovation Fellowship, RWTH Aachen University's Springorum Denkmünze and the STEM Award IT sponsored by ZF Friedrichshafen. His work has been published at top venues in computer vision and machine learning including ICCV, CVPR, IJCV, ICML and MLSys. More information can be found at www.davidstutz.de.
We consider approximate Bayesian model choice for model selection problems that involve models whose Fisher information matrices may fail to be invertible along other competing submodels. Such singular models do not obey the regularity conditions underlying the derivation of Schwarz's Bayesian information criterion BIC and the penalty structure in BIC generally does not reflect the frequentist large sample behaviour of the marginal likelihood. Although large sample theory for the marginal likelihood of singular models has been developed recently, the resulting approximations depend on the true parameter value and lead to a paradox of circular reasoning. Guided by examples such as determining the number of components in mixture models, the number of factors in latent factor models or the rank in reduced rank regression, we propose a resolution to this paradox and give a practical extension of BIC for singular model selection problems.
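For reference, the standard Schwarz criterion whose penalty structure is at issue can be written, with \hat{L} the maximized likelihood, k the number of free parameters, and n the sample size, as

```latex
\mathrm{BIC} \;=\; -2\log \hat{L} \;+\; k \log n .
```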
Optimization of many deep learning hyperparameters can be formulated as a bilevel optimization problem. While most black-box and gradient-based approaches require many independent training runs, we aim to adapt hyperparameters online as the network trains. The main challenge is to approximate the response Jacobian, which captures how the minimum of the inner objective changes as the hyperparameters are perturbed. To do this, we introduce the self-tuning network (STN), which fits a hypernetwork to approximate the best response function in the vicinity of the current hyperparameters. Differentiating through the hypernetwork lets us efficiently approximate the gradient of the validation loss with respect to the hyperparameters. We train the hypernetwork and hyperparameters jointly. Empirically, we can find hyperparameter settings competitive with Bayesian Optimization in a single run of training, and in some cases find hyperparameter schedules that outperform any fixed hyperparameter value.
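To illustrate the bilevel structure in the simplest possible case, the sketch below tunes a ridge-regularization hyperparameter by descending the validation loss through the inner problem's best response. Here the best response has a closed form and the outer gradient is taken by finite differences, standing in for the hypernetwork approximation used in an STN; all data and settings are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 60, 30
X, Xv = rng.standard_normal((n, d)), rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.5 * rng.standard_normal(n)
yv = Xv @ w_true + 0.5 * rng.standard_normal(n)

def best_response(log_lam):
    # Inner problem: ridge regression has a closed-form minimizer w*(lambda).
    lam = np.exp(log_lam)
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def val_loss(log_lam):
    w = best_response(log_lam)
    return float(np.mean((Xv @ w - yv) ** 2))

log_lam, lr, eps = 0.0, 0.1, 1e-4
for _ in range(200):                          # outer loop on the hyperparameter
    g = (val_loss(log_lam + eps) - val_loss(log_lam - eps)) / (2 * eps)
    log_lam -= lr * g
print(np.exp(log_lam), val_loss(log_lam))
```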
One fundamental problem in deep learning is understanding the excellent performance of deep Neural Networks (NNs) in practice. An explanation for the superiority of NNs is that they can realize a large family of complicated functions, i.e., they have powerful expressivity. The expressivity of a Neural Network with Piecewise Linear activations (PLNN) can be quantified by the maximal number of linear regions it can separate its input space into. In this talk, we provide several mathematical results needed for studying the linear regions of Piecewise Linear Convolutional Neural Networks (PLCNNs), and use them to derive the maximal and average numbers of linear regions for one-layer PLCNNs. Furthermore, we obtain upper and lower bounds for the number of linear regions of multi-layer PLCNNs. The Rectified Linear Unit (ReLU) is a piecewise linear activation function that has been widely adopted in various architectures. Our results suggest that deeper ReLU CNNs have more powerful expressivity than their shallow counterparts, and that ReLU CNNs have more expressivity per parameter than fully-connected ReLU NNs, in terms of the number of linear regions.
Bio:
Dr. Huan Xiong is an Assistant Professor at the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI). He received the B.S. and M.S. degrees from the School of Mathematical Sciences, Peking University, China, in 2010 and 2013, respectively, and the Ph.D. degree from the University of Zurich, Switzerland, in 2016. He was a postdoctoral researcher at the University of Strasbourg, France. His research interests include machine learning and discrete mathematics. He has published over 30 papers in top conferences/journals such as ICML, NeurIPS, CVPR and TPAMI.
The posterior over Bayesian neural network (BNN) parameters is extremely high-dimensional and non-convex. For computational reasons, researchers approximate this posterior using inexpensive mini-batch methods such as mean-field variational inference or stochastic-gradient Markov chain Monte Carlo (SGMCMC). To investigate foundational questions in Bayesian deep learning, we instead use full-batch Hamiltonian Monte Carlo (HMC) on modern architectures. We show that (1) BNNs can achieve significant performance gains over standard training and deep ensembles; (2) a single long HMC chain can provide a comparable representation of the posterior to multiple shorter chains; (3) in contrast to recent studies, we find posterior tempering is not needed for near-optimal performance, with little evidence for a "cold posterior" effect, which we show is largely an artifact of data augmentation; (4) BMA performance is robust to the choice of prior scale, and relatively similar for diagonal Gaussian, mixture of Gaussian, and logistic priors; (5) Bayesian neural networks show surprisingly poor generalization under domain shift; we demonstrate, explain and provide remedies for this effect; (6) while cheaper alternatives such as deep ensembles and SGMCMC methods can provide good generalization, they provide distinct predictive distributions from HMC. Notably, deep ensemble predictive distributions are similarly close to HMC as standard SGLD, and closer than standard variational inference.
References
I will talk about two of our recent papers:
[1] What Are Bayesian Neural Network Posteriors Really Like? Pavel Izmailov, Sharad Vikram, Matthew D. Hoffman, Andrew Gordon Wilson.
[2] Dangers of Bayesian Model Averaging under Covariate Shift. Pavel Izmailov, Patrick Nicholson, Sanae Lotfi, Andrew Gordon Wilson.
Other useful references:
[3] Bayesian Deep Learning and a Probabilistic Perspective of Generalization. Andrew Gordon Wilson, Pavel Izmailov.
[4] How Good is the Bayes Posterior in Deep Neural Networks Really? Florian Wenzel, Kevin Roth, Bastiaan S. Veeling, Jakub Świątkowski, Linh Tran, Stephan Mandt, Jasper Snoek, Tim Salimans, Rodolphe Jenatton, Sebastian Nowozin.
[5] Bayesian Neural Network Priors Revisited. Vincent Fortuin, Adrià Garriga-Alonso, Florian Wenzel, Gunnar Rätsch, Richard Turner, Mark van der Wilk, Laurence Aitchison.
Relative information (relative entropy, KL divergence) and variational inference are powerful tools for deriving learning algorithms and their asymptotic properties, for both static systems and dynamic systems. The goal of this talk is to motivate a general online stochastic learning algorithm for stochastic processes with latent variables or memory, that provably converges under some regularity conditions. Please visit https://bit.ly/3kmovql for details.
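For concreteness, the relative information (relative entropy) used throughout is the standard Kullback-Leibler divergence between densities p and q,

```latex
D(p \,\|\, q) \;=\; \int p(x)\, \log \frac{p(x)}{q(x)} \, dx .
```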
In the first half of the talk, we study static systems, viewing maximum likelihood and Bayesian inference through the lens of relative information. In particular, their generalization errors may be derived by resolving the singularities of relative information. We then frame the two learning algorithms as special cases of variational inference with different computational constraints.
In the second half of the talk, we study dynamic systems, extending this variational inference method and computational perspective to stochastic processes and online learning. In particular, the training objective function will be a form of relative information which can be optimized iteratively in a way similar to expectation-maximization. The relative information objective provides a precise way to discuss the trade-off between exploration and exploitation during training.
In this talk, I will present some of my work on the functional space associated with neural networks. I will focus on simple classes of networks, including feedforward networks with linear and polynomial activations and two-layer ReLU networks, that provide a tractable setting where many geometric properties of general networks can be studied in detail. In particular, I will emphasize the distinction between the intrinsic function space and its parameterization, in order to shed light on the impact of the architecture on the expressivity of a model and on the corresponding optimization landscapes.
Work done outside Amazon.
The training of many machine learning models can often be formulated as a single-objective optimization (minimization) problem, which can be efficiently solved by gradient-based optimization methods. However, there is a growing number of models that involve multiple interacting objectives. Differentiable games are a generalization of the standard single-objective optimization framework that allows us to model multiple players and objectives. However, new issues and challenges arise in solving differentiable games. The standard gradient descent-ascent algorithm can either converge to non-equilibrium fixed points or converge at a slow rate (if it converges at all). In this talk, I will present some of my work attacking both problems. In addition, I will introduce a unified and systematic framework for the global convergence analysis of first-order methods for solving strongly-monotone games.
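A standard toy example of the convergence issues mentioned above: on the bilinear game min_x max_y xy, simultaneous gradient descent-ascent spirals away from the equilibrium at the origin, while the extragradient method converges. This is a generic illustration rather than the specific algorithms analyzed in the talk; the step size and iteration count are arbitrary.

```python
import numpy as np

def grads(x, y):
    return y, x          # d/dx (x*y) = y,  d/dy (x*y) = x

eta = 0.1
x, y = 1.0, 1.0
for _ in range(500):                    # simultaneous gradient descent-ascent
    gx, gy = grads(x, y)
    x, y = x - eta * gx, y + eta * gy
print("GDA distance to equilibrium:", np.hypot(x, y))

x, y = 1.0, 1.0
for _ in range(500):                    # extragradient: look ahead, then update
    gx, gy = grads(x, y)
    x_h, y_h = x - eta * gx, y + eta * gy
    gx, gy = grads(x_h, y_h)
    x, y = x - eta * gx, y + eta * gy
print("Extragradient distance to equilibrium:", np.hypot(x, y))
```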
There has been enormous progress in the last few years in designing conceivable (though not always practical) neural networks that respect the gauge symmetries - or coordinate freedom - of physical law. Some of these frameworks make use of irreducible representations, some make use of higher order tensor objects, and some apply symmetry-enforcing constraints. Different physical laws obey different combinations of fundamental symmetries, but a large fraction (possibly all) of classical physics is equivariant to translation, rotation, reflection (parity), boost (relativity), and permutations. Here we show that it is simple to parameterize universally approximating polynomial functions that are equivariant under these symmetries, or under the Euclidean, Lorentz, and Poincaré groups, at any dimensionality d. The key observation is that nonlinear O(d)-equivariant (and related-group-equivariant) functions can be expressed in terms of a lightweight collection of scalars -- scalar products and scalar contractions of the scalar, vector, and tensor inputs. These results demonstrate theoretically that gauge-invariant deep learning models for classical physics with good scaling for large problems are feasible right now.
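A minimal numpy illustration of the "scalars" observation for the orthogonal group O(d): invariant features are built from pairwise inner products of the input vectors, and an equivariant vector output is a combination of the input vectors with invariant coefficients. The coefficient map below is a made-up stand-in for a learned function, not a construction from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 3, 4
V = rng.standard_normal((k, d))              # k input vectors in R^d (rows)

def invariant_features(V):
    G = V @ V.T                              # Gram matrix of pairwise inner products
    return G[np.triu_indices(len(V))]        # the distinct scalar products

def equivariant_output(V, weights_fn):
    c = weights_fn(invariant_features(V))    # invariant coefficients, one per vector
    return c @ V                             # O(d)-equivariant vector output

weights_fn = lambda s: np.tanh(s[:k])        # toy invariants -> coefficients map

Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # a random orthogonal matrix
out1 = equivariant_output(V, weights_fn) @ Q.T     # rotate the output
out2 = equivariant_output(V @ Q.T, weights_fn)     # rotate the inputs first
print(np.allclose(out1, out2))                     # True: equivariance holds
```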
Knowledge distillation introduced in the deep learning context is a method to transfer knowledge from one architecture to another. In particular, when the architectures are identical, this is called self-distillation. The idea is to feed in predictions of the trained model as new target values for retraining (and iterate this loop possibly a few times). It has been empirically observed that the self-distilled model often achieves higher accuracy on held out data. Why this happens, however, has been a mystery: the self-distillation dynamics does not receive any new information about the task and solely evolves by looping over training. This talk will provide a rigorous theoretical analysis of self-distillation. We focus on fitting a nonlinear function to training data, where the model space is Hilbert space and fitting is subject to L2 regularization in this function space. We show that self-distillation iterations modify regularization by progressively limiting the number of basis functions that can be used to represent the solution. This implies (as we also verify empirically) that while a few rounds of self-distillation may reduce over-fitting, further rounds may lead to under-fitting and thus worse performance.
This is joint work with Mehrdad Farajtabar (DeepMind) and Peter Bartlett (UC Berkeley).
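A minimal sketch of the self-distillation loop in a kernel ridge regression setting, which roughly matches the Hilbert-space-with-L2-regularization setup described above (the kernel, bandwidth, and the 0.5 cutoff used to count "surviving" basis functions are my own illustrative choices, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 60, 1e-2
x = np.sort(rng.uniform(-1, 1, n))
y = np.sin(3 * x) + 0.3 * rng.normal(size=n)          # noisy nonlinear target

def gaussian_kernel(a, b, bw=0.2):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * bw ** 2))

K = gaussian_kernel(x, x)
mu = np.linalg.eigvalsh(K)                            # eigenvalues of the kernel matrix

targets = y.copy()
for t in range(1, 6):
    alpha = np.linalg.solve(K + lam * np.eye(n), targets)   # kernel ridge fit
    preds = K @ alpha
    # After t rounds, eigen-direction i of K is shrunk by (mu_i / (mu_i + lam))^t,
    # so directions with small eigenvalues are progressively suppressed.
    shrink = (mu / (mu + lam)) ** t
    eff_dim = int((shrink > 0.5).sum())               # crude count of "surviving" basis functions
    print(f"round {t}: effective number of basis functions ~ {eff_dim}")
    targets = preds                                   # distill: next round fits these predictions
```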
Bio:
Hossein Mobahi is a research scientist at Google Research. His current interests relate to the theory of deep learning. Prior to joining Google in 2016, he was a postdoctoral researcher in CSAIL at MIT. He obtained his PhD in Computer Science from the University of Illinois at Urbana-Champaign (UIUC). He has co-organized the ICML 2018 Workshop “Modern Trends in Nonconvex Optimization for Machine Learning”, ICML 2019 Workshop “Understanding and Improving Generalization in Deep Learning”, and NeurIPS 2020 Competition “Predicting Generalization in Deep Learning”.
Optimization of many deep learning hyperparameters can be formulated as a bilevel optimization problem. While most black-box and gradient-based approaches require many independent training runs, we aim to adapt hyperparameters online as the network trains. The main challenge is to approximate the response Jacobian, which captures how the minimum of the inner objective changes as the hyperparameters are perturbed. To do this, we introduce the self-tuning network (STN), which fits a hypernetwork to approximate the best response function in the vicinity of the current hyperparameters. Differentiating through the hypernetwork lets us efficiently approximate the gradient of the validation loss with respect to the hyperparameters. We train the hypernetwork and hyperparameters jointly. Empirically, we can find hyperparameter settings competitive with Bayesian Optimization in a single run of training, and in some cases find hyperparameter schedules that outperform any fixed hyperparameter value.
It is well known that learning statistical associations from finite data requires regularization to avoid overfitting: regularization terms penalize overly complex functions to lower the risk that they capture random noise in the data. In the limit of infinite sample size, however, one can still learn arbitrarily complex statistical relations. I argue that regularization is recommended even in the population limit if one is interested in a causal model rather than a statistical model, because regularization can also mitigate bias from hidden common causes. This can be seen for simple linear and non-linear regression tasks, where I show a very explicit formal analogy between finite-sample bias and confounding bias. My theoretical results suggest that learning causal relations in the presence of hidden common causes should use particularly simple models.
Paper: D. Janzing: Causal regularization, NeurIPS 2019.
Wasserstein distances have recently seen a surge of applications in statistics and machine learning. This stems from the many advantageous properties they possess, such as the metric and topological structure of Wasserstein spaces, robustness to support mismatch, compatibility with gradient-based optimization, and rich geometric properties. In practice, we rarely have access to the actual distribution and only observe samples from it, which necessitates estimating the distance from data. A central issue is that such estimators suffer from the curse of dimensionality: their empirical convergence rate scales as n^{-1/d} for d-dimensional distributions. This rate deteriorates exponentially fast with dimension, making it impossible to obtain meaningful accuracy guarantees, especially given the dimensionality of real-world data. This talk will present a novel framework of smooth p-Wasserstein distances that inherit the properties of their classic counterparts while alleviating the empirical curse of dimensionality. Specifically, we will show that the empirical approximation error of the smooth distance decays as n^{-1/2} in all dimensions. For the special case of the smooth 1-Wasserstein distance, we will also derive a high-dimensional limit distribution, further highlighting the favorable statistical behavior of the smooth framework. The derivation of statistical efficiency for the general pth-order distance employs a new comparison result to the smooth dual Sobolev norm. Applications to implicit generative modeling will be considered, leveraging the parametric empirical convergence rates to establish n^{-1/2} generalization bounds in any dimension.
Bio: Ziv Goldfeld is an assistant professor in the School of Electrical and Computer Engineering, and a graduate field member in Computer Science and the Center of Applied Mathematics, at Cornell University. Before joining Cornell, he was a postdoctoral research fellow in LIDS at MIT, hosted by Yury Polyanskiy. Ziv graduated with a B.Sc., M.Sc., and Ph.D. (all summa cum laude) in Electrical and Computer Engineering from Ben Gurion University, Israel. His graduate advisor was Haim Permuter. Ziv's research interests include optimal transport theory, statistical learning theory, information theory, and high-dimensional and nonparametric statistics. He seeks to understand and design inference and information processing systems by formulating and solving mathematical models. Honors include the NSF CAREER Award, NSF CRII Award, the IBM University Award, and the Rothschild Postdoctoral Fellowship.
In this talk I will introduce the concept of algorithmic recourse, which aims to help individuals affected by an unfavorable algorithmic decision to recover from it. First, I will show that while algorithmic recourse is strongly related to counterfactual explanations, existing methods for the latter do not directly provide practical solutions for algorithmic recourse, as they do not account for the causal mechanisms governing the world. Then, I will present theoretical results proving that complete causal knowledge is needed to guarantee recourse, and show how algorithmic recourse can be used to formulate novel fairness definitions that shift the focus from the algorithm to the data distribution. Such definitions of fairness allow us to distinguish between situations where unfairness is better addressed by societal intervention, as opposed to changes to the classifier. Finally, I will present practical solutions for (fairness in) algorithmic recourse in realistic scenarios where only limited causal knowledge is available.
The early phase of training of deep neural networks has a dramatic and counterintuitive effect on the local curvature of the loss function. For instance, we found that using a small learning rate does not guarantee stable optimization, because the optimization trajectory has a tendency to steer towards regions of the loss surface with increasing local curvature. Equally surprisingly, using a small learning rate negatively impacts generalization. I will discuss our journey in understanding these and other phenomena. The focus of the talk will be our mechanistic explanation for how using a large learning rate impacts generalization, which we corroborate by developing an explicit regularizer that reproduces its implicit regularization effects [1,2].
[1] The Break-Even Point on Optimization Trajectories of Deep Neural Networks, Jastrzebski et al, ICLR 2020
[2] Catastrophic Fisher Explosion: Early Phase Fisher Matrix Impacts Generalization, Jastrzebski et al, ICML 2021
Statistical inverse problems lead to complex optimisation and/or Monte Carlo sampling problems. Gradient descent and Langevin samplers provide examples of widely used algorithms. In my talk, I will discuss recent results on sampling algorithms, which can be viewed as interacting particle systems, and their mean-field limits. I will highlight the geometric structure of these mean-field equations within the so-called Otto calculus, that is, a gradient flow structure in the space of probability measures. Affine invariance is an important outcome of recent work on the subject, a property shared by Newton's method but not by gradient descent or ordinary Langevin samplers. The emerging affine-invariant gradient flow structures allow us to discuss coupling-based Bayesian inference methods, such as the ensemble Kalman filter, as well as invariance-of-measure-based inference methods, such as preconditioned Langevin dynamics, within a common mathematical framework. Applications include nonlinear and logistic regression.
A popular approach to learning encoders for lossy compression is to use additive uniform noise during training as a differentiable approximation to test-time quantization. We demonstrate that a uniform noise channel can also be implemented at test time using universal quantization (Ziv, 1985). This allows us to eliminate the mismatch between training and test phases while maintaining a completely differentiable loss function. Implementing the uniform noise channel is a special case of the more general problem of communicating a sample, which we prove is computationally hard if we do not make assumptions about its distribution. However, the uniform special case is efficient as well as easy to implement and thus of great interest from a practical point of view. Finally, we show that quantization can be obtained as a limiting case of a soft quantizer applied to the uniform noise channel, bridging compression with and without quantization.
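A tiny sketch of the idea above (my own illustration, not the paper's code): with a shared dither u that is uniform on (-1/2, 1/2), test-time universal quantization round(x + u) - u induces an error that is uniform and independent of x, i.e., the same channel as the additive-uniform-noise surrogate used during training.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(scale=3.0, size=100_000)              # arbitrary "latent" values

# Training-time surrogate: additive uniform noise.
noise = rng.uniform(-0.5, 0.5, size=x.shape)
train_channel = x + noise

# Test-time universal quantization with a shared dither u:
# the decoder knows u, so it can subtract it after rounding.
u = rng.uniform(-0.5, 0.5, size=x.shape)
test_channel = np.round(x + u) - u

# Both channels add noise that is uniform on (-1/2, 1/2) and independent of x.
for name, z in [("uniform noise", train_channel), ("universal quantization", test_channel)]:
    err = z - x
    print(f"{name:>24}: mean={err.mean():+.4f}, var={err.var():.4f} (uniform var = {1/12:.4f})")
```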
The remarkable success of deep learning in computer science points to potentially great applications of deep learning in computational and applied mathematics. Understanding the mathematical principles of deep learning is crucial to validating and advancing deep learning-based PDE solvers. We present a few thoughts on the theoretical foundation of this topic for high-dimensional partial differential equations, including approximation, optimization, and generalization. Though our analysis is not a complete story and many pieces are still missing, it may nevertheless provide some insight into deep learning.
Artificial neural networks are achieving state-of-the-art and sometimes superhuman performance on learning tasks across a variety of domains. Whenever these problems require learning in a continual or sequential manner, however, neural networks suffer from catastrophic forgetting: they forget how to solve previous tasks after being trained on a new task, despite having the capacity to solve both tasks if they were trained on both simultaneously. In this talk, we introduce this phenomenon and propose several methods to address it from a variety of angles.
Bio:
Mehrdad Farajtabar is a research scientist at Google DeepMind working on machine learning and its applications. His recent research interests are continual learning of neural networks, learning under evolving data distributions, and reinforcement learning. Before joining DeepMind, he graduated with a PhD in computational science and engineering from Georgia Tech in 2018; he holds M.Sc. and B.Sc. degrees in Artificial Intelligence and Software Engineering from Sharif University of Technology.
The purpose of this talk is to highlight three recent directions in the study of implicit bias --- one of the current promising approaches to trying to develop a tight generalization theory for deep networks, one interwoven with optimization. The first direction is a warm-up with purely linear predictors: here, the implicit bias perspective gives the fastest known hard-margin SVM solver! The second direction is on the early training phase with shallow networks: here, implicit bias leads to good training and testing error, with not just narrow networks but also arbitrarily large ones. The talk concludes with deep networks, providing a variety of structural lemmas which capture foundational aspects of how weights evolve for any width and sufficiently large amounts of training.
Joint work with Ziwei Ji.
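As a toy illustration of the first (warm-up) direction above, the sketch below runs gradient descent on the logistic loss for a separable dataset whose hard-margin direction is known by construction; the normalized iterate slowly aligns with the SVM direction. The dataset, step size, and iteration counts are arbitrary choices of mine, and the directional convergence is known to be slow (roughly logarithmic).

```python
import numpy as np

# Linearly separable toy data; the closest pair +-(1, 2) determines the max margin.
X = np.array([[1., 2.], [3., 3.], [2., 4.], [-1., -2.], [-3., -3.], [-2., -4.]])
y = np.array([1., 1., 1., -1., -1., -1.])
w_svm = np.array([1., 2.]) / np.sqrt(5.)       # hard-margin SVM direction (by construction)

w = np.zeros(2)
lr = 0.1
for t in range(1, 100_001):
    margins = y * (X @ w)
    grad = -(X * (y / (1 + np.exp(margins)))[:, None]).mean(axis=0)  # grad of mean logistic loss
    w -= lr * grad
    if t in (100, 1000, 10_000, 100_000):
        direction = w / np.linalg.norm(w)
        angle = np.degrees(np.arccos(np.clip(direction @ w_svm, -1, 1)))
        print(f"step {t:>7}: angle to SVM direction = {angle:.3f} deg")
```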
Bio: Matus Telgarsky is an assistant professor at the University of Illinois, Urbana-Champaign, specializing in deep learning theory. He was fortunate to receive a PhD at UCSD under Sanjoy Dasgupta. Other highlights include: co-founding, in 2017, the Midwest ML Symposium (MMLS) with Po-Ling Loh; receiving a 2018 NSF CAREER award; organizing a Simons Institute summer 2019 program on deep learning with Samy Bengio, Aleksander Madry, and Elchanan Mossel.
Deep neural networks trained with gradient descent have been extremely successful at learning solutions to a broad suite of difficult problems across a wide range of domains such as vision, gameplay, and natural language, many of which had previously been considered to require intelligence. Despite their tremendous success, we still do not have a detailed, predictive understanding of how these systems work. In this talk, I will focus on recent efforts to understand the structure of deep neural network loss landscapes and how gradient descent navigates them during training. In particular, I will discuss a phenomenological approach to modelling their large-scale structure [1], the role of their nonlinear nature in the early phases of training [2], and its effects on ensembling and calibration. [3,4]
[1] Stanislav Fort, and Stanislaw Jastrzebski. “Large Scale Structure of Neural Network Loss Landscapes.” Advances in Neural Information Processing Systems 32 (NeurIPS 2019). arXiv 1906.04724
[2] Stanislav Fort et al. "Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the Neural Tangent Kernel". NeurIPS 2020. arXiv 2010.15110
[3] Stanislav Fort, Huiyi Hu, Balaji Lakshminarayanan. "Deep Ensembles: A Loss Landscape Perspective." arXiv 1912.02757
[4] Marton Havasi et al. "Training independent subnetworks for robust prediction". ICLR 2021. arXiv 2010.06610
The pairwise interaction paradigm of graph machine learning has predominantly governed the modelling of relational systems. However, graphs alone cannot capture the multi-level interactions present in many complex systems, and the expressive power of such schemes was proven to be limited. To overcome these limitations, we propose Message Passing Simplicial Networks (MPSNs), a class of models that perform message passing on simplicial complexes (SCs) - topological objects generalising graphs to higher dimensions. To theoretically analyse the expressivity of our model, we introduce a Simplicial Weisfeiler-Lehman (SWL) colouring procedure for distinguishing non-isomorphic SCs. We relate the power of SWL to the problem of distinguishing non-isomorphic graphs and show that SWL and MPSNs are strictly more powerful than the WL test and not less powerful than the 3-WL test. We deepen the analysis by comparing our model with traditional graph neural networks with ReLU activations in terms of the number of linear regions of the functions they can represent. We empirically support our theoretical claims by showing that MPSNs can distinguish challenging strongly regular graphs for which GNNs fail and, when equipped with orientation equivariant layers, they can improve classification accuracy in oriented SCs compared to a GNN baseline. Additionally, we implement a library for message passing on simplicial complexes that we plan to release in due course.
Regularization is considered a key concept in the explanation and analysis of successful learning algorithms. In contrast, modern machine learning practice often suggests invoking highly expressive models that can completely interpolate the data, with far more free parameters than examples. To resolve this alleged contradiction, the notion of implicit bias, or implicit regularization, has been suggested as a means to explain the surprising generalization ability of modern-day overparameterized learning algorithms. In this talk, we will revisit this paradigm in one of the most well-studied and well-understood models for theoretical machine learning: Stochastic Convex Optimization (SCO).
We begin by discussing new results that highlight the role of the optimization algorithm in learning. We give a new result that separates the generalization performance of stochastic gradient descent (SGD) from that of full-batch gradient descent (GD), as well as regularized GD. We show that while all of these algorithms optimize the empirical loss at the same rate, their generalization performance can be significantly different. We next discuss the implicit bias of SGD in this context and ask whether the implicit bias accounts for SGD's ability to generalize. We provide several constructions that point to significant difficulties in providing a comprehensive explanation of an algorithm's generalization performance by arguing solely about its implicit regularization properties.
On the one hand, these results demonstrate the importance of the optimization algorithm for generalization. On the other hand, they also hint that the reasons for the different performance may not be explained or understood solely through investigations of the algorithm's implicit bias.
Based on joint works with: Idan Amir, Assaf Dauber, Meir Feder, Tomer Koren.
The human visual system is proof that it is possible to learn new categories with very few samples; humans do not need a million samples to learn to distinguish a poisonous mushroom from an edible one in the wild. Such ability, arguably, comes from having seen millions of other categories and transferring learnt representations to the new categories. This talk will present a formal connection of machine learning with thermodynamics to characterize the quality of learnt representations for transfer learning. We will discuss how information-theoretic functionals such as rate, distortion and classification loss lie on a convex, so-called, equilibrium surface. We prescribe dynamical processes to traverse this surface under constraints, e.g., an iso-classification process that modulates rate and distortion to keep the classification loss unchanged. We demonstrate how such processes allow complete control over the transfer from a source dataset to a target dataset and can guarantee the performance of the final model. We will also discuss information-geometric methods to characterize the distance between learning tasks. This talk will discuss results from https://arxiv.org/abs/1909.02729, https://arxiv.org/abs/2002.12406 and https://arxiv.org/abs/2011.00613.
Bio: Pratik Chaudhari is an Assistant Professor in Electrical and Systems Engineering and Computer and Information Science at the University of Pennsylvania. He is a member of the General Robotics, Automation, Sensing and Perception (GRASP) Laboratory. From 2018-19, he was a Senior Applied Scientist at Amazon Web Services and a post-doctoral scholar in Computing and Mathematical Sciences at the California Institute of Technology. Pratik received his PhD (2018) in Computer Science from the University of California Los Angeles, his Master's (2012) and Engineer's (2014) degrees in Aeronautics and Astronautics from the Massachusetts Institute of Technology and his Bachelor's degree (2010) from the Indian Institute of Technology Bombay. He was a part of NuTonomy Inc. (now Hyundai-Aptiv) from 2014-16 where he worked on urban autonomous navigation.
Natural Gradient Descent (NGD) helps to accelerate the convergence of gradient descent dynamics, but it requires approximations in large-scale deep neural networks because of its high computational cost. Empirical studies have confirmed that some NGD methods with approximate Fisher information converge sufficiently fast in practice. In this talk, we reveal that, under specific conditions, NGD with approximate Fisher information achieves the same fast convergence to global minima as exact NGD. We consider deep neural networks in the infinite-width limit, and analyze the asymptotic training dynamics of NGD in the neural tangent kernel regime. The training dynamics with the approximate Fisher information are identical to those with the exact Fisher information in the function space, and they converge quickly. This result holds in layer-wise, K-FAC, and unit-wise approximations. All of these different approximations have an isotropic gradient in the function space, and this plays a fundamental role in achieving the same fast convergence in training.
Can overparameterized neural networks trained by SGD provably generalize when the labels are corrupted with substantial random noise? We answer this question in the affirmative by showing that, for a broad class of distributions, one-hidden-layer networks trained by SGD generalize when the distribution is linearly separable but corrupted with adversarial label noise, despite the capacity to overfit. Equivalently, such networks have classification accuracy competitive with that of the best halfspace over the distribution. Our results hold for networks of arbitrary width and for arbitrary initializations of SGD. In particular, we do not rely upon the approximations to infinite-width networks that are typically used in theoretical analyses of SGD-trained neural networks.
Recently, a number of empirical "universal" scaling-law papers have been published, most notably by OpenAI. 'Scaling laws' refers to power-law decreases of training or test error with respect to more data, larger neural networks, and/or more compute. In this work we focus on scaling with respect to the data size n. Theoretical understanding of this phenomenon is largely lacking, except in finite-dimensional models for which the error typically decreases as n^{-1/2} or n^{-1}, where n is the sample size. We develop and theoretically analyse the simplest possible (toy) model that can exhibit n^{-β} learning curves for arbitrary power β > 0, and determine whether power laws are universal or depend on the data distribution.
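For concreteness, a back-of-the-envelope sketch (my own, not from the paper) of how such a power-law exponent is typically read off a learning curve: fit a line to log(error) versus log(n) on synthetic data generated with a known exponent.

```python
import numpy as np

rng = np.random.default_rng(0)
beta_true, c = 0.4, 2.0
n = np.logspace(2, 6, 12).astype(int)
# Synthetic learning curve: test error ~ c * n^{-beta} with multiplicative noise.
err = c * n.astype(float) ** (-beta_true) * np.exp(0.05 * rng.normal(size=n.size))

# Estimate beta from the slope of log(error) vs log(n).
slope, intercept = np.polyfit(np.log(n), np.log(err), deg=1)
print(f"estimated beta = {-slope:.3f} (true {beta_true}), prefactor ~ {np.exp(intercept):.3f}")
```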
In this talk, I will first highlight how the behavior of normalized nets trained by SGD departs from traditional optimization viewpoints in several different ways (e.g., the use of exponentially increasing learning rates). Then I will present a formal framework for studying their mathematics via a suitable adaptation of the conventional framework, namely, modeling the SGD-induced training trajectory via a suitable stochastic differential equation (SDE) driven by Brownian motion. This yields: (a) A new 'intrinsic learning rate' parameter that is the product of the normal learning rate and the weight decay factor. Analysis of the SDE shows how the effective speed of learning varies and equilibrates over time under the control of the intrinsic LR. (b) A challenge -- via theory and experiments -- to the popular belief that good generalization requires large learning rates at the start of training. (c) New experiments, backed by mathematical intuition, suggesting that the number of steps to equilibrium (in function space) scales as the inverse of the intrinsic learning rate, as opposed to the exponential-time convergence bound implied by SDE analysis. We name this the Fast Equilibrium Conjecture. Finally, I will discuss the validity of such conventional SDE approximations of SGD.
The talk will be based on the following papers:
Zhiyuan Li, Sanjeev Arora, “An Exponential Learning Rate Schedule for Deep Learning”, ICLR 2020
Zhiyuan Li, Kaifeng Lyu, Sanjeev Arora, “Reconciling Modern Deep Learning with Traditional Optimization Analyses: The Intrinsic Learning Rate”, NeurIPS 2020
Zhiyuan Li, Sadhika Malladi, Sanjeev Arora, “On the Validity of Modeling SGD with Stochastic Differential Equations (SDEs)”
In this talk, I will focus on the 'tail behavior' of SGD in deep learning. I will first empirically illustrate that heavy tails arise in the gradient noise (i.e., the difference between the stochastic gradient and the true gradient). I will then propose to model the gradient noise as a heavy-tailed α-stable random vector and, accordingly, to analyze SGD as a discretization of a stochastic differential equation (SDE) driven by a stable process. As opposed to classical SDEs driven by Brownian motion, SDEs driven by stable processes can incur 'jumps', which force the SDE (and its discretization) to transition from 'narrow minima' to 'wider minima', as proven by existing metastability theory and the extensions we proved recently. These results open up a different perspective and shed more light on the view that SGD 'prefers' wide minima. In the second part of the talk, I will focus on the generalization properties of such heavy-tailed SDEs and show that the generalization error can be controlled by the Hausdorff dimension of the trajectories of the SDE, which is closely linked to the tail behavior of the driving process. Our results imply that heavier-tailed processes should achieve better generalization; hence, the tail index of the process can be used as a notion of "capacity metric". Finally, I will talk about the originating cause of such heavy-tailed behavior and present theoretical results showing that heavy tails can emerge even in very sterile settings such as linear regression with i.i.d. Gaussian data.
The talk will be based on the following papers:
U. Şimşekli, L. Sagun, M. Gürbüzbalaban, "A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks", ICML 2019
T. H. Nguyen, U. Şimşekli, M. Gürbüzbalaban, G. Richard, "First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise", NeurIPS 2019
U. Şimşekli, O. Sener, G. Deligiannidis, M. A. Erdogdu, "Hausdorff Dimension, Stochastic Differential Equations, and Generalization in Neural Networks", NeurIPS 2020
M. Gurbuzbalaban, U. Simsekli, L. Zhu, "The Heavy-Tail Phenomenon in SGD", arXiv, 2020
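A small simulation in the spirit of the metastability picture above (an illustration of mine, not taken from the papers): noisy gradient descent on a one-dimensional double-well potential rarely leaves its basin under Gaussian noise, while alpha-stable noise of the same scale occasionally produces very large jumps between basins. The potential, learning rate, tail index, and the clipping safeguard are all assumptions made for the demo, and the exact crossing counts depend on the random seed.

```python
import numpy as np
from scipy.stats import levy_stable

n_steps, lr = 5000, 0.01
rng = np.random.default_rng(0)

def run(noise):
    """Noisy gradient descent on the double-well potential U(x) = (x^2 - 1)^2."""
    x, traj = -1.0, []
    for xi in noise:
        x = x - lr * 4 * x * (x ** 2 - 1) + lr * xi   # gradient step + injected noise
        x = float(np.clip(x, -5, 5))                  # numerical safeguard after huge jumps
        traj.append(x)
    return np.array(traj)

gaussian_noise = rng.normal(size=n_steps)
stable_noise = levy_stable.rvs(alpha=1.3, beta=0.0, size=n_steps, random_state=1)

for name, noise in [("Gaussian noise", gaussian_noise),
                    ("alpha-stable noise (alpha=1.3)", stable_noise)]:
    traj = run(noise)
    crossings = int((np.diff(np.sign(traj)) != 0).sum())
    print(f"{name:>32}: basin crossings = {crossings}, "
          f"largest raw noise spike = {np.abs(noise).max():.1f}")
```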
I will discuss two recent works, which apply backward error analysis to describe the influence of finite learning rates on the dynamics of Gradient Descent [1] and Random Shuffling SGD [2]. In particular, I will show that, when using small finite learning rates, the path taken by Random Shuffling SGD coincides with the path taken by gradient flow when minimizing a modified loss function. This modified loss function contains an implicit source of regularization which enhances the test accuracies achieved by deep networks on standard benchmarks.
I will also briefly discuss an empirical study clarifying the key phenomena observed when training deep networks with SGD [3].
[1] Implicit Gradient Regularization, Barrett and Dherin, ICLR 2021
[2] On the Origin of Implicit Regularization in Stochastic Gradient Descent, Smith et al., ICLR 2021
[3] On the Generalization Benefit of Noise in Stochastic Gradient Descent, Smith et al., ICML 2020
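A toy numerical check of the backward-error-analysis picture in [1] (my own sketch; as I recall, the modified loss there is L + (h/4)||∇L||², and the 1D loss, step size, and horizon below are arbitrary choices): gradient descent with step size h is tracked much more closely by the gradient flow of the modified loss than by the flow of the original loss.

```python
import numpy as np

h, theta0, T = 0.01, 1.5, 50             # GD step size, start point, number of GD steps

def grad(t):     return 4 * t ** 3 - 4 * t                      # L(t) = (t^2 - 1)^2
def hess(t):     return 12 * t ** 2 - 4
def grad_mod(t): return grad(t) + (h / 2) * hess(t) * grad(t)   # gradient of L + (h/4) L'(t)^2

# Discrete gradient descent with step size h.
theta = theta0
for _ in range(T):
    theta -= h * grad(theta)

def flow(g, total_time, substeps=50_000):
    """Integrate d(theta)/ds = -g(theta) with very small Euler steps (proxy for the exact flow)."""
    t, dt = theta0, total_time / substeps
    for _ in range(substeps):
        t -= dt * g(t)
    return t

plain = flow(grad, T * h)
modified = flow(grad_mod, T * h)
print(f"GD after {T} steps of size {h}:  {theta:.6f}")
print(f"flow of L:        {plain:.6f}  |gap to GD| = {abs(plain - theta):.2e}")
print(f"flow of L_tilde:  {modified:.6f}  |gap to GD| = {abs(modified - theta):.2e}")
```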
Many learning problems (e.g. VAE, MDL, Information Bottleneck, various regularization techniques) involve a tradeoff between prediction error and some measure of complexity of the model or the representation, minimizing a linear combination of the form (prediction error) + beta*complexity for some hyperparameter beta that parametrizes the tradeoff. How does the learning change as we vary beta, and what is its relationship with the structure of the dataset and model capacity? How can we design practical objectives and algorithms to obtain good prediction with low complexity?
To gain insight, we study phase transitions, identifying beta-values where key quantities such as prediction accuracy change in a discontinuous way. We then introduce a general technique for analytically and algorithmically predicting phase transitions where the global minimum of the loss landscape transitions to a saddle point. For the Information Bottleneck, we derive accurate phase transition predictions that illuminate the relation between the objective, dataset structure, learned representation, and model capacity, for example identifying classes that are easy vs. hard to learn. We also draw a close connection between the phase transition in the Information Bottleneck and second-order phase transitions in physics. Finally, I will introduce similar tradeoff phenomena in other learning scenarios and point to open problems.
In this talk, I will overview recent results towards understanding how we learn large-capacity machine learning models. In the modern practice of machine learning, especially deep learning, many successful models have far more trainable parameters than training examples, leading to ill-posed optimization objectives. In practice though, when such ill-posed objectives are minimized using local search algorithms like (stochastic) gradient descent ((S)GD), the "special" minimizers returned by these algorithms have remarkably good performance on new examples. In this talk, we will explore the role of optimization algorithms like (S)GD in learning overparameterized models, focusing on the simpler setting of learning linear predictors.
Bio: Suriya Gunasekar is a Senior Researcher in the Machine Learning Foundations group at Microsoft Research at Redmond. Prior to joining MSR, she was a Research Assistant Professor at Toyota Technological Institute at Chicago. She received her PhD in Electrical and Computer Engineering from The University of Texas at Austin.
In this talk, I discuss how deep learning can statistically outperform shallow methods such as kernel methods by exploiting sparsity of the target function space, and present a non-convex optimization framework with generalization and excess risk bounds. In the first half, I will summarize our recent work on excess risk bounds for deep learning in the Besov space and its variants. It will be shown that the superiority of deep learning stems from the sparsity of the target function space, and, more essentially, that the non-convex geometry of the space characterizes this property. In such a situation, deep learning can achieve so-called adaptive estimation, which gives a better excess risk than shallow methods. In the second half, I present a deep learning optimization framework based on noisy gradient descent in an infinite-dimensional Hilbert space (gradient Langevin dynamics), and show generalization error and excess risk bounds for the solution obtained by this procedure. The proposed framework can deal with finite- and infinite-width networks simultaneously, unlike existing ones such as the neural tangent kernel and mean-field analyses.
Modern neural networks are typically trained in an over-parameterized regime where the parameters of the model far exceed the size of the training data. Such neural networks in principle have the capacity to (over)fit any set of labels, including significantly corrupted ones. Despite this (over)fitting capacity, over-parameterized networks have an intriguing robustness capability: they are surprisingly robust to label noise when first-order methods with early stopping are used to train them. Even more surprisingly, one can remove noise and corruption from a natural image without using any training data whatsoever, by simply fitting (via gradient descent) a randomly initialized, over-parameterized convolutional generator to a single corrupted image. In this talk I will first present theoretical results aimed at explaining the robustness capability of neural networks when trained via early-stopped gradient descent. I will then present results towards demystifying untrained networks for image reconstruction/restoration tasks such as denoising and those arising in inverse problems such as compressive sensing.
While reinforcement learning algorithms provide automated acquisition of optimal policies, practical application of such methods requires a number of design decisions, such as manually designing reward functions that not only define the task, but also provide sufficient shaping to accomplish it. In this paper, we discuss a new perspective on reinforcement learning, recasting it as the problem of inferring actions that achieve desired outcomes, rather than a problem of maximizing rewards. To solve the resulting outcome-directed inference problem, we establish a novel variational inference formulation that allows us to derive a well-shaped reward function which can be learned directly from environment interactions. From the corresponding variational objective, we also derive a new probabilistic Bellman backup operator reminiscent of the standard Bellman backup operator and use it to develop an off-policy algorithm to solve goal-directed tasks. We empirically demonstrate that this method eliminates the need to design reward functions and leads to effective goal-directed behaviors.
Tim G. J. Rudner is a PhD Candidate in the Department of Computer Science at the University of Oxford, supervised by Yarin Gal and Yee Whye Teh. His research interests span Bayesian deep learning, reinforcement learning, and variational inference. He holds a master’s degree in statistics from the University of Oxford and an undergraduate degree in mathematics and economics from Yale University. Tim is also an AI Fellow at Georgetown University's Center for Security and Emerging Technology (CSET), a Fellow of the German National Academic Foundation, and a Rhodes Scholar.
We investigate two causes for adversarial vulnerability in deep neural networks: bad data and (poorly) trained models. When trained with SGD, deep neural networks essentially achieve zero training error, even in the presence of label noise, while also exhibiting good generalization on natural test data, something referred to as benign overfitting. However, these models are vulnerable to adversarial attacks. We identify label noise as one of the causes for adversarial vulnerability, and provide theoretical and empirical evidence in support of this. Surprisingly, we find several instances of label noise in datasets such as MNIST and CIFAR, and that robustly trained models incur training error on some of these, i.e. they don't fit the noise. However, removing noisy labels alone does not suffice to achieve adversarial robustness. Standard training procedures bias neural networks towards learning "simple" classification boundaries, which may be less robust than more complex ones. We observe that adversarial training does produce more complex decision boundaries. We conjecture that in part the need for complex decision boundaries arises from sub-optimal representation learning. By means of simple toy examples, we show theoretically how the choice of representation can drastically affect adversarial robustness.
Recent investigations into infinitely-wide deep neural networks have given rise to intriguing connections between deep networks, kernel methods, and Gaussian processes. Nonetheless, there are important dynamical regimes for finite-width neural networks that lie far outside the realm of applicability of these results. I will discuss how the choice of learning rate in gradient descent is a crucial factor that naturally classifies gradient descent dynamics of deep nets into two classes (a “lazy” regime and a “catapult” regime). These phases are separated by a sharp phase transition as deep networks become wider. I will describe the distinct phenomenological signatures of the two phases, how they are elucidated in a class of solvable simple models, and the implications for model performance.
Deep neural networks have achieved significant empirical success in many fields, including computer vision, machine learning, and artificial intelligence. Along with this empirical success, deep learning has been shown theoretically to be attractive in terms of its expressive power. However, expressive power alone does not ensure that the optimization process of a neural network efficiently finds a solution that is desirable in terms of optimization, robustness, and generalization. In this talk, I will discuss some theoretical results on optimization and on the effect of mixup on robustness and generalization.
The mathematical heart of deep learning is gradient descent on a loss function L. If gradient descent converges, it will converge to a critical point of L. Thus the geometry of the locus of critical points is of great interest. We will discuss what is known about the critical points of L, including dimension estimates and connectedness results.
Deep neural networks are often considered to be complicated "black boxes," for which a full systematic analysis is not only out of reach but also impossible. In this talk, which is based on ongoing joint work with Sho Yaida and Daniel Adam Roberts, I will make the opposite claim: namely, that deep neural networks with random weights and biases are perturbatively solvable models. Our approach applies to networks at finite width n and large depth L, the regime in which they are used in practice. A key point will be the emergence of a notion of "criticality," which involves a fine-tuning of model parameters (weight and bias variances). At criticality, neural networks are particularly well-behaved but still exhibit a tension between large values of n and L, with larger values of n tending to make neural networks more like Gaussian processes and large values of L amplifying higher cumulants. Our analysis at initialization has many consequences also for networks during and after training, which I will discuss if time permits.
Top-down decision tree learning algorithms, such as CART, ID3, and C4.5, have been machine learning workhorses for decades. However, there hasn't been much theoretical work proving that those algorithms can effectively learn decision trees. In part 1 of this talk, we prove that a large class of top-down algorithms learn a decision tree with accuracy close to that of the best small decision tree as long as the dataset is monotone. In part 2, we develop a new splitting criterion with similar guarantees even if the dataset is not monotone.
This talk is based on joint work with Neha Gupta, Jane Lange, and Li-Yang Tan.
It is well known that initialization can have a huge impact on the performance of deep neural networks (DNNs). In this talk, focusing on regression problems, I will present our empirical and theoretical studies of two types of influence of initialization on the generalization of DNNs. The first type of influence is through a biased initial DNN output function, whereas the second type is through changing the behavior of the training dynamics. I will also talk about the anti-symmetrical initialization (ASI) trick and other practical implications of our results.
In this talk, we show that there is a large gap between the maximum complexity of the functions that a neural network can express and the expected complexity of the functions that it learns in practice. Deep ReLU networks are piecewise linear functions, and the number of distinct linear regions is a natural measure of their expressivity. It is well-known that the maximum number of linear regions grows exponentially with the depth of the network, and this has often been used to explain the success of deeper networks. We show that the expected number of linear regions in fact grows polynomially in the size of the network, far below the exponential upper bound and independent of the depth of the network. This statement holds true both at initialization and after training, under natural assumptions for gradient-based learning algorithms. We also show that the linear regions of a ReLU network reveal information about the network's parameters. In particular, it is possible to reverse-engineer the weights and architecture of an unknown deep ReLU network merely by querying it.
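A simple way to see the gap empirically (my own sketch, not the paper's experiment) is to count the activation regions a randomly initialized ReLU network actually crosses along a one-dimensional line in input space; the architecture, initialization scale, and sampling resolution below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, widths = 10, [32, 32, 32]          # input dimension and hidden widths

# Random ReLU network parameters (He-style initialization scale).
params, fan_in = [], d
for w in widths:
    params.append((rng.normal(size=(w, fan_in)) * np.sqrt(2 / fan_in), np.zeros(w)))
    fan_in = w

def activation_pattern(x):
    """Binary on/off pattern of every hidden ReLU at input x."""
    pattern = []
    for W, b in params:
        pre = W @ x + b
        pattern.append(pre > 0)
        x = np.maximum(pre, 0)
    return np.concatenate(pattern)

# Count distinct activation regions crossed along a random line segment in input space.
a, direction = rng.normal(size=d), rng.normal(size=d)
ts = np.linspace(-2, 2, 10_000)
patterns = np.array([activation_pattern(a + t * direction) for t in ts])
changes = (patterns[1:] != patterns[:-1]).any(axis=1).sum()
print(f"linear regions crossed along the line: {changes + 1}")
```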
We build a rigorous bridge between deep networks (DNs) and approximation theory via spline functions and operators. Our key result is that a large class of DNs can be written as a composition of max-affine spline operators (MASOs), which provide a powerful portal through which to view and analyze their inner workings. For instance, conditioned on the input signal, the output of a MASO DN can be written as a simple affine transformation of the input. This implies that a DN constructs a set of signal-dependent, class-specific templates against which the signal is compared via a simple inner product; we explore the links to the classical theory of optimal classification via matched filters and the effects of data memorization. We also study the forming of the spline partition of the input signal space that is implicitly induced by a MASO. This provides direct links from DNs to the theory of vector quantization (VQ) and K-means clustering, which opens up a new geometric avenue to study how DNs organize signals in a hierarchical fashion.
In this talk, we show that the family of feedforward neural networks with ReLU activation is equivalent to the family of tropical rational maps. Given this observation, we relate decision boundaries of ReLU-activated networks to tropical hypersurfaces. And we study the expressiveness of these neural networks via the vertices of polytopes associated with tropical rational functions. As an application of this tropical formulation, we reproduce the upper bound on the number of linear regions achievable by deep ReLU neural networks.
In this talk, we introduce exact convex optimization formulations of multilayer neural network training problems with rectified linear units. Our theory leverages semi-infinite duality, gauge functions, and minimum norm regularization. We show that two- and three-layer neural networks can be globally trained via a convex program with a number of variables polynomial in the number of training samples and the number of hidden neurons. To our knowledge, this is the first polynomial-time algorithm that provably optimizes non-trivial neural network architectures in the literature. Our results provide an equivalent characterization of neural network models in terms of a hidden convex optimization landscape with convex regularization, where the data matrix is partitioned over hyperplane arrangements. As a result, non-convex ReLU neural network models can be precisely represented as convex regularization methods, where piecewise linear models are fitted via a group L1 norm penalty. Moreover, we show that certain standard two- and three-layer convolutional neural networks can be globally optimized in fully polynomial time. In particular, our results reveal a hidden regularization mechanism behind convolutional nets and characterize how the architecture and pooling strategies alter the regularizer. Finally, we present numerical simulations showing that standard local search heuristics such as stochastic gradient descent can be dramatically inefficient compared to the proposed convex program.
Statistical manifolds with dually flat structures, such as an exponential family, appear in various machine learning models. In this talk, I will introduce a close connection between dually flat manifolds and incidence algebras in order theory and present its application to machine learning. This approach allows us to flexibly design log-linear models equipped with partially ordered sample spaces, which include a number of machine learning problems such as learning of Boltzmann machines, tensor decomposition, and blind source separation. I will also talk about theoretical analysis of such models using Rissanen's stochastic complexity and draw the connection to the double descent phenomenon via model volumes.
Understanding deep learning calls for addressing the questions of: (i) optimization --- the effectiveness of simple gradient-based algorithms in solving neural network training programs that are non-convex and thus seemingly difficult; and (ii) generalization --- the phenomenon of deep learning models not overfitting despite having many more parameters than examples to learn from. Existing analyses of optimization and/or generalization typically adopt the language of classical learning theory, abstracting away many details on the setting at hand. In this talk I will argue that a more refined perspective is in order, one that accounts for the dynamics of the optimizer. I will then demonstrate a manifestation of this approach, analyzing the dynamics of gradient descent over linear neural networks. We will derive what is, to the best of my knowledge, the most general guarantee to date for efficient convergence to global minimum of a gradient-based algorithm training a deep network. Moreover, in stark contrast to conventional wisdom, we will see that sometimes, adding (redundant) linear layers to a classic linear model significantly accelerates gradient descent, despite the introduction of non-convexity. Finally, we will show that such addition of layers induces an implicit bias towards low rank (different from any type of norm regularization), and by this explain generalization of deep linear neural networks for the classic problem of low rank matrix completion.
Works covered in this talk were in collaboration with Sanjeev Arora, Noah Golowich, Elad Hazan, Wei Hu, Yuping Luo and Noam Razin.
This talk will present two results regarding gradient-based sampling methods. First, I will present upper bounds for the problem of sampling from a distribution p ∝ exp(-U), where the potential function U is smooth everywhere and strongly convex outside a ball of radius R, but potentially nonconvex inside this ball. We study both overdamped and underdamped Langevin MCMC and establish upper bounds on the number of steps required to obtain a sample from a distribution that is within ε of the target distribution in 1-Wasserstein distance. In the second part of the talk, I will discuss our recent work that establishes information-theoretic lower bounds on the iteration complexity of stochastic gradient-based algorithms for sampling from strongly log-concave densities. This is joint work with Yasin Abbasi-Yadkori, Peter Bartlett, Xiang Cheng, Michael Jordan and Philip Long.
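For reference, a minimal sketch of the overdamped Langevin MCMC (unadjusted Langevin algorithm) mentioned above, applied to a hand-picked radial potential that is nonconvex near the origin and convex and rapidly growing far away; the potential, step size, and run length are my own illustrative choices, not those analyzed in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# U(x) = (||x||^2 - 4)^2 / 16 is nonconvex near the origin but convex and
# rapidly growing far from it; exp(-U) concentrates near the ring ||x|| = 2.
def grad_U(x):
    return (x @ x - 4) * x / 4

d, h, n_steps = 2, 0.01, 50_000
x = np.zeros(d)
samples = np.empty((n_steps, d))
for k in range(n_steps):
    # One step of overdamped Langevin MCMC: x <- x - h * grad U(x) + sqrt(2h) * N(0, I).
    x = x - h * grad_U(x) + np.sqrt(2 * h) * rng.normal(size=d)
    samples[k] = x

radii = np.linalg.norm(samples[n_steps // 2:], axis=1)   # discard burn-in
print(f"mean radius of samples: {radii.mean():.2f} (target mass concentrates near ||x|| = 2)")
```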
Deep Affine Normalizing Flows are efficient and powerful models for high-dimensional density estimation and sample generation. Yet little is known about how they succeed in approximating complex distributions, given the seemingly limited expressiveness of individual affine layers.
In this talk, I will present the framework of Normalizing Flows for density estimation and show several recent applications like inverse problems and generative classification. Then, we take a step towards theoretical understanding by analyzing the behaviour of a single affine coupling layer under maximum likelihood loss. Such a layer estimates and normalizes conditional moments of the data distribution. One can derive a tight lower bound on the loss depending on the orthogonal transformation of the data before the affine coupling. This bound can be used to identify the optimal orthogonal transform, yielding a layer-wise training algorithm for deep affine flows.
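For readers unfamiliar with the building block being analyzed, here is a minimal NumPy sketch of a single affine coupling layer (my own illustration; the tiny tanh networks for the scale s and shift t are arbitrary stand-ins): the layer is exactly invertible and its log-determinant is simply the sum of the scale outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

class AffineCoupling:
    """One affine coupling layer: y1 = x1, y2 = x2 * exp(s(x1)) + t(x1)."""
    def __init__(self, d_in, d_out, hidden=16):
        self.W1 = rng.normal(size=(hidden, d_in)) * 0.1
        self.Ws = rng.normal(size=(d_out, hidden)) * 0.1
        self.Wt = rng.normal(size=(d_out, hidden)) * 0.1

    def _nets(self, x1):
        h = np.tanh(x1 @ self.W1.T)
        return h @ self.Ws.T, h @ self.Wt.T              # s(x1), t(x1)

    def forward(self, x):
        x1, x2 = np.split(x, 2, axis=1)
        s, t = self._nets(x1)
        y2 = x2 * np.exp(s) + t
        log_det = s.sum(axis=1)                          # log |det Jacobian| per sample
        return np.concatenate([x1, y2], axis=1), log_det

    def inverse(self, y):
        y1, y2 = np.split(y, 2, axis=1)
        s, t = self._nets(y1)
        return np.concatenate([y1, (y2 - t) * np.exp(-s)], axis=1)

layer = AffineCoupling(d_in=2, d_out=2)
x = rng.normal(size=(5, 4))
y, log_det = layer.forward(x)
print(np.allclose(layer.inverse(y), x), log_det)         # exact invertibility + tractable log-det
```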
We initiate the study of a new set of empirical properties of interpolating classifiers, including neural networks, kernel machines, and decision trees. These properties constitute a new type of generalization, which extends the classical notion of generalization. This "Distribution Generalization" property roughly states that the test and train outputs of interpolating classifiers are close *as distributions*, as opposed to close in just their error. This captures, for example, the following behavior: if we mislabel 30 percent of dogs as cats in the train set of CIFAR-10, then a ResNet trained to zero train error will in fact mislabel roughly 30 percent of dogs as cats on the test set as well, while leaving other classes unaffected. We generalize this observation via several formal conjectures, which we test across realistic domains. These properties form a more fine-grained understanding of interpolating classifiers, beyond just their test error, and lay out an alternative framework for reasoning about their generalization.
Joint work with Yamini Bansal.
Modern deep learning has popularized the use of very large neural networks, but the theoretical tools to study such networks are still lacking. The Neural Tangent Kernel (NTK) describes how the output neurons evolve during training, allowing a precise description of the convergence and generalization of DNNs. In the infinite-width limit (when the number of hidden neurons grows to infinity), the NTK converges to a deterministic and fixed limit, ensuring convergence to a global minimum. In particular, training a DNN with the MSE loss corresponds to doing Kernel Ridge Regression (KRR) with the NTK. Under the assumption of i.i.d. input samples, the risk of KRR can be approximated using the Signal Capture Threshold (SCT), which identifies which principal components of the signal are learned. A further approximation leads to the Kernel Alignment Risk Estimator (KARE), which predicts the test error of KRR from the train data only.
We highlight a property of feedforward neural networks trained with SGD for classification tasks: an SGD update for a misclassified data point decreases the Frobenius norm of the weight matrix of each layer. This holds for networks of any architecture with activations such as ReLU and Leaky ReLU, and it may explain why in practice the performance of neural networks can be similar with or without regularization. We then use this insight to study cases in which a ReLU network is presented with random or noisy labels. On the one hand, using completely random labels eliminates neurons and can nullify the network completely. On the other hand, with some label noise, we observe cases with considerable improvement in performance; that is, some label noise can decrease the test error of a network in an artificial example. Also, in an experiment with the MNIST dataset, we show that training with noisy labels performs comparably to standard training and yields a sparse over-complete basis for MNIST.
In this talk, we propose an analysis of gradient descent on wide two-layer ReLU neural networks that leads to sharp characterizations of the learned predictor and strong generalization performances. The main idea is to study the dynamics when the width of the hidden layer goes to infinity, which is a Wasserstein gradient flow. While this dynamics evolves on a non-convex landscape, we show that its limit is a global minimizer if initialized properly. We also study the "implicit bias" of this algorithm when the objective is the unregularized logistic loss. We finally discuss what these results tell us on the generalization performance.
This is based on joint work with Francis Bach.
I will talk about a result on the equivalence between over-parameterized neural networks and a new kernel, the Neural Tangent Kernel. This equivalence implies two surprising phenomena: 1) simple gradient descent provably finds the global optimum of the highly non-convex empirical risk, and 2) the learned neural network generalizes well despite being highly over-parameterized. I will also present empirical results showing that the Neural Tangent Kernel is a strong predictor.
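A minimal sketch of the empirical version of this kernel for a two-layer ReLU network (my own illustration; the width and the 1/sqrt(m) scaling follow the usual NTK-style parameterization): the kernel entry for two inputs is the inner product of the parameter gradients of the network output.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 5, 2000, 8                      # input dim, width, number of inputs

# Two-layer net f(x) = (1/sqrt(m)) * a^T relu(W x).
W = rng.normal(size=(m, d))
a = rng.normal(size=m)
X = rng.normal(size=(n, d))

def param_gradient(x):
    """Gradient of f(x) with respect to all parameters (W and a), flattened."""
    pre = W @ x
    act = np.maximum(pre, 0)
    grad_a = act / np.sqrt(m)
    grad_W = np.outer(a * (pre > 0), x) / np.sqrt(m)
    return np.concatenate([grad_W.ravel(), grad_a])

# Empirical NTK: K_ij = <grad_theta f(x_i), grad_theta f(x_j)>.
J = np.stack([param_gradient(x) for x in X])
K = J @ J.T
print("empirical NTK (first row):", np.round(K[0], 3))
print("kernel is PSD:", bool(np.all(np.linalg.eigvalsh(K) > -1e-8)))
```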
Contrary to classical bias/variance trade-offs, deep learning practitioners have observed that vastly overparameterized neural networks with the capacity to fit virtually any labels nevertheless generalize well when trained on real data. One possible explanation of this phenomenon is that complexity control is being achieved by implicitly or explicitly controlling the magnitude of the weights of the network. This raises the question: What functions are well-approximated by neural networks whose weights are bounded in norm? In this talk, I will give some partial answers to this question. In particular, I give a precise characterization of the space of functions realizable as a two-layer (i.e., one hidden layer) neural network with ReLU activations having an unbounded number of units, but where the Euclidean norm of the weights in the network remains bounded. Surprisingly, this characterization is naturally posed in terms of the Radon transform as used in computational imaging, and I will show how tools from Radon transform analysis yield novel insights about learning with two and three-layer ReLU networks.
Reducing the memory footprint of machine learning models is an important goal, in order to make them amenable for embedded systems, mobile applications, and edge computing, as well as reducing their energy consumption. The classical approach is a typical pruning-quantization-coding pipeline, where pruning and quantization can be seen as heuristics to reduce the entropy of a deterministic weight vector, and for coding, Shannon-style schemes are used. In this talk, I present our recent work on a novel coding scheme -- Minimal Random Code Learning (MIRACLE) -- based on a variational approach and the classical bits-back argument. Rather than interpreting the model weights as a deterministic sequence, we devise an algorithm which draws a sample from the trained variational distribution, whose coding length directly corresponds to the Kullback-Leibler term in the variational objective. This allows us to explicitly control the compression rate, while optimizing the expected loss on the training set. Our method sets new state-of-the-art in neural network compression, as it strictly dominates previous approaches in a Pareto sense: On the benchmarks LeNet-5/MNIST and VGG-16/CIFAR-10, our approach yields the best test performance for a fixed memory budget, and vice versa, it achieves the highest compression rates for a fixed test performance.
Solomonoff's general theory of inference and the Minimum Description Length principle formalize Occam's razor, and hold that a good model of data is a model that is good at losslessly compressing the data, including the cost of describing the model itself. This theory gives a very interesting viewpoint on many known results in statistics and machine learning. But the success of Deep Learning seems to go against this theory: while deep neural networks are often the best models in practice, they are also extremely complex, in the sense that they are hard to compress. We solve this paradox and demonstrate experimentally the ability of deep neural networks to compress the training data even when accounting for parameter encoding.
In this talk, I will introduce a phenomenon called "the spectral bias", which shows that even though neural networks are quite capable of fitting random target functions with perfect performance, they are biased towards learning the "simpler" components of the target function earlier in training. Precisely, we inspect the learning process through the lens of Fourier analysis to find that the lower frequencies of the target function are learned first, even when they are underrepresented in the spectrum of the latter. This observation leads to a clearer picture of the kinds of label noise neural networks are particularly vulnerable to, and of why early stopping is as effective as it is. Finally, we also look at the curious interplay between spectral bias and the shape of the data manifold, which might explain the unreasonable effectiveness of positional embeddings (also known as Fourier features) in the implicit representation learning literature (i.e., Neural Radiance Fields (NeRF) and friends).
The universal approximation theorem says that a standard feedforward neural network with one hidden layer is able to uniformly approximate any continuous multivariate function f to any given approximation threshold ε, provided that the activation function is continuous and non-polynomial. In this talk, we shall give a quantitative refinement of the theorem via a direct algebraic approach. In particular, when f is polynomial, we give an explicit finite upper bound (independent of ε) for the number of hidden units required. We shall discuss how ideas from algebraic geometry, algebraic combinatorics and approximation theory are combined in our algebraic proof.
Cross-validation is the most widely used method for risk estimation in machine learning and statistics. However, analyzing it and comparing it to the data-splitting estimator has proved difficult. In the first part of the talk, I will present a new analysis which characterizes the exact asymptotics of cross-validation in the form of a central limit theorem for estimators which satisfy certain stability conditions. In particular, parametric estimators automatically satisfy these conditions, and the theorems characterize the cross-validated risk for such estimators fully. I will demonstrate that they exhibit a wide variety of behaviours: in the case of a parametric empirical risk minimizer, the folds behave as if independent if the evaluation loss is the same as the training loss. However, if a surrogate loss is used, different behaviours may occur. In the second part, I will move on to discuss issues which arise when using cross-validation for high-dimensional estimators: in the regime where the number of parameters is comparable to the number of observations, cross-validation (and data splitting) may introduce serious bias in the estimate of the risk when the amount of data left out is high (i.e., the number of folds is low). A natural approach may thus be to alleviate this problem by leaving out as little data as possible: a single observation, leading to leave-one-out cross-validation (LOOCV). I will show that indeed such a result holds, and the LOOCV estimator is consistent in the high-dimensional asymptotic. Unfortunately, the LOOCV estimator is computationally prohibitive and cannot be used in practice as-is. Finally, I will discuss a general framework, approximate LOOCV, from which closed-form approximate estimators can be derived for penalized GLMs, including non-smooth ones such as the LASSO or SVMs.
References:
Asymptotics of cross-validation. M. Austern and WZ, preprint. https://arxiv.org/abs/2001.11111
Error bounds in estimating the out-of-sample prediction error using leave-one-out cross validation in high-dimensions. KR Rad, WZ, A. Maleki, AISTATS 2020. https://arxiv.org/abs/2003.01770
Approximate Leave-One-Out for Fast Parameter Tuning in High Dimensions. S. Wang, WZ, H. Lu, V. Mirrokni, A. Maleki, ICML 2018. https://arxiv.org/abs/1810.02716
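As a concrete instance of the leave-one-out shortcut that motivates approximate LOOCV, here is the classical closed-form identity for ridge regression, the smooth linear special case; the non-smooth penalized models in the talk (LASSO, SVMs) require the more general framework. The data and penalty below are arbitrary illustrative choices.

```python
import numpy as np

# Exact leave-one-out residuals for ridge regression via the hat matrix,
# checked against brute-force refitting.
rng = np.random.default_rng(0)
n, p, lam = 200, 50, 1.0
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)    # hat matrix
resid = y - H @ y
loo_shortcut = resid / (1 - np.diag(H))                    # LOO residuals, no refitting

# Brute force for comparison: refit with each observation removed.
loo_exact = np.empty(n)
for i in range(n):
    mask = np.arange(n) != i
    w = np.linalg.solve(X[mask].T @ X[mask] + lam * np.eye(p), X[mask].T @ y[mask])
    loo_exact[i] = y[i] - X[i] @ w

print(np.max(np.abs(loo_shortcut - loo_exact)))            # agreement up to round-off
```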
Not all neural network architectures are created equal: some perform much better than others for certain tasks. But how important are the weight parameters of a neural network compared to its architecture? In this work, we question to what extent neural network architectures alone, without learning any weight parameters, can encode solutions for a given task. We propose a search method for neural network architectures that can already perform a task without any explicit weight training. To evaluate these networks, we populate the connections with a single shared weight parameter sampled from a uniform random distribution, and measure the expected performance. We demonstrate that our method can find minimal neural network architectures that can perform several reinforcement learning tasks without weight training. On a supervised learning domain, we find network architectures that achieve much higher than chance accuracy on MNIST using random weights.
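The evaluation protocol can be sketched as follows: fix a topology, set every connection to the same shared weight, and score the architecture by its accuracy averaged over several sampled weight values. The task and the hand-picked topology below are hypothetical stand-ins; the actual method searches over topologies and per-node activation functions.

```python
import numpy as np

# Weight-agnostic evaluation sketch on a toy task.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 2))
y = (np.abs(X[:, 0]) + np.abs(X[:, 1]) > 1.0).astype(float)

def forward(X, w):
    # Three abs-activation hidden nodes (two inputs plus a constant bias input)
    # feeding a linear output node; every connection carries the same weight w.
    h1 = np.abs(w * X[:, 0])
    h2 = np.abs(w * X[:, 1])
    h3 = np.abs(w * 1.0)
    return w * h1 + w * h2 - w * h3

def score(X, y, weight_samples=(-2.0, -1.0, -0.5, 0.5, 1.0, 2.0)):
    accs = []
    for w in weight_samples:
        pred = (forward(X, w) > 0).astype(float)
        accs.append((pred == y).mean())
    return accs, float(np.mean(accs))

per_weight, mean_acc = score(X, y)
print(per_weight, mean_acc)
```

On this toy topology the accuracy is perfect for positive shared weights but collapses for negative ones; the architecture search described above is precisely what is needed to find topologies whose score stays high across the whole sampling range.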
What are the fundamental quantities to understand the learning process of a deep neural network? Why are some datasets easier than others? What does it mean for two tasks to have a similar structure? We argue that information theoretic quantities, and in particular the amount of information that SGD stores in the weights, can be used to characterize the training process of a deep network. In fact, we show that the information in the weights bounds the generalization error and the invariance of the learned representation. It also allows us to connect the learning dynamics with the "structure function" of the dataset, and to define a notion of distance between tasks, which relates to fine-tuning. The non-trivial dynamics of information during training give rise to phenomena, such as critical periods for learning, that closely mimic those observed in humans and may suggest that forgetting information about the training data is a necessary part of the learning process.
Algorithmic regularization provides deep learning models with capacity control that helps them generalize. In this talk, we focus on understanding the capacity control induced by dropout training in various machine learning models, including deep linear networks, matrix sensing, and two-layer ReLU networks. In particular, by characterizing the regularizer induced by dropout training, we give concrete generalization error bounds for dropout training in these models.
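For intuition on what an induced regularizer looks like, the simplest case is linear regression, where the expected dropout objective decomposes into the original loss plus an explicit data-dependent ridge-type penalty. The sketch below checks this identity numerically; the talk's results concern the deeper and nonlinear models listed above, not this special case.

```python
import numpy as np

# Inverted dropout at rate p on the features of a linear model satisfies
#   E_mask || y - (X * mask/(1-p)) w ||^2
#     = || y - X w ||^2 + (p/(1-p)) * sum_j (sum_i X_ij^2) w_j^2.
rng = np.random.default_rng(0)
n, d, p = 500, 10, 0.3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)

# Monte-Carlo estimate of the expected dropout loss.
mc, n_samples = 0.0, 2000
for _ in range(n_samples):
    mask = rng.binomial(1, 1 - p, size=(n, d)) / (1 - p)
    mc += np.sum((y - (X * mask) @ w) ** 2)
mc /= n_samples

# Closed form: original loss + induced regularizer.
closed = np.sum((y - X @ w) ** 2) + (p / (1 - p)) * np.sum((X ** 2) @ (w ** 2))

print(mc, closed)   # should agree up to Monte-Carlo error
```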
We study the convergence of gradient flows related to learning deep linear neural networks (where the activation function is the identity map) from data. In this case, the composition of the network layers amounts to simply multiplying the weight matrices of all layers together, resulting in an overparameterized problem. The gradient flow with respect to these factors can (for suitable initializations) be re-interpreted as a Riemannian gradient flow on the manifold of rank-r matrices endowed with a suitable Riemannian metric. We show that the flow always converges to a critical point of the underlying functional. Moreover, we establish that, for almost all initializations, the flow converges to a global minimum on the manifold of rank-k matrices for some k less than or equal to r.
This is joint work with Bubacarr Bah, Holger Rauhut, and Michael Westdickenberg.
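A small numerical illustration of the object studied in the preceding abstract: discretized gradient flow on a two-layer factorization, a toy stand-in for the general deep linear setting. The dimensions, data, and step size are arbitrary illustrative choices.

```python
import numpy as np

# Gradient descent on L(W1, W2) = ||W2 W1 - A||_F^2 for a two-layer linear network
# with inner width r. The end-to-end map W2 W1 lives in the set of matrices of
# rank at most r; with a small step size this discretizes the gradient flow.
rng = np.random.default_rng(0)
m, n, r = 8, 6, 2
A = rng.normal(size=(m, n))

W1 = 0.1 * rng.normal(size=(r, n))
W2 = 0.1 * rng.normal(size=(m, r))
lr = 0.01
for _ in range(20000):
    E = W2 @ W1 - A
    g1 = 2 * W2.T @ E
    g2 = 2 * E @ W1.T
    W1 -= lr * g1
    W2 -= lr * g2

# Compare with the best rank-r approximation of A (truncated SVD).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_r = (U[:, :r] * s[:r]) @ Vt[:r]
print(np.linalg.norm(W2 @ W1 - A_r), np.linalg.norm(W2 @ W1 - A))
```

For generic data and initialization the product typically ends up at the best rank-r approximation, in line with the convergence results above.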
We recently proposed the "Lottery Ticket Hypothesis", which conjectures that the dense neural networks we typically train have much smaller subnetworks capable of training in isolation to the same accuracy starting from the original initialization. This hypothesis raises questions about the nature of overparameterization and the importance of initialization for training neural networks in practice. In this talk, I will discuss existing work and the latest developments on the "Lottery Ticket Hypothesis," including the empirical evidence for these claims on small vision tasks, changes necessary to scale these ideas to ImageNet, and the relationship between these subnetworks and their "stability" to the noise of stochastic gradient descent. This research is entirely empirical, although it has exciting implications for theory. (This is joint work with Gintare Karolina Dziugaite, Daniel M. Roy, Alex Renda, and Michael Carbin.)
Bio: Jonathan Frankle is a fourth year PhD student at MIT, where he studies empirical deep learning. His current research focus is on the properties of sparse neural networks that allow them to train effectively as embodied by his proposed "Lottery Ticket Hypothesis" (ICLR 2019 best paper award). Jonathan also has an interest in technology policy: he works closely with lawyers, journalists, and policymakers on topics in AI policy, and he teaches at the Georgetown University Law Center. He earned his BSE and MSE in computer science at Princeton University and has previously spent time at Facebook AI Research, Google Brain, and Microsoft.
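A minimal one-shot version of the pruning-and-rewinding experiment behind the hypothesis might look as follows. The toy model and synthetic data are hypothetical; the actual studies use iterative pruning, vision datasets, and careful rewinding schedules.

```python
import copy
import torch
import torch.nn as nn

# One-shot "lottery ticket" sketch: train a small MLP, prune the smallest-magnitude
# weights, rewind the surviving weights to their ORIGINAL initialization, and retrain
# the sparse subnetwork.
torch.manual_seed(0)
X = torch.randn(2000, 20)
y = (X[:, :5].sum(dim=1) > 0).long()

def make_model():
    return nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))

def train(model, masks=None, epochs=200):
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()
        if masks is not None:                       # keep pruned weights at zero
            with torch.no_grad():
                for name, p in model.named_parameters():
                    if name in masks:
                        p *= masks[name]
    return (model(X).argmax(dim=1) == y).float().mean().item()

model = make_model()
init_state = copy.deepcopy(model.state_dict())      # save the original initialization
dense_acc = train(model)

# Prune the 80% smallest-magnitude entries in each weight matrix.
masks = {}
for name, p in model.named_parameters():
    if p.dim() == 2:
        k = int(0.2 * p.numel())
        threshold = p.abs().flatten().topk(k).values.min()
        masks[name] = (p.abs() >= threshold).float()

# Rewind surviving weights to their initial values and retrain the subnetwork.
model.load_state_dict(init_state)
with torch.no_grad():
    for name, p in model.named_parameters():
        if name in masks:
            p *= masks[name]
ticket_acc = train(model, masks=masks)
print(dense_acc, ticket_acc)
```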
Neural network training is usually accomplished by solving a non-convex optimization problem using stochastic gradient descent. Although one optimizes over the network's parameters, the loss function (up to regularization terms) generally depends only on the realization of the neural network, i.e. the function it computes. We discuss how studying the optimization problem over the space of realizations may open up new ways to understand neural network training, provided one manages to overcome the difficulties caused by the redundancies and degeneracies of how neural networks are parametrized.
In this talk, we establish a rate of convergence to minima for the stochastic gradient descent method in the case of an objective function that is not necessarily globally, or even locally, convex, nor globally attracting. The analysis therefore relies on the use of mini-batches in a quantitative way to control the loss of iterates to non-attracting regions. Furthermore, we do not assume that the critical points of the objective function are nondegenerate, which allows us to treat the types of degeneracies observed in practice in the optimization of certain neural networks.
A fundamental goal in the theory of deep learning is to explain why the optimization of the loss function of a neural network does not seem to be affected by the presence of non-global local minima. Even in the case of linear networks, most of the existing literature paints a purely analytical picture of the loss, and provides no explanation as to *why* such architectures exhibit no bad local minima. We explain the intrinsic geometric reasons for this behavior of linear networks.
For neural networks in general, we discuss the neuromanifold, i.e., the space of functions parameterized by a network with a fixed architecture. For instance, the neuromanifold of a linear network is a determinantal variety, a classical object of study in algebraic geometry. We introduce a natural distinction between pure critical points, which only depend on the neuromanifold, and spurious critical points, which arise from the parameterization.
This talk is based on joint work with Matthew Trager and Joan Bruna.
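For the linear case mentioned in the preceding abstract, the neuromanifold can be written explicitly; the widths below are generic placeholders.

```latex
% Neuromanifold of a linear network x \mapsto W_L \cdots W_1 x with layer widths
% d_0, d_1, \ldots, d_L: the set of end-to-end maps it can represent is the
% determinantal variety of matrices of rank at most the smallest width.
\[
  \mathcal{M}
  \;=\; \bigl\{\, W_L W_{L-1} \cdots W_1 \;:\; W_i \in \mathbb{R}^{d_i \times d_{i-1}} \,\bigr\}
  \;=\; \bigl\{\, M \in \mathbb{R}^{d_L \times d_0} \;:\; \operatorname{rank}(M) \le \min_i d_i \,\bigr\}.
\]
```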
The possible properties of the loss function, and how they might change with the size of the network, are among the fundamental questions one could study to improve our understanding of deep learning. In this talk, I will present some results that we have obtained in this direction, including a topological analysis of sub-level sets and critical points with global optimality, and then show how to leverage this insight to obtain a convergence result for gradient descent.
Recently, progress has been made in the application of neural networks to the numerical analysis of partial differential equations (PDEs). There, the variational formulation of the Poisson problem is used to obtain an objective function - a regularised Dirichlet energy - which is then minimised over a class of neural networks. In this work, we use the notion of Γ-convergence to show that ReLU networks of growing architecture that are trained with respect to suitably regularised Dirichlet energies converge to the true solution of the Poisson problem. We discuss how this approach generalises to arbitrary variational problems under certain universality assumptions on the neural networks, and see that this covers some nonlinear stationary PDEs such as the p-Laplace equation.
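A minimal version of such a training objective, for the 1D Poisson problem -u'' = f on (0,1) with zero boundary values and a boundary penalty as the regularisation, is sketched below. The network size, penalty weight, and Monte-Carlo discretization are hypothetical choices; the Γ-convergence analysis in the talk concerns sequences of growing ReLU architectures.

```python
import torch
import torch.nn as nn

# Regularised Dirichlet energy for -u'' = f on (0,1), u(0) = u(1) = 0,
# with f = pi^2 sin(pi x), whose exact solution is u(x) = sin(pi x):
#   E(u) = ∫ (1/2)|u'|^2 - f u dx + lambda * (u(0)^2 + u(1)^2).
torch.manual_seed(0)
net = nn.Sequential(nn.Linear(1, 64), nn.ReLU(),
                    nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
lam = 100.0

def f(x):
    return (torch.pi ** 2) * torch.sin(torch.pi * x)

for step in range(5000):
    x = torch.rand(256, 1, requires_grad=True)            # Monte-Carlo quadrature points
    u = net(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    energy = (0.5 * du ** 2 - f(x) * u).mean()
    boundary = net(torch.zeros(1, 1)) ** 2 + net(torch.ones(1, 1)) ** 2
    loss = energy + lam * boundary.sum()
    opt.zero_grad(); loss.backward(); opt.step()

# Compare with the exact solution at a few points.
x_test = torch.linspace(0, 1, 5).unsqueeze(1)
print(torch.cat([net(x_test), torch.sin(torch.pi * x_test)], dim=1))
```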
When Wasserstein Generative Adversarial Networks (WGANs) were introduced, optimal transport emerged as a toolkit for devising loss functions for learning probabilistic models in machine learning. The 1-Wasserstein distance, in particular, is a popular choice for such a loss function, expressed in its dual Kantorovich form, where one is required to variationally maximize the sum of expectations over data of two constrained potential functions. Different heuristics are applied to enforce the 1-Lipschitz constraint, such as the weight clipping or gradient penalty methods. In this talk, we wish to address how well these methods estimate the 1-Wasserstein distance, and in which directions we could look to perhaps do better.
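To fix notation, the quantity being estimated and the gradient-penalty heuristic look roughly as follows on a 1D toy example with two Gaussians, for which the true 1-Wasserstein distance is simply the difference of the means. The critic architecture and penalty weight are arbitrary illustrative choices, and the final estimate is only expected to land near the true value up to optimization and penalty bias.

```python
import torch
import torch.nn as nn

# Estimate W_1(P, Q) via the Kantorovich dual
#   W_1(P, Q) = sup_{||f||_Lip <= 1}  E_P[f(X)] - E_Q[f(Y)],
# enforcing the Lipschitz constraint with a gradient penalty at interpolated points.
# Toy example: P = N(0,1), Q = N(2,1), for which W_1 = 2 exactly.
torch.manual_seed(0)
critic = nn.Sequential(nn.Linear(1, 64), nn.Tanh(),
                       nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
gp_weight = 10.0

for step in range(3000):
    x = torch.randn(512, 1)                     # samples from P
    y = torch.randn(512, 1) + 2.0               # samples from Q
    t = torch.rand(512, 1)
    z = (t * x + (1 - t) * y).requires_grad_(True)        # interpolated points
    grad_z = torch.autograd.grad(critic(z).sum(), z, create_graph=True)[0]
    penalty = ((grad_z.norm(dim=1) - 1.0) ** 2).mean()
    dual_gap = critic(x).mean() - critic(y).mean()
    loss = -dual_gap + gp_weight * penalty      # maximize the dual objective
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():
    x = torch.randn(100000, 1)
    y = torch.randn(100000, 1) + 2.0
    print((critic(x).mean() - critic(y).mean()).item())   # estimate of W_1, true value 2
```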
Many machine learning problems can be expressed as the optimization of some cost functional over a parametric family of probability distributions. It is often beneficial to solve such optimization problems using natural gradient methods. These methods are invariant to the parametrization of the family, and thus can yield more effective optimization. Unfortunately, computing the natural gradient is challenging, as it requires inverting a high-dimensional matrix at each iteration. We propose a general framework to approximate the natural gradient for the Wasserstein metric by leveraging a dual formulation of the metric restricted to a Reproducing Kernel Hilbert Space. Our approach leads to an estimator of the gradient direction that can trade off accuracy and computational cost, with theoretical guarantees. We verify its accuracy on simple examples, and empirically show the advantage of using such an estimator in classification tasks on CIFAR-10 and CIFAR-100.
https://arxiv.org/abs/1910.09652