Graph Neural Networks (GNNs) have become a popular tool for learning representations of graph-structured inputs, with applications in computational chemistry, recommendation, pharmacy, reasoning, and many other areas. This talk will show recent results on representational power and learning in GNNs. First, we will address representational power and important limitations of popular message passing networks and of some recent extensions of these models. Second, we consider learning, and provide new generalization bounds for GNNs. Third, although many networks may be able to represent a task, some architectures learn it better than others. I will show results that connect the architectural structure and learning behavior, in and out of the training distribution.
This talk is based on joint work with Keyulu Xu, Behrooz Tahmasebi, Jingling Li, Mozhi Zhang, Simon S. Du, Ken-ichi Kawarabayashi, Vikas Garg and Tommi Jaakkola.
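As a rough illustration of the message passing networks discussed above, here is a minimal numpy sketch of a single layer with sum aggregation; the graph, weights, and update rule are illustrative choices, not the specific architectures analyzed in the talk.

```python
# Minimal sketch of one message-passing step with sum aggregation (illustrative only,
# not the exact models from the talk). Node features are updated by aggregating
# neighbor features and applying a small node-wise MLP.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def message_passing_layer(H, A, W1, W2):
    """H: (n, d) node features; A: (n, n) 0/1 adjacency matrix; W1, W2: MLP weights."""
    M = (A + np.eye(A.shape[0])) @ H      # sum messages from neighbors and self
    return relu(M @ W1) @ W2              # node-wise two-layer update

# Tiny example: a 4-node cycle graph with random features (toy data).
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
H = rng.normal(size=(4, 3))
W1 = rng.normal(size=(3, 8))
W2 = rng.normal(size=(8, 3))
print(message_passing_layer(H, A, W1, W2).shape)   # (4, 3): updated node features
```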
In classical General Relativity, the values of fields on spacetime are uniquely determined by their values on an initial time slice within its `domain of dependence'. However, it may occur that the spacetime under consideration extends beyond this domain, and fields, therefore, are not entirely determined by their initial data. In fact, such a naive failure of determinism occurs inside all rotating or charged black holes.
The boundary of the region determined by the initial data is called the `Cauchy horizon'. The famous `strong cosmic censorship conjecture' asserts that the Cauchy horizon does not, in fact, exist in practice, because the slightest perturbation (of the geometry itself or of the matter fields) will become singular there, converting the Cauchy horizon into a final singularity.
Recently, however, it has been found that classically this is not the case inside certain charged black holes in de Sitter space, provided the mass, charge, and cosmological constant lie in a certain regime. In this colloquium I ask whether quantum effects will come to the rescue of strong cosmic censorship.
The complexity of natural, artificial, and social systems can be efficiently studied using concepts of statistical mechanics, especially its modern version, nonextensive statistical mechanics, which is based on nonadditive entropies generalizing the celebrated Boltzmann-Gibbs-Shannon-von Neumann entropy. A brief introduction to the physical motivations of this generalization, its connections with generalized forms of the Central Limit Theorem and of Large Deviation Theory, and some illustrative applications in various areas of knowledge will be presented. An updated bibliography is available at http://tsallis.cat.cbpf.br/biblio.htm
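As a small concrete illustration of the nonadditive entropies mentioned above, the following sketch evaluates the q-entropy $S_q = (1-\sum_i p_i^q)/(q-1)$ (with $k=1$) together with its $q \to 1$ Boltzmann-Gibbs-Shannon limit; the example distribution is illustrative.

```python
# Minimal sketch: the nonadditive (Tsallis) q-entropy and its q -> 1 limit,
# which recovers the Boltzmann-Gibbs-Shannon entropy (units with k = 1).
import numpy as np

def tsallis_entropy(p, q):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if np.isclose(q, 1.0):
        return -np.sum(p * np.log(p))            # Shannon limit
    return (1.0 - np.sum(p ** q)) / (q - 1.0)    # S_q = (1 - sum_i p_i^q) / (q - 1)

p = np.ones(8) / 8                   # uniform distribution over 8 states (toy example)
print(tsallis_entropy(p, 1.0))       # ln 8 ≈ 2.079
print(tsallis_entropy(p, 2.0))       # 1 - 1/8 = 0.875

# Nonadditivity for independent systems A and B:
# S_q(A+B) = S_q(A) + S_q(B) + (1 - q) S_q(A) S_q(B).
```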
Training a neural network is a non-convex problem that exhibits spurious and disconnected local minima. Yet, in practice neural networks with millions of parameters are successfully optimized using gradient descent methods. In this talk, I will give some theoretical insights on why this is possible. First, I will show that the combination of stochastic gradient descent and over-parameterization makes the landscape of deep networks approximately connected and, therefore, more favorable to optimization. Then, I will focus on a special case (two-layer network fitting a convex function) and provide a quantitative convergence result by exploiting the displacement convexity of a related Wasserstein gradient flow. Finally, I will go back to deep networks and show that a single wide layer followed by a pyramidal topology suffices to guarantee the global convergence of gradient descent.
[Based on joint work with Adel Javanmard, Andrea Montanari, Quynh Nguyen, and Alexander Shevchenko]
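A minimal sketch of the special case mentioned above, under my own illustrative choices (target, width, step size): an over-parameterized two-layer ReLU network fitting a convex function with stochastic gradient descent.

```python
# Toy sketch (not the paper's construction): SGD on an over-parameterized
# two-layer ReLU network fitting the convex target y = |x| in one dimension.
import numpy as np

rng = np.random.default_rng(0)
n, width = 200, 256                              # more parameters than samples
X = rng.uniform(-1, 1, size=(n, 1))
y = np.abs(X[:, 0])                              # convex target function

W = rng.normal(size=(1, width))                  # first-layer weights
a = rng.normal(scale=1.0 / np.sqrt(width), size=width)   # output weights

lr, batch = 0.01, 32
for step in range(5000):
    idx = rng.choice(n, size=batch, replace=False)
    Z = X[idx] @ W                               # pre-activations, (batch, width)
    H = np.maximum(Z, 0.0)                       # ReLU features
    err = H @ a - y[idx]
    grad_a = H.T @ err / batch                   # gradient of the mean-squared error
    grad_W = X[idx].T @ ((err[:, None] * a) * (Z > 0)) / batch
    a -= lr * grad_a
    W -= lr * grad_W

print(np.mean((np.maximum(X @ W, 0.0) @ a - y) ** 2))   # final training MSE
```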
We study a class of interacting particle systems that may be used for optimization. By considering the mean-field limit one obtains a nonlinear Fokker-Planck equation. This equation exhibits a gradient structure in probability space, based on a modified Wasserstein distance which reflects particle correlations: the Kalman-Wasserstein metric. This setting gives rise to a methodology for calibrating and quantifying uncertainty for parameters appearing in complex computer models that are expensive to run and cannot readily be differentiated. This is achieved by connecting the interacting particle system to ensemble Kalman methods for inverse problems. This is joint work with Alfredo Garbuno-Inigo (Caltech), Wuchen Li (UCLA) and Andrew Stuart (Caltech).
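As an illustration of the connection to ensemble Kalman methods, here is a minimal derivative-free ensemble Kalman update for a toy inverse problem $y = G(u) + \eta$; the forward map, noise level, and ensemble size are illustrative choices, not those of the work described above.

```python
# Toy sketch of an ensemble Kalman iteration for y = G(u) + noise: only forward-map
# evaluations are used, no derivatives of G (illustrative example, my own choices).
import numpy as np

rng = np.random.default_rng(0)

def G(u):                                        # toy nonlinear forward map R^2 -> R^2
    return np.array([u[0] + u[1], u[1] ** 3 + u[1]])

u_true = np.array([1.0, 0.5])
Gamma = 0.01 * np.eye(2)                         # observational noise covariance
y = G(u_true) + rng.multivariate_normal(np.zeros(2), Gamma)

J = 100                                          # ensemble size
U = rng.normal(size=(J, 2))                      # initial ensemble of parameters

for _ in range(30):
    GU = np.array([G(u) for u in U])             # forward-map evaluations only
    du = U - U.mean(axis=0)
    dg = GU - GU.mean(axis=0)
    C_ug = du.T @ dg / J                         # cross-covariance cov(u, G(u))
    C_gg = dg.T @ dg / J                         # covariance cov(G(u), G(u))
    K = C_ug @ np.linalg.inv(C_gg + Gamma)       # Kalman-type gain
    noise = rng.multivariate_normal(np.zeros(2), Gamma, size=J)
    U = U + (y + noise - GU) @ K.T               # particle update

print(U.mean(axis=0), "vs true", u_true)         # ensemble mean should end up near u_true
```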
A fundamental goal in the theory of deep learning is to explain why the optimization of the loss function of a neural network does not seem to be affected by the presence of non-global local minima. Even in the case of linear networks, the existing literature paints a purely analytical picture of the loss, and provides no explanation as to *why* such architectures exhibit no bad local minima. We explain the intrinsic geometric reasons for this behavior of linear networks.
For neural networks in general, we discuss the neuromanifold, i.e., the space of functions parameterized by a network with a fixed architecture. For instance, the neuromanifold of a linear network is a determinantal variety, a classical object of study in algebraic geometry. We introduce a natural distinction between pure critical points, which only depend on the neuromanifold, and spurious critical points, which arise from the parameterization.
This talk is based on joint work with Matthew Trager and Joan Bruna.
Although biologically inspired neural networks were originally introduced as multilayer computational models, shallow networks dominated applications until the recent renewal of interest in deep architectures. Experimental evidence and the successful applications of deep networks raise theoretical questions: when and why are deep networks better than shallow ones?
This lecture will present some probabilistic and constructive results showing limitations of shallow networks. It will show how geometrical properties of high-dimensional spaces imply probabilistic lower bounds on network complexity. Bounds depending on covering numbers of dictionaries of computational units will be derived and combined with estimates of the sizes of some common dictionaries used in neurocomputing. Probabilistic results will be complemented by constructive ones built using Hadamard matrices and pseudo-noise sequences. The results will be illustrated by an example of a class of functions that can be computed by two-hidden-layer perceptron networks of considerably smaller model complexity than by networks with only one hidden layer. Connections with the No Free Lunch Theorem and the central paradox of coding theory will be discussed.
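The constructive results mentioned above use Hadamard matrices; as a reminder of the basic object only (not of the lecture's construction), here is the standard Sylvester recursion producing Hadamard matrices of order $2^k$.

```python
# Sylvester construction of Hadamard matrices: entries are +-1 and rows are
# mutually orthogonal, i.e. H H^T = 2^k I (background illustration only).
import numpy as np

def sylvester_hadamard(k):
    H = np.array([[1]])
    for _ in range(k):
        H = np.block([[H, H], [H, -H]])          # doubling step
    return H

H = sylvester_hadamard(3)
print(H @ H.T)                                   # 8 times the identity matrix
```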
In this talk we will discuss how known results from two areas of research -- supervised learning theory and numerical integration -- can be used in sampling discretization of the square norm on different function classes.
In this talk I will explore from different perspectives the duality between geometry and data, and mostly focus on two of them. First, ideas from geometry and analysis have inspired a variety of popular procedures for supervised and semi-supervised learning tasks, an example of which is the so-called spectral clustering algorithm, where for a given data set $\mathit{X}=\{x_1, \dots, x_n\}$ with a weighted graph structure $\Gamma= (\mathit{X},W)$ on $\mathit{X}$, one uses the spectrum of an associated graph Laplacian to find a meaningful partition of the data set. Conversely, one can ask whether such ML algorithms are more concretely linked with the geometric ideas that inspired them: if the data set consists of samples from a distribution supported on a manifold (or at least approximately so), and the weights depend inversely on the distance between the points, do these procedures converge, as the number of samples goes to infinity, to analogous procedures at the continuum level? I will explore this question mathematically using ideas from the calculus of variations, PDE theory, and optimal transport. Then, I will discuss a second perspective: what is the impact of the "choice of geometry" in learning, and how, in the presence of labeled data, can this be related to the inverse problem of learning geometry from data?
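A minimal sketch of the spectral partitioning procedure referred to above: build a graph with weights that decay with distance, form the graph Laplacian, and split the data according to the sign of the second eigenvector. The two-blob data set, Gaussian weights, and bandwidth are illustrative choices.

```python
# Toy spectral bi-partition: Gaussian-weighted graph on sampled points, unnormalized
# graph Laplacian L = D - W, and a split by the sign of the second eigenvector.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.3, size=(100, 2)),     # two well-separated blobs
               rng.normal([3, 3], 0.3, size=(100, 2))])

eps = 0.5                                                  # kernel bandwidth
D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
W = np.exp(-D2 / (2 * eps ** 2))                           # weights decay with distance
np.fill_diagonal(W, 0.0)

L = np.diag(W.sum(axis=1)) - W                             # graph Laplacian
_, vecs = np.linalg.eigh(L)
labels = (vecs[:, 1] > 0).astype(int)                      # sign of the second eigenvector
print(np.bincount(labels))                                 # roughly a 100 / 100 split
```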
Optimal transport (OT) has become a fundamental mathematical tool at the interface between the calculus of variations, partial differential equations, and probability. It took much longer, however, for this notion to become mainstream in numerical applications. This situation is in large part due to the high computational cost of the underlying optimization problems. There is a recent wave of activity on the use of OT-related methods in fields as diverse as computer vision, computer graphics, statistical inference, machine learning, and image processing. In this talk, I will review an emerging class of numerical approaches for the approximate resolution of OT-based optimization problems. This offers a new perspective for the application of OT in imaging sciences (to perform color transfer or shape and texture morphing) and machine learning (to perform clustering, classification, and generative modeling in deep learning). More information and references can be found on the website of our book "Computational Optimal Transport": https://optimaltransport.github.io/
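One prominent member of this class of approaches is entropic regularization solved by Sinkhorn iterations, treated at length in the book cited above; the following sketch uses toy one-dimensional histograms and an illustrative regularization strength.

```python
# Entropic OT between two discrete measures via Sinkhorn's matrix scaling
# (toy problem; cost, supports, and regularization eps are illustrative).
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 60
x = np.sort(rng.uniform(0, 1, n))                # source support points
y = np.sort(rng.uniform(0, 1, m))                # target support points
a = np.ones(n) / n                               # source histogram
b = np.ones(m) / m                               # target histogram

C = (x[:, None] - y[None, :]) ** 2               # squared-distance cost matrix
eps = 0.01                                       # entropic regularization strength
K = np.exp(-C / eps)                             # Gibbs kernel

u, v = np.ones(n), np.ones(m)
for _ in range(500):                             # alternate scalings to match marginals
    u = a / (K @ v)
    v = b / (K.T @ u)

P = u[:, None] * K * v[None, :]                  # approximate optimal coupling
print(np.sum(P * C))                             # regularized transport cost
```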
Compressed sensing (CS) is a new sampling theory and has become a major research direction in applied mathematics in the last 10 years. The key idea of CS for addressing the big data problem is to avoid sampling data that can be recovered afterwards. However, mathematical recovery guarantees depend on assumptions that are often too strong in practice. The extension of the mathematical theory as well as the development of new applications in various fields are the subject of many current research activities in the field. The talk will highlight some of the challenges of bridging the gap between theory and practicality of CS.
Blind deconvolution problems are ubiquitous in many areas of imaging and technology and have been the object of study for several decades. Recently, inspired by the theory of compressed sensing, a new viewpoint has been introduced, motivated by applications in wireless communication, where a signal is transmitted through an unknown channel. Namely, the idea is to randomly embed the signal into a higher-dimensional space before transmission. Due to the resulting redundancy, one can hope to recover both the signal and the channel parameters. In this talk we will discuss recovery guarantees for this problem, focusing on convex approaches based on lifting, as first studied by Ahmed et al. (2014). We show that one encounters a fundamentally different geometric behavior as compared to generic bilinear measurements. In addition, we will review recent progress on the study of efficient nonconvex recovery methods. This talk is based on joint work with Dominik Stöger (TUM).
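A toy sketch of the measurement model behind this lifting viewpoint: a short message is randomly embedded into a higher-dimensional space, the channel lives in a known low-dimensional subspace, and one observes their circular convolution. The dimensions and matrices below are illustrative choices, not those of the cited work.

```python
# Toy blind-deconvolution measurement model: y = w * x (circular convolution), with
# w = B h in a known subspace and x = C m a random embedding of the message m.
import numpy as np

rng = np.random.default_rng(0)
L, K, N = 128, 8, 8                              # observation length, channel dim, message dim

B = np.zeros((L, K)); B[:K, :K] = np.eye(K)      # channel subspace: short impulse response
C = rng.normal(size=(L, N)) / np.sqrt(L)         # random embedding into the long block

h = rng.normal(size=K)                           # unknown channel coefficients
m = rng.normal(size=N)                           # unknown message

w, x = B @ h, C @ m
y = np.fft.ifft(np.fft.fft(w) * np.fft.fft(x)).real   # observed circular convolution

# Lifting: y depends linearly on the rank-one matrix h m^T, so one can hope to recover
# both unknowns (up to scaling) by convex optimization over low-rank matrices.
print(y.shape)
```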
Estimating covariances between financial assets plays an important role in risk management and portfolio allocation. Here, we show that, from a Bayesian perspective, market factor models, such as the famous CAPM, can be understood as linear latent space embeddings. Based on this insight, we consider extensions allowing for non-linear embeddings via Gaussian processes and discuss some applications.
In general, all these models are unidentified, as the choice of coordinate frame for the latent space is arbitrary. To remove this symmetry, we reparameterize the factor loadings in terms of an orthogonal frame and its singular values and provide an efficient implementation based on Householder transformations. Finally, relying on results from random matrix theory, we derive the parameter distribution corresponding to a Gaussian prior on the factor loadings.
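A minimal sketch of the linear latent-embedding view and of the symmetry mentioned above: loadings $B$ and idiosyncratic variances $D$ imply the covariance $BB^\top + D$, and rotating the latent coordinate frame leaves the covariance unchanged. Dimensions and values are illustrative.

```python
# Toy linear factor model: returns r = B f + eps imply covariance B B^T + D, and the
# loadings are only identified up to an orthogonal rotation of the latent space.
import numpy as np

rng = np.random.default_rng(0)
n_assets, n_factors = 10, 2

B = rng.normal(size=(n_assets, n_factors))       # factor loadings (latent embedding)
D = np.diag(rng.uniform(0.05, 0.2, n_assets))    # idiosyncratic variances
Sigma = B @ B.T + D                              # implied asset covariance

Q, _ = np.linalg.qr(rng.normal(size=(n_factors, n_factors)))   # random orthogonal frame
Sigma_rotated = (B @ Q) @ (B @ Q).T + D
print(np.allclose(Sigma, Sigma_rotated))         # True: same covariance, different frame
```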
In this talk, which is based on joint work with Benjamin Gess and Arnulf Jentzen, we will discuss convergence rates for mean field stochastic gradient descent algorithms. Such algorithms play an important role in machine learning and deep learning applications, where the goal of the algorithm is to minimize a deterministic potential, such as the one arising in the optimization of an artificial neural network. We do not assume that the dynamics associated to the deterministic gradient descent of the potential are globally attracting to the minimum, nor do we assume that the critical points of the potential are nondegenerate. This allows for the types of degeneracies observed in practice in the optimization of certain neural networks. We will furthermore show informally that the computational efficiency of the algorithm is nearly optimal in the limit that the learning rate approaches one.
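As a toy illustration of allowing degenerate critical points (this is not the mean field setting of the talk), consider stochastic gradient descent on $V(x) = x^4$, whose minimum at $0$ satisfies $V''(0) = 0$.

```python
# SGD with noisy gradients on the degenerate potential V(x) = x^4 (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

def noisy_grad(x):
    return 4 * x ** 3 + rng.normal(scale=0.1)    # gradient of x^4 plus sampling noise

x = 1.0
for n in range(1, 50001):
    lr = 0.2 / n ** 0.75                         # decaying learning rate
    x -= lr * noisy_grad(x)

print(x)   # ends up close to the degenerate minimum at 0; progress is slow since V''(0) = 0
```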
This is a topical survey talk about certain canonical dynamical systems on the space of probability measures, considered as an infinite-dimensional manifold equipped with Otto's Riemannian structure for optimal transportation. Apart from classical gradient flows, we will discuss Hamiltonian systems and connections to quantum mechanics, as well as candidate processes for what can be considered Brownian motion associated with the corresponding infinite-dimensional Laplace-Beltrami operator.
Many problems in signal/image processing and computer vision amount to estimating a signal, image, or three-dimensional structure/scene from corrupted measurements. A particularly challenging form of measurement corruption is latent transformations of the underlying signal to be recovered. Many such transformations can be described as a group acting on the object to be recovered. Examples include the Simultaneous Localization and Mapping (SLAM) problem in robotics and computer vision, where pictures of a scene are obtained from different positions and orientations; Cryo-Electron Microscopy (Cryo-EM) imaging, where projections of a molecule's density are taken from unknown rotations; and several others.
One fundamental example of this type of problem is Multi-Reference Alignment: given a group acting on a space, the goal is to estimate an orbit of the group action from noisy samples. For example, in one of its simplest forms, one is tasked with estimating a signal from noisy, cyclically shifted copies. We will show that the number of observations needed by any method has a surprising dependency on the signal-to-noise ratio (SNR) and on algebraic properties of the underlying group action. Remarkably, in some important cases, this sample complexity is achieved with computationally efficient methods based on computing invariants under the group of transformations.
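A minimal sketch of the invariants idea in the simplest form above: the power spectrum is invariant under cyclic shifts, so it can be averaged over the noisy shifted copies without aligning them first; recovering the signal itself additionally requires phase information (e.g. from higher-order invariants such as the bispectrum). Signal, noise level, and sample size are illustrative.

```python
# Toy multi-reference alignment: estimate the shift-invariant power spectrum of an
# unknown signal from noisy, randomly cyclically shifted copies (illustrative example).
import numpy as np

rng = np.random.default_rng(0)
n, M, sigma = 32, 5000, 1.0                      # signal length, observations, noise level
x = np.sin(2 * np.pi * np.arange(n) / n) + (np.arange(n) == 3)   # unknown signal

obs = np.array([np.roll(x, rng.integers(n)) + sigma * rng.normal(size=n)
                for _ in range(M)])              # noisy, cyclically shifted copies

# Average the shift-invariant power spectra and subtract the known noise bias n*sigma^2.
ps = np.mean(np.abs(np.fft.fft(obs, axis=1)) ** 2, axis=0) - n * sigma ** 2
print(np.round(ps[:4]))
print(np.round(np.abs(np.fft.fft(x)[:4]) ** 2))  # estimate roughly matches the true spectrum
```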
The sequence of moments characterizes the law of (sufficiently nice) random variables in finite-dimensional spaces. I will talk about an analogous result for path-valued random variables, that is, stochastic processes, and develop applications in statistical learning. Our starting point is the so-called signature of a path, which identifies a path with an element of a tensor algebra. (Joint work with Ilya Chevyrev).
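As a small illustration, the first two levels of the signature of a piecewise-linear path can be computed directly from its increments; the closed-loop example and the truncation at level two are illustrative choices (higher levels live in higher tensor powers).

```python
# Levels 1 and 2 of the signature of a piecewise-linear path in R^d, computed from
# its increments; for a closed loop the antisymmetric part of level 2 gives twice
# the enclosed signed area (illustrative example).
import numpy as np

def signature_levels_1_2(path):
    """path: (N+1, d) array of vertices of a piecewise-linear path in R^d."""
    dx = np.diff(path, axis=0)                   # segment increments
    S1 = dx.sum(axis=0)                          # level 1: total increment
    cum = np.cumsum(dx, axis=0) - dx             # increments strictly before each step
    S2 = cum.T @ dx + 0.5 * dx.T @ dx            # level 2: iterated integrals
    return S1, S2

t = np.linspace(0, 1, 100)
loop = np.stack([np.cos(2 * np.pi * t), np.sin(2 * np.pi * t)], axis=1)
S1, S2 = signature_levels_1_2(loop)
print(S1)                    # ~0: a closed loop has zero total increment
print(S2[0, 1] - S2[1, 0])   # ~2*pi: twice the area enclosed by the (polygonal) loop
```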
Tensors are higher-dimensional analogues of matrices; they are used to record data with multiple changing variables. Interpreting tensor data requires finding low-rank structure, and the structure depends on the application or context. In this talk, we describe four projects in the study of structured tensors. Often the tensors of interest define semi-algebraic sets, given by polynomial equations and inequalities. We give a characterization of the set of tensors of real rank two, and answer questions about statistical models using probability tensors and semi-algebraic statistics. We also study cubic surfaces as symmetric tensors, and describe current work on learning a path from its three-dimensional signature tensor. This talk is based on joint work with Guido Montúfar and Bernd Sturmfels.
We study Bayesian networks based on max-linear structural equations as introduced by Gissibl and Klüppelberg (2015) and provide a summary of their independence properties. In particular we emphasize that distributions for such networks are never faithful to the independence model determined by their associated directed acyclic graph unless the latter is a polytree, in which case they are always faithful. In addition, we consider some of the basic issues of estimation and discuss generalized maximum likelihood estimation of the coefficients, using the concept of a generalized likelihood ratio for non-dominated families as introduced by Kiefer and Wolfowitz. Particular emphasis will be placed on the use of max-times algebra in the formulation and analysis of such models. The lecture is based on joint work with Gissibl and Klüppelberg.
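A minimal sketch of sampling from a recursive max-linear model of this kind, $X_i = \bigvee_{j \in \mathrm{pa}(i)} c_{ij} X_j \vee c_{ii} Z_i$; the diamond-shaped DAG (not a polytree), the coefficients, and the innovation distribution are illustrative choices.

```python
# Toy max-linear Bayesian network on a diamond DAG 1 -> {2, 3} -> 4 with positive
# innovations Z_i: each node is the maximum of weighted parents and its own innovation.
import numpy as np

rng = np.random.default_rng(0)

parents = {1: [], 2: [1], 3: [1], 4: [2, 3]}     # diamond DAG (not a polytree)
c = {(1, 1): 1.0, (2, 2): 1.0, (3, 3): 1.0, (4, 4): 1.0,
     (2, 1): 0.8, (3, 1): 0.5, (4, 2): 0.7, (4, 3): 0.9}

def sample():
    Z = {i: rng.standard_exponential() for i in parents}     # positive innovations
    X = {}
    for i in sorted(parents):                                # topological order
        X[i] = max([c[(i, j)] * X[j] for j in parents[i]] + [c[(i, i)] * Z[i]])
    return X

for _ in range(5):
    print({k: round(v, 3) for k, v in sample().items()})
```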
Given the dramatic successes in machine learning over the past half decade, there has been a resurgence of interest in applying learning techniques to continuous control problems in robotics, self-driving cars, and unmanned aerial vehicles. Though such applications appear to be straightforward generalizations of what is known as reinforcement learning, few fundamental baselines have been established prescribing how well one must know a system in order to control it. In this talk, I will discuss how one might merge techniques from statistical learning theory with robust control to derive baselines for such continuous control. I will further describe how these simple baselines give us insights into shortcomings of existing reinforcement learning methodology. I will close by listing several exciting open problems that must be solved before we can build robust, safe learning systems that interact with an uncertain physical environment.
Joint work with Sarah Dean, Aurelia Guy, Horia Mania, Nikolai Matni, and Stephen Tu.
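A toy sketch of the kind of baseline discussed above: identify a linear system from randomly excited trajectories by least squares, then compute an LQR controller for the estimated model. The system, noise levels, and costs are illustrative, not those studied in the joint work.

```python
# Toy identify-then-control pipeline: least-squares estimation of (A, B) followed by
# LQR via Riccati value iteration on the estimated model (illustrative example).
import numpy as np

rng = np.random.default_rng(0)
A_true = np.array([[1.01, 0.1], [0.0, 1.01]])    # slightly unstable system
B_true = np.array([[0.0], [1.0]])

X, U, Xn = [], [], []                            # data: x_{t+1} = A x_t + B u_t + w_t
for _ in range(200):
    x = rng.normal(size=2)
    for _ in range(10):
        u = rng.normal(size=1)                   # random exploratory input
        xn = A_true @ x + B_true @ u + 0.01 * rng.normal(size=2)
        X.append(x); U.append(u); Xn.append(xn)
        x = xn

Z = np.hstack([np.array(X), np.array(U)])        # regressors [x_t, u_t]
Theta, *_ = np.linalg.lstsq(Z, np.array(Xn), rcond=None)
A_hat, B_hat = Theta[:2].T, Theta[2:].T          # least-squares estimates of A and B

Q, R = np.eye(2), np.eye(1)                      # LQR costs
P = np.eye(2)
for _ in range(500):                             # Riccati value iteration
    K = np.linalg.solve(R + B_hat.T @ P @ B_hat, B_hat.T @ P @ A_hat)
    P = Q + A_hat.T @ P @ (A_hat - B_hat @ K)

print(np.round(A_hat, 3))                        # estimated dynamics
print(np.round(K, 3))                            # LQR gain for u = -K x
```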