Constrained least squares problems are fundamental in applied mathematics, statistics, and many other disciplines. In practice, their memory and computation costs are expensive for high-dimensional input data. We apply the so-called "sketching" strategy to project the problem onto a space of much lower dimension while maintaining approximation accuracy. Tensor structure often appears in the data matrices of optimization problems. In this talk, we will discuss: 1. sketching designs for optimization problems whose data have a special tensor format; 2. theoretical guarantees of tensor sketching for optimization over general constraint sets, for instance, linear regression and sparse recovery. The sketching dimension depends on a statistical complexity that characterizes the geometry of the underlying problem.
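To fix ideas, here is a minimal "sketch-and-solve" example in Python for an unconstrained least squares problem, using a plain Gaussian sketch. The dimensions and the choice of sketch are illustrative only, not taken from the talk (which concerns tensor-structured sketches and constrained problems):

```python
import numpy as np

rng = np.random.default_rng(0)

# Overdetermined least-squares problem: min_x ||A x - b||_2
n, d = 1000, 20
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# Gaussian sketch: project the n-row problem down to m << n rows.
m = 200
S = rng.standard_normal((m, n)) / np.sqrt(m)

# Solve the sketched problem min_x ||S A x - S b||_2 instead.
x_sketch, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)

res_sketch = np.linalg.norm(A @ x_sketch - b)
res_exact = np.linalg.norm(A @ x_exact - b)
print(res_sketch / res_exact)  # near 1: near-optimal residual
```

The point is that only the much smaller m-by-d system is solved, while the residual on the original problem stays close to optimal.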

Machine learning and information theory tasks are in some sense equivalent, since both involve identifying patterns and regularities in data. To recognize an elephant, a child (or a neural network) observes the repeating pattern of big ears, a trunk, and grey skin. To compress a book, a compression algorithm searches for frequently repeated letters or words. So the high-level question that guided this research is:
When is learning equivalent to compression?
We use the quantity I(input; output), the mutual information between the training data and the output hypothesis of the learning algorithm, to measure the compression of the algorithm. Under this information-theoretic setting, these two notions are indeed equivalent.
a) Compression implies learning. We will show that learning algorithms that retain a small amount of information from their input sample generalize.
b) Learning implies compression. We will show that under an average-case analysis, every hypothesis class of finite VC dimension (a characterization of learnable classes) has empirical risk minimizers (ERM) that do not reveal a lot of information.
If time permits, we will discuss a worst-case lower bound we proved by presenting a simple concept class for which every empirical risk minimizer (even a randomized one) must reveal a lot of information. We will also observe connections with well-studied notions such as sample compression schemes, Occam's razor, PAC-Bayes, and differential privacy.
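One well-known quantitative form of "compression implies learning" in this information-theoretic setting is the Xu–Raginsky bound (stated here for context; the talk's precise results may differ). For a sample $S$ of $n$ i.i.d. examples, an output hypothesis $W$, and a $\sigma$-subgaussian loss, the expected generalization gap satisfies

```latex
\bigl|\, \mathbb{E}[\mathrm{gen}(S, W)] \,\bigr|
  \;\le\; \sqrt{\frac{2\sigma^{2}\, I(S; W)}{n}},
```

so an algorithm that retains little information about its sample, i.e. small $I(S;W)$, cannot overfit by much on average.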

Graph representation learning has many real-world applications, from image processing, facial recognition, drug repurposing, and social network analysis to protein classification. For representation learning on graphs, an effective representation of graph data is crucial to learning performance. In this paper, we propose a new multiscale representation system for graph data --- decimated framelets. The framelets form a localized, tight framelet system on the graph via a data-driven filter bank. We establish discrete framelet transforms to decompose and reconstruct data on the graph under the decimated framelets. The graph framelets are built on a chain-based orthonormal basis which admits fast graph Fourier transforms. From this, we then give a fast algorithm for the discrete framelet transforms on the graph --- {\fgt} --- which has computational cost $\bigo{}{N\log N}$ for a graph of size $N$. Numerical examples verify the theory of decimated framelets and {\fgt}. Applications to real-world examples, such as multiresolution analysis of traffic networks and node classification by graph neural networks, illustrate that decimated framelets provide a useful graph data representation and that {\fgt} is critical to improving computing performance.
This is joint work with Xuebin Zheng, Bingxin Zhou and Xiaosheng Zhuang.
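For readers unfamiliar with the spectral machinery underlying such transforms, here is a minimal Python illustration of the graph Fourier transform on which fast graph transforms are built. This is not the decimated framelet construction itself, and the small graph is purely illustrative:

```python
import numpy as np

# Small undirected graph given by its symmetric adjacency matrix W.
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# Combinatorial graph Laplacian L = D - W.
D = np.diag(W.sum(axis=1))
L = D - W

# Graph Fourier basis: eigenvectors of L (orthonormal since L is symmetric).
eigvals, U = np.linalg.eigh(L)

# Forward transform of a graph signal f, and inverse reconstruction.
f = np.array([1.0, -2.0, 0.5, 3.0])
f_hat = U.T @ f          # spectral coefficients of the signal
f_rec = U @ f_hat        # inverse transform recovers f exactly
print(np.allclose(f_rec, f))  # True
```

A framelet system refines this picture by combining such spectral decompositions with filter banks localized across scales; the fast algorithm in the paper avoids the dense eigendecomposition used in this toy example.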

The success of deep neural networks is in part due to the use of normalization layers. Normalization layers like Batch Normalization, Layer Normalization and Weight Normalization are ubiquitous in practice, as they improve generalization performance and speed up training significantly. Nonetheless, the vast majority of current deep learning theory and non-convex optimization literature focuses on the un-normalized setting, where the functions under consideration do not exhibit the properties of commonly normalized neural networks. In this paper, we bridge this gap by giving the first global convergence result for two-layer neural networks with ReLU activations trained with a normalization layer, namely Weight Normalization. Our analysis shows how the introduction of normalization layers changes the optimization landscape and can enable faster convergence as compared with un-normalized neural networks.
This is joint work with Quanquan Gu and Guido Montufar.
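For reference, Weight Normalization reparameterizes each weight vector as w = g · v / ||v||, decoupling its scale from its direction. A minimal NumPy sketch of a two-layer ReLU network under this reparameterization follows; the dimensions and variable names are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

d, h = 5, 8          # input dimension, hidden width

# Weight Normalization: w_i = g_i * v_i / ||v_i|| for each hidden unit i.
V = rng.standard_normal((h, d))   # unconstrained directions
g = np.ones(h)                    # trainable per-unit scales
a = rng.standard_normal(h)        # second-layer weights

def effective_weights(V, g):
    # Rows of W have norm exactly g_i, regardless of the scale of V.
    return (g / np.linalg.norm(V, axis=1))[:, None] * V

def forward(x):
    W = effective_weights(V, g)
    return a @ np.maximum(W @ x, 0.0)  # two-layer ReLU network output

x = rng.standard_normal(d)
W_eff = effective_weights(V, g)
print(np.linalg.norm(W_eff, axis=1))  # each row norm equals g_i
```

Gradients are then taken with respect to (V, g) rather than W, which is what reshapes the optimization landscape studied in the paper.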

Residual networks (ResNets) are a deep learning architecture that substantially improved state-of-the-art performance in certain supervised learning tasks, and they have since received continuously growing attention. ResNets have a recursive structure x_{k+1} = x_k + R_k(x_k), where R_k is a neural network called a residual block. This structure can be seen as the Euler discretisation of an associated ordinary differential equation (ODE), called a neural ODE. Recently, ResNets were proposed as space-time approximations of ODEs which are not of this neural type. To elaborate on this connection, we show that by increasing the number of residual blocks as well as their expressivity, the solution of an arbitrary ODE can be approximated in space and time simultaneously by deep ReLU ResNets. Further, we derive estimates on the complexity of the residual blocks required to obtain a prescribed accuracy under certain regularity assumptions.
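The recursion x_{k+1} = x_k + R_k(x_k) matches an explicit Euler step x_{k+1} = x_k + h f(x_k) whenever R_k(x) = h f(x). A small Python illustration for the ODE x' = -x shows how stacking more residual blocks refines the approximation; the specific ODE and step counts are chosen here for illustration, not taken from the talk:

```python
import math

def residual_block(x, h):
    # R_k(x) = h * f(x) with f(x) = -x; in a trained ResNet, R_k would
    # be a ReLU network playing the role of h * f.
    return -h * x

def resnet_forward(x0, num_blocks, T=1.0):
    h = T / num_blocks          # step size shrinks as depth grows
    x = x0
    for _ in range(num_blocks):
        x = x + residual_block(x, h)   # x_{k+1} = x_k + R_k(x_k)
    return x

exact = math.exp(-1.0)          # solution of x' = -x at t = 1, x(0) = 1
for k in (10, 100, 1000):
    approx = resnet_forward(1.0, k)
    print(k, abs(approx - exact))  # error shrinks as blocks are added
```

The paper's results quantify the analogous trade-off when the residual blocks are ReLU networks of bounded complexity rather than the exact map h · f.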