Towards Understanding and Advancing the Generalization of Adam in Deep Learning
- Yuan Cao (University of California, Los Angeles)
Adaptive gradient methods such as Adam have gained increasing popularity in deep learning optimization. However, it has been observed that, compared with (stochastic) gradient descent, Adam can converge to a different solution with significantly worse test error in many deep learning applications such as image classification, even with well-tuned regularization. In this talk, I will discuss some recent results on the convergence and generalization of Adam in training neural networks, and give a theoretical explanation for the difference between Adam and SGD. I will also present a new deep learning optimizer called the partially adaptive momentum estimation method, which achieves faster convergence rates and smaller test errors than Adam and SGD with momentum on various deep learning tasks.
Yuan Cao is an assistant professor in the Department of Statistics and Actuarial Science and the Department of Mathematics at the University of Hong Kong. Before joining HKU, he was a postdoctoral scholar at UCLA, working with Professor Quanquan Gu. He received his B.S. from Fudan University and his Ph.D. from Princeton University. Yuan’s research interests include the theory of deep learning, non-convex optimization, and high-dimensional statistics. He has published research papers in top machine learning journals (ML) and conferences (NeurIPS, ICML, AAAI, IJCAI, etc.), including a spotlight presentation at NeurIPS 2019 and a long talk at ICML 2021.