Why are the logits of trained models distorted? A theory of overfitting for imbalanced classification
- Jingyang Lyu (University of Wisconsin–Madison)
Abstract
Data imbalance presents a fundamental challenge in data analysis: certain classes (minority classes) account for only a small fraction of the training data compared with other classes (majority classes). Many existing techniques attempt to compensate for the underrepresentation of minority classes, yet in deep learning practice the issue is exacerbated by large model sizes. Despite extensive heuristics, the statistical foundations underlying these methods remain underdeveloped, raising concerns about the reliability of machine learning models. Classical statistical theory, based on large-sample asymptotics and finite-sample corrections, often falls short in high-dimensional settings, leaving many overfitting phenomena unexplained.
In this talk, I will examine the problem of imbalanced classification in high dimensions, focusing on support vector machines (SVMs) and logistic regression. I will introduce a "truncation" phenomenon, a distortion of the training data's logit distribution caused by overfitting in high dimensions, which we have observed across single-cell tabular data, image data, and text data. I will then provide a theoretical foundation by characterizing the asymptotic distribution of the training logits via a variational formulation. This analysis formalizes the intuition that overfitting disproportionately harms minority classes and shows how margin rebalancing, a widely used deep learning heuristic, helps mitigate data imbalance. Consequently, the theory yields both qualitative and quantitative insights into generalization error and uncertainty quantification, particularly in the context of model calibration.
This talk is based on joint work with Kangjie Zhou (Columbia) and my advisor Yiqiao Zhong (UW–Madison).
Link to the paper: arxiv.org/abs/2502.11323
Reproduction code: github.com/jlyu55/Imbalanced_Classification
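For a quick feel of the truncation phenomenon described in the abstract, here is a minimal, self-contained Python sketch. It is my own illustration, not code from the repository above: it fits weakly regularized logistic regression to synthetic, high-dimensional imbalanced Gaussian data, where the sample size n, dimension d, imbalance ratio, and signal strength are all assumed values chosen only to place the problem in the nearly separable, overfitting regime.

```python
# Minimal sketch (illustrative only; not from the linked repository).
# Idea: when n is comparable to d, the training data are nearly linearly
# separable, so the fitted model is close to a max-margin solution and
# the training logits pile up beyond the margin rather than following
# the true logit distribution; the minority class is hit hardest.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d, pi = 1000, 800, 0.05              # high-dimensional regime: n ~ d (assumed)
mu = np.full(d, 2.0 / np.sqrt(d))       # class-mean direction, ||mu|| = 2 (assumed)

y = (rng.random(n) < pi).astype(int)    # ~5% minority class (label 1)
X = rng.standard_normal((n, d)) + np.outer(2 * y - 1, mu)

# Weak regularization (large C) approximates the overfitted,
# near max-margin solution on (nearly) separable data.
clf = LogisticRegression(C=1e4, max_iter=10_000).fit(X, y)
train_logits = clf.decision_function(X)

# Summarize the fitted training logits per class.
for label in (0, 1):
    s = train_logits[y == label]
    name = "minority" if label == 1 else "majority"
    print(f"{name}: n={s.size:4d}  mean={s.mean():+.2f}  "
          f"min={s.min():+.2f}  max={s.max():+.2f}")
```

Under these assumed settings, the per-class summaries typically show every training logit landing on the correct side of a visible margin, with logit magnitudes far larger than the true signal scale; the paper makes this distortion precise and analyzes margin rebalancing (training with class-dependent margins) as a mitigation, which goes beyond the post-hoc inspection sketched here.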