A Statistical Theory of Overfitting for Imbalanced Classification

Jan 1, 2026

Jingyang Lyu

Kangjie Zhou

Yiqiao Zhong

Empirical Logit Distribution (ELD) and Testing Logit Distribution (TLD)

Abstract

Classification with imbalanced data is a common challenge in machine learning, where minority classes form only a small fraction of the training samples. Classical theory, relying on large-sample asymptotics and finite-sample corrections, is often ineffective in high dimensions, leaving many overfitting phenomena unexplained. In this paper, we develop a statistical theory for high-dimensional imbalanced linear classification, showing that dimensionality induces truncation or skewing effects on the logit distribution, which we characterize via a variational problem. For linearly separable Gaussian mixtures, logits follow $\mathsf{N}(0, 1)$ on the test set but converge to $\max\{\kappa, \mathsf{N}(0, 1)\}$ on the training set—a pervasive phenomenon we confirm on tabular, image, and text data. This phenomenon explains why the minority class is more severely affected by overfitting. We further show that margin rebalancing mitigates minority accuracy drop and provide theoretical insights into calibration and uncertainty quantification.
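As a rough illustration of the truncation effect described in the abstract, the sketch below draws standard Gaussian values as stand-ins for test logits and applies $\max\{\kappa, \cdot\}$ to them as stand-ins for training logits. It is not taken from the paper's code: the value of `kappa`, the sample size, and the helper names `normal_cdf` and `sample_logits` are assumptions made for illustration, and the sketch reads the abstract's statement literally, ignoring class labels, class means, and any model fitting. Under that reading, the training distribution has no values below $\kappa$ and carries a point mass at $\kappa$ of size $\Phi(\kappa)$, where $\Phi$ is the standard normal CDF.

```python
import math
import random


def normal_cdf(x: float) -> float:
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def sample_logits(n: int, kappa: float, seed: int = 0):
    """Draw n test-style logits Z ~ N(0, 1) and the matching
    train-style logits max(kappa, Z) (illustrative stand-ins only)."""
    rng = random.Random(seed)
    test_logits = [rng.gauss(0.0, 1.0) for _ in range(n)]
    train_logits = [max(kappa, z) for z in test_logits]
    return test_logits, train_logits


kappa = 0.5  # hypothetical margin value, chosen only for illustration
test_logits, train_logits = sample_logits(100_000, kappa)

# Fraction of train-style logits sitting exactly at kappa.
mass_at_kappa = sum(1 for t in train_logits if t == kappa) / len(train_logits)

print(f"empirical mass at kappa: {mass_at_kappa:.3f}")
print(f"Phi(kappa):              {normal_cdf(kappa):.3f}")
print(f"min train-style logit:   {min(train_logits):.3f}")
print(f"min test-style logit:    {min(test_logits):.3f}")
```

The two printed masses should agree up to sampling noise, while the minimum of the train-style logits equals `kappa` and the test-style logits extend well below it. This gap between the two distributions is what the abstract identifies as the signature of overfitting in high dimensions.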

Type

Journal article

Publication

The Fourteenth International Conference on Learning Representations

Last updated on Jan 1, 2026

Statistical Foundation of Deep Learning

Authors

Jingyang Lyu

Ph.D. student
