The Structure of Cross-Validation Error: Stability, Covariance, and Minimax Limits

Ido Nachum (University of Haifa)
Monday, November 24th 2025, 2-3pm
Abstract:
Despite extensive theoretical work on cross-validation (CV), many foundational questions about its statistical behavior remain open. In this talk, I’ll explore how the interplay between algorithmic and distributional properties influences the optimal choice of the number of folds in k-fold CV. I’ll present a new decomposition of the mean-squared error (MSE) of CV risk estimates that explicitly accounts for correlations between overlapping folds. This framework introduces a new and weaker notion of algorithmic stability, distinct from the hypothesis stability typically assumed in prior analyses.
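To get a concrete feel for the covariance terms such a decomposition must account for, here is a minimal toy sketch (my own illustration, not the talk's framework): since L̂ₖ^CV is the average of k fold losses computed on overlapping training sets, Var(L̂ₖ^CV) = (1/k²) Σᵢⱼ Cov(L̂ᵢ, L̂ⱼ), with the off-diagonal terms driven by fold overlap. The snippet checks this identity empirically for a deliberately simple learning rule (predict the training mean) under squared loss:

```python
import numpy as np

def fold_losses(data, k):
    """Per-fold validation losses of k-fold CV for a mean predictor
    under squared loss (a deliberately simple learning rule)."""
    folds = np.array_split(data, k)
    losses = []
    for j in range(k):
        train = np.concatenate([folds[i] for i in range(k) if i != j])
        mu = train.mean()                      # the "learned" hypothesis
        losses.append(np.mean((folds[j] - mu) ** 2))
    return np.array(losses)

rng = np.random.default_rng(0)
n, k, trials = 60, 5, 2000
per_fold = np.array([fold_losses(rng.normal(size=n), k) for _ in range(trials)])

cv_estimates = per_fold.mean(axis=1)          # L̂ₖ^CV, one per trial
var_cv = cv_estimates.var(ddof=1)

# Var(L̂ₖ^CV) = (1/k²) Σᵢⱼ Cov(L̂ᵢ, L̂ⱼ): fold variances on the diagonal,
# overlap-induced covariances off the diagonal.
C = np.cov(per_fold, rowvar=False)            # k×k sample covariance matrix
decomposed = C.sum() / k**2

off_diag_share = (C.sum() - np.trace(C)) / C.sum()
print(f"Var(CV) = {var_cv:.5f}, decomposition = {decomposed:.5f}")
print(f"off-diagonal (fold-overlap) share of the sum: {off_diag_share:.2%}")
```

The nontrivial off-diagonal share is exactly why CV's MSE does not shrink like that of k independent validation sets.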
The main results are two complementary bounds. First, for any learning rule that minimizes empirical risk, we prove a minimax lower bound
min_k max_D E[(L̂ₖ^CV − L_D)²] = Ω(√k / n),
where n is the sample size, k is the number of folds, and D is the underlying distribution. This indicates that even under ideal conditions, CV incurs an unavoidable penalty due to fold dependence and cannot achieve the optimal 1/n rate of an independent validation set. Second, we construct learning rules that achieve
min_k max_D E[(L̂ₖ^CV − L_D)²] = O(k / n),
matching the accuracy of a hold-out estimator based on a single fold.
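This trade-off between averaging over folds and fold dependence can be probed with a quick Monte-Carlo sketch (a toy setting of my choosing, not the paper's construction: mean predictor, squared loss, N(0,1) data, so the population risk of a hypothesis μ̂ has the closed form 1 + μ̂²). It compares the squared error of the full k-fold CV estimate against that of a single hold-out fold:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, trials = 60, 5, 4000

def one_trial():
    data = rng.normal(size=n)
    folds = np.array_split(data, k)
    fold_losses = []
    for j in range(k):
        train = np.concatenate([folds[i] for i in range(k) if i != j])
        mu = train.mean()
        fold_losses.append(np.mean((folds[j] - mu) ** 2))
    # Closed-form population risk of the model trained on all n points:
    # for X ~ N(0, 1) and constant predictor mu, E[(X - mu)^2] = 1 + mu^2.
    true_risk = 1.0 + data.mean() ** 2
    cv_err = (np.mean(fold_losses) - true_risk) ** 2     # full k-fold CV
    holdout_err = (fold_losses[0] - true_risk) ** 2      # single hold-out fold
    return cv_err, holdout_err

errs = np.array([one_trial() for _ in range(trials)])
mse_cv, mse_holdout = errs.mean(axis=0)
print(f"MSE(k-fold CV) = {mse_cv:.5f}, MSE(single fold) = {mse_holdout:.5f}")
```

In this toy setting, averaging the k correlated fold losses still beats a single fold, though by less than the factor k that k independent validation sets would give.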
Together, these results delineate the fundamental limits of resampling-based risk estimation: CV’s minimax performance lies between the √k/n lower bound and the k/n upper bound, revealing an intrinsic trade-off between data reuse and estimator dependence.
Joint work with Ruediger Urbanke and Thomas Weinberger.