By Jing Lei, Carnegie Mellon University, USA, jinglei@andrew.cmu.edu
Modern data analysis and statistical learning are characterized by two defining features: complex data structures and black-box algorithms. The complexity of data structures arises from advanced data collection technologies and data-sharing infrastructures, such as imaging, remote sensing, wearable devices, and genomic sequencing. In parallel, black-box algorithms—particularly those stemming from advances in deep neural networks—have demonstrated remarkable success on modern datasets. This confluence of complex data and opaque models introduces new challenges for uncertainty quantification and statistical inference, a problem we refer to as “black-box inference”.
The difficulty of black-box inference lies in the absence of traditional parametric or nonparametric modeling assumptions, as well as the intractability of the algorithmic behavior underlying many modern estimators. These factors make it difficult to precisely characterize the sampling distribution of estimation errors. A common approach to address this issue is post-hoc randomization, which includes permutation, resampling, sample splitting, cross-validation, and noise injection. When combined with mild assumptions, such as exchangeability in the data-generating process, these methods can yield valid inference and uncertainty quantification.
Post-hoc randomization methods have a rich history, ranging from classical techniques like permutation tests, the jackknife, and the bootstrap, to more recent developments such as conformal inference. These approaches typically require minimal knowledge about the underlying data distribution or the inner workings of the estimation procedure. While originally designed for varied purposes, many of these techniques rely, either implicitly or explicitly, on the assumption that the estimation procedure behaves similarly under small perturbations to the data. This idea, now formalized under the concept of stability, has become a foundational principle in modern data science. Over the past few decades, stability has emerged as a central research focus in both statistics and machine learning, playing critical roles in areas such as generalization error, data privacy, and adaptive inference.
In this article, we investigate one of the most widely used resampling techniques for model comparison and evaluation, cross-validation (CV), through the lens of stability. We begin by reviewing recent theoretical developments in CV for generalization error estimation and model selection under stability assumptions. We then explore more refined results concerning uncertainty quantification for CV-based risk estimates. By integrating these research directions, we uncover new theoretical insights and methodological tools. Finally, we illustrate their utility across both classical and contemporary topics, including model selection, selective inference, and conformal prediction.
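For concreteness, the CV-based risk estimate referred to above can be written in a generic form; the notation below (folds $\mathcal{I}_1,\dots,\mathcal{I}_K$, refitted estimator $\widehat{f}^{(-k)}$, loss $\ell$) is illustrative and not necessarily the notation adopted later in the article:
\[
\widehat{R}_{\mathrm{CV}} \;=\; \frac{1}{K}\sum_{k=1}^{K} \frac{1}{|\mathcal{I}_k|}\sum_{i\in \mathcal{I}_k} \ell\bigl(\widehat{f}^{(-k)}(X_i),\, Y_i\bigr),
\]
where the data are partitioned into $K$ folds, $\widehat{f}^{(-k)}$ is the estimator fitted on all folds except $\mathcal{I}_k$, and $\ell$ is a loss function evaluating predictions on the held-out observations.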