The Bias of Using Cross-Validation in Genomic Predictions and Its Correction

Date
2023-09-21
Abstract
Cross-validation (CV) is a widely used technique in statistical learning for model evaluation and selection. Various statistical learning methods, such as Generalized Least Squares (GLS), Linear Mixed-Effects Models (LMM), and regularization methods, are commonly used in genomic prediction, a field that uses DNA polymorphisms to predict phenotypic traits. However, because genomic data are high-dimensional, relatively small in sample size, and sparse, CV in these settings may underestimate the generalization error. In this work, we analyzed the bias of CV in eight methods: Ordinary Least Squares (OLS), GLS, LMM, Lasso, Ridge, elastic-net (ENET), and two hybrid methods, one combining GLS with Ridge regularization (GLS+Ridge) and the other combining LMM with Ridge regularization (LMM+Ridge). Using genomic data from the 1,000 Genomes Project and simulated phenotypes, our investigation revealed bias in all of these methods. To address this bias, we adapted a variance-structure method known as Cross-Validation Correction (CVc), which adjusts the cross-validation error to give a more accurate estimate of the generalization error. To quantify the performance of the adapted CVc for each method, we applied the trained model to an independently generated dataset, which served as a gold standard for validating the models and calculating the generalization error. The results show that CVc corrected the CV bias for most of the methods above, with two exceptions, ENET and Lasso, for which the bias could not be corrected. Our work revealed substantial bias in the use of CV in genomics, a phenomenon under-appreciated in statistical genomics and medicine. Additionally, we demonstrated that bias-corrected models may be formed by adapting CVc, although more work is needed to cover the full spectrum of methods.
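The evaluation design described in the abstract, in which a K-fold CV error is compared against the error on an independently generated dataset, can be sketched roughly as follows. This is an illustrative sketch only, not the thesis's simulation or its CVc correction: the genotype-like predictors, the shared cluster effect used to induce correlation between samples, the ridge penalty, and all function names (`simulate`, `ridge_fit`, `kfold_cv_mse`) are assumptions made for the example.

```python
import numpy as np

def simulate(n, p, n_clusters, beta, rng, sigma_u=1.0, sigma_e=1.0):
    """Simulate allele-count predictors and a phenotype with a shared cluster effect.

    Hypothetical setup: samples in the same cluster share a random effect,
    which makes their phenotypes correlated.
    """
    X = rng.binomial(2, 0.3, size=(n, p)).astype(float)  # allele counts 0/1/2
    clusters = rng.integers(0, n_clusters, size=n)       # cluster label per sample
    u = rng.normal(0.0, sigma_u, size=n_clusters)        # one effect per cluster
    y = X @ beta + u[clusters] + rng.normal(0.0, sigma_e, size=n)
    return X, y

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate; the intercept is left unpenalized via centering."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    coef = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return coef, y_mean - x_mean @ coef

def mse(X, y, coef, intercept):
    """Mean squared prediction error of a fitted linear model."""
    return float(np.mean((y - (X @ coef + intercept)) ** 2))

def kfold_cv_mse(X, y, lam, k, rng):
    """Plain K-fold CV estimate of prediction error (no correction applied)."""
    idx = rng.permutation(len(y))
    sq_err = 0.0
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)                  # all samples outside the fold
        coef, b0 = ridge_fit(X[train], y[train], lam)
        sq_err += np.sum((y[fold] - (X[fold] @ coef + b0)) ** 2)
    return float(sq_err / len(y))

rng = np.random.default_rng(0)
n, p, lam = 200, 100, 10.0
beta = rng.normal(0.0, 0.1, size=p)

X_tr, y_tr = simulate(n, p, n_clusters=10, beta=beta, rng=rng)
# Independent "gold standard" data: drawn separately, with freshly drawn cluster effects.
X_te, y_te = simulate(2000, p, n_clusters=400, beta=beta, rng=rng)

cv_err = kfold_cv_mse(X_tr, y_tr, lam, k=5, rng=rng)
coef, b0 = ridge_fit(X_tr, y_tr, lam)
test_err = mse(X_te, y_te, coef, b0)

print(f"K-fold CV error:        {cv_err:.3f}")
print(f"Independent-data error: {test_err:.3f}")
```

The gap between the two printed numbers is the quantity a correction such as CVc is meant to close. Its size and sign in this toy setup depend on the assumed correlation structure and penalty, so the sketch should not be read as reproducing the thesis's results.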
Keywords
Cross-Validation, Bias Correction, Genomic Prediction
Citation
Qian, Y. (2023). The bias of using Cross-Validation in genomic predictions and its correction (Master's thesis, University of Calgary, Calgary, Canada). Retrieved from https://prism.ucalgary.ca.