The Bias of Using Cross-Validation in Genomic Predictions and Its Correction

Qian, Yanzhao

dc.contributor.advisor	Greenberg, Matthew
dc.contributor.advisor	Long, Quan
dc.contributor.author	Qian, Yanzhao
dc.contributor.committeemember	Shen, Hua
dc.contributor.committeemember	MacDonald, Matthew Ethan
dc.date.accessioned	2023-09-28T17:37:51Z
dc.date.available	2023-09-28T17:37:51Z
dc.date.issued	2023-09-21
dc.description.abstract	Cross-validation (CV) is a widely used technique in statistical learning for model evaluation and selection. Meanwhile, various statistical learning methods, such as Generalized Least Squares (GLS), Linear Mixed-Effects Models (LMM), and regularization methods, are commonly used in genomic prediction, a field that uses DNA polymorphisms to predict phenotypic traits. However, because genomic data are high-dimensional, sparse, and drawn from relatively small samples, CV in these scenarios may underestimate the generalization error. In this work, we analyzed the bias of CV in eight methods: Ordinary Least Squares (OLS), GLS, LMM, Lasso, Ridge, elastic-net (ENET), and two hybrid methods, one combining GLS with Ridge regularization (GLS+Ridge) and the other combining LMM with Ridge regularization (LMM+Ridge). Using genomic data from the 1,000 Genomes Project and simulated phenotypes, our investigation revealed bias in all of these methods. To address this bias, we adapted a variance-structure method known as Cross-Validation Correction (CVc). This approach aims to correct the cross-validation error so that it provides a more accurate estimate of the generalization error. To quantify the performance of our adapted CVc across all these methods, we applied the trained model to an independently generated dataset, which served as a gold standard for validating the models and calculating the generalization error. The outcomes show that, by leveraging CVc, we corrected the CV bias for most of the methods mentioned above, with two exceptions for which the bias could not be corrected: ENET and Lasso. Our work revealed substantial bias in the use of CV in genomics, a phenomenon under-appreciated in the fields of statistical genomics and medicine. Additionally, we demonstrated that bias-corrected models may be formed by adapting CVc, although more work is needed to cover the full spectrum of methods.
dc.identifier.citation	Qian, Y. (2023). The bias of using Cross-Validation in genomic predictions and its correction (Master's thesis, University of Calgary, Calgary, Canada). Retrieved from https://prism.ucalgary.ca.
dc.identifier.uri	https://hdl.handle.net/1880/117208
dc.identifier.uri	https://doi.org/10.11575/PRISM/42050
dc.language.iso	en
dc.publisher.faculty	Graduate Studies
dc.publisher.institution	University of Calgary
dc.rights	University of Calgary graduate students retain copyright ownership and moral rights for their thesis. You may use this material in any way that is permitted by the Copyright Act or through licensing that has been assigned to the document. For uses that are not allowable under copyright legislation or licensing, you are required to seek permission.
dc.subject	Cross-Validation
dc.subject	Bias Correction
dc.subject	Genomic Prediction
dc.subject.classification	Statistics
dc.title	The Bias of Using Cross-Validation in Genomic Predictions and Its Correction
dc.type	master thesis
thesis.degree.discipline	Mathematics & Statistics
thesis.degree.grantor	University of Calgary
thesis.degree.name	Master of Science (MSc)
ucalgary.thesis.accesssetbystudent	I do not require a thesis withhold – my thesis will have open access and can be viewed and downloaded publicly as soon as possible.