The Bias of Using Cross-Validation in Genomic Predictions and Its Correction

dc.contributor.advisorGreenberg, Matthew
dc.contributor.advisorLong, Quan
dc.contributor.authorQian, Yanzhao
dc.contributor.committeememberShen, Hua
dc.contributor.committeememberMacDonald, Matthew Ethan
dc.date.accessioned2023-09-28T17:37:51Z
dc.date.available2023-09-28T17:37:51Z
dc.date.issued2023-09-21
dc.description.abstractCross-validation (CV) is a widely used technique in statistical learning for model evaluation and selection. Meanwhile, various of statistical learning methods, such as Generalized Least Square (GLS), Linear Mixed-Effects Models (LMM), and regularization methods are commonly used in genomic predictions, a field that utilizes DNA polymorphisms to predict phenotypic traits. However, due to high dimensionality, relatively small sample sizes, and data sparsity in genomic data, CV in these scenarios may lead to an underestimation of the generalization error. In this work, we analyzed the bias of CV in eight methods: Ordinary Least Square (OLS), GLS, LMM, Lasso, Ridge, elastic-net (ENET), and two hybrid methods: one combining GLS with Ridge regularization (GLS+Ridge), and the other combining LMM with Ridge regularization (LMM+Ridge). Leveraging genomics data from the 1,000 Genomes Project and simulated phenotypes, our investigation revealed the presence of bias in all these methods. To address this bias, we adapted a variance-structure method known as Cross-Validation Correction (CVc). This approach aims to rectify the cross-validation error by providing a more accurate estimate of the generalization error. To quantify the performance of our adapted CVc towards all these methods, we applied the trained model to an independently generated dataset, which served as a gold standard for validating the models and calculating the generalization error. The outcomes show that, by leveraging CVc, we corrected the CV bias for most of the methods mentioned above, with two exceptions that are unrectifiable methods: ENET and Lasso. Our work revealed the substantial bias in the use of CV in genomics, a phenomenon under-appreciated by the field of statistical genomics and medicine. Additionally, we demonstrated that bias-corrected models may be formed by adapting CVc, although more work is needed to cover the full spectrum.
dc.identifier.citationQian, Y. (2023). The bias of using Cross-Validation in genomic predictions and its correction (Master's thesis, University of Calgary, Calgary, Canada). Retrieved from https://prism.ucalgary.ca.
dc.identifier.urihttps://hdl.handle.net/1880/117208
dc.identifier.urihttps://doi.org/10.11575/PRISM/42050
dc.language.isoen
dc.publisher.facultyGraduate Studies
dc.publisher.institutionUniversity of Calgary
dc.rightsUniversity of Calgary graduate students retain copyright ownership and moral rights for their thesis. You may use this material in any way that is permitted by the Copyright Act or through licensing that has been assigned to the document. For uses that are not allowable under copyright legislation or licensing, you are required to seek permission.
dc.subjectCross-Validation
dc.subjectBias Correction
dc.subjectGenomic Prediction
dc.subject.classificationStatistics
dc.titleThe Bias of Using Cross-Validation in Genomic Predictions and Its Correction
dc.typemaster thesis
thesis.degree.disciplineMathematics & Statistics
thesis.degree.grantorUniversity of Calgary
thesis.degree.nameMaster of Science (MSc)
ucalgary.thesis.accesssetbystudentI do not require a thesis withhold – my thesis will have open access and can be viewed and downloaded publicly as soon as possible.
Files
Original bundle
Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
ucalgary_2023_qian_yanzhao.pdf
Size:
1.64 MB
Format:
Adobe Portable Document Format
License bundle
Now showing 1 - 1 of 1
No Thumbnail Available
Name:
license.txt
Size:
2.62 KB
Format:
Item-specific license agreed upon to submission
Description: