Cancer biomarker extraction from gene expression microarray data
Date
2008
Authors
Abstract
Bioinformatics is a relatively new field that integrates computer science, mathematics, statistics and biology, with the aim of discovering knowledge hidden within biological data. Gene expression microarray data is among the most widely investigated kinds of biological data. Because a microarray can accommodate the whole genome, global gene expression patterns in different tissues or samples can be profiled in a few days, whereas traditional methods may take months. However, analyzing microarray data is challenging because the number of features (genes) is very large relative to the number of samples. Microarrays have nevertheless been used successfully to study gene expression, which has allowed researchers to investigate different diseases, including cancer. Using microarrays in cancer diagnosis has proved to be very efficient and reliable, but the large number of genes makes the data noisy and difficult to handle. Consequently, identifying relevant genes has received considerable attention.
In this thesis, we combine biological knowledge with machine learning techniques to propose three methods for extracting the most informative genes for cancer classification. The first method is based on double clustering: we first filter the data with a statistical test and then cluster the data iteratively to find the best number of clusters. The genes closest to the centroids of the resulting clusters proved to have high potential as significant features for sample classification. These genes (one per centroid) are used as input for building a classification model. The second method applies the t-test iteratively in a way that eliminates noise from the data. The third method is a hybrid approach that combines statistical tests with entropy-based tests, using the t-test and Singular Value Decomposition (SVD) based entropy. It proved to be effective because it considers both the feature itself and its effect on the data entropy. This approach is the first to combine entropy and statistical significance for gene ranking. We have also developed an SVD-based gene extraction method for multi-class data; it is only introduced at a high level in this thesis, and the details are left as future work. The reported test results demonstrate the applicability and effectiveness of the three proposed approaches.
Index Terms: Classification, clustering, t-test, singular value decomposition, support vector machine, microarray data, gene expression data, over-expression, under-expression.
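As a rough illustration of the filter-then-cluster idea behind the first method in the abstract, the following sketch ranks genes by a two-sample t statistic, clusters the surviving genes, and keeps the gene nearest each centroid. It is not the thesis implementation: the Welch form of the t statistic, plain k-means with a fixed number of clusters, the farthest-point initialisation, the toy data, and every function name here are assumptions made only for this example. In particular, the thesis searches iteratively for the best number of clusters, which this sketch does not attempt.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two groups of expression values."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def filter_genes(expr, labels, keep):
    """Indices of the `keep` genes with the largest |t| between class 0 and class 1."""
    scores = []
    for g, row in enumerate(expr):
        a = [v for v, y in zip(row, labels) if y == 0]
        b = [v for v, y in zip(row, labels) if y == 1]
        scores.append((abs(welch_t(a, b)), g))
    scores.sort(reverse=True)
    return [g for _, g in scores[:keep]]


def sq_dist(u, v):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def kmeans(points, k, iters=20):
    """Plain k-means with a deterministic farthest-point initialisation."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [mean(col) for col in zip(*members)]
    return centroids


def representative_genes(expr, labels, keep, k):
    """Filter genes by |t|, cluster them, return the gene nearest each centroid."""
    kept = filter_genes(expr, labels, keep)
    points = [expr[g] for g in kept]
    reps = []
    for c in kmeans(points, k):
        nearest = min(range(len(points)), key=lambda i: sq_dist(points[i], c))
        reps.append(kept[nearest])
    return reps


# Toy data (hypothetical): rows are genes, columns are samples.
expr = [
    [1.0, 1.2, 0.9, 5.0, 5.1, 4.8],  # gene 0: higher in class 1
    [1.1, 1.0, 1.0, 5.2, 4.9, 5.0],  # gene 1: higher in class 1
    [5.0, 4.9, 5.2, 1.0, 1.1, 0.9],  # gene 2: lower in class 1
    [4.8, 5.1, 5.0, 1.2, 1.0, 1.1],  # gene 3: lower in class 1
    [3.0, 3.1, 2.9, 3.0, 3.2, 2.8],  # gene 4: uninformative
    [2.0, 2.2, 1.9, 2.1, 2.0, 2.1],  # gene 5: uninformative
]
labels = [0, 0, 0, 1, 1, 1]

reps = representative_genes(expr, labels, keep=4, k=2)
print(reps)
```

On this toy input the two uninformative genes are filtered out, and one representative is returned from the over-expressed pair and one from the under-expressed pair. In practice the selected genes would then be passed to a classifier such as a support vector machine.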
Description
Bibliography: p. 115-124
some pages are in colour
Citation
Alshalalfa, M. (2008). Cancer biomarker extraction from gene expression microarray data (Master's thesis, University of Calgary, Calgary, Canada). Retrieved from https://prism.ucalgary.ca. doi:10.11575/PRISM/2304