Browsing by Author "Zhong, Wenyan"
Item (Open Access): Bi-level Variable Selection in Semiparametric Transformation Models for Right Censored Data and Cure Rate Data (2019-01-25)
Zhong, Wenyan; Wu, Jingjing; Lu, Xuewen; Chen, Gemai; De Leon, Alexander R.; Shen, Hua; Kong, Linglong
In this dissertation, I investigated bi-level variable selection in semi-parametric transformation models with right-censored data and in semi-parametric mixture cure models with right-censored and cure rate data, respectively. The transformation models under consideration include the proportional hazards model and the proportional odds model as special cases. In the framework of regularized regression, we proposed a computationally efficient estimation method that selects significant groups and variables simultaneously. Three penalty functions that can integrate the grouping structure of covariates, namely the group bridge, adaptive group bridge and composite group bridge penalties, were adopted for bi-level variable selection. In Chapter 2, the objective function, which consists of the negative weighted partial log-likelihood function plus one of the three penalties, has a parametric form and is convex with respect to the parameters. This leads to an easy implementation of the optimization algorithm, for which convergence is guaranteed numerically. We showed that all three proposed penalized estimators achieve group selection consistency and, moreover, that the adaptive group bridge estimator and the composite group bridge estimator enjoy the oracle properties, i.e., both estimators possess group and individual selection consistency simultaneously and are asymptotically normal as if the truly unimportant covariates were known. In Chapter 3, we further extended the bi-level variable selection procedure to the semi-parametric mixture cure models.
The semi-parametric mixture cure models are formulated by a logistic regression for modelling the cure fraction and a class of semi-parametric transformation models for modelling the survival function of the remaining uncured individuals. By incorporating a cure fraction, the proposed model is more flexible than standard survival models, and the proposed approach is capable of distinguishing important covariates and groups from unimportant ones and estimating covariates' effects simultaneously in both the incidence and the latency parts. We proposed a new iterative EM algorithm to handle two latent variables. We illustrated the finite sample performance of the proposed methods via simulations and two real data examples. Simulation studies indicated that the proposed methods perform well even with a relatively high dimension of covariates.

Item (Open Access): Feature selection for cancer classification using microarray gene expression data (2014-09-29)
Zhong, Wenyan; Wu, Jingjing; Lu, Xuewen
The rapid development of DNA microarray technology enables researchers to measure the expression levels of thousands of genes simultaneously and allows biologists to gain insight easily into the complex interactions in tumours at the gene expression level. Its application in cancer studies has shown great success in both diagnosis and elucidating the pathological mechanism. However, DNA microarray data usually contain thousands of genes, and most of them prove to be uninformative and redundant. Meanwhile, the small sample size of microarray data undermines the diagnostic accuracy of statistical models. Therefore, selecting highly discriminative genes from raw gene expression data can improve the performance of cancer classification and cut down the cost of medical diagnosis. This M.Sc. thesis proposes and investigates a new method of selecting highly discriminative genes for cancer classification based on DNA microarray data.
For the two-group classification problem, the Bhattacharyya distance is proposed to measure the dissimilarity in gene expression levels between the two groups. For any particular gene, we calculate the Bhattacharyya distance between the two groups based on the expression levels of that gene. We use the calculated distances, one for each gene, as a criterion to rank all the genes. Finally, a support vector machine is used to obtain the optimal subset of genes achieving the lowest misclassification rate. Compared with two other methods, SWKC (supervised weighted kernel clustering) (Shim et al., 2009) and SVM-RFE (support vector machine with recursive feature elimination) (Guyon et al., 2002), the proposed method is shown to be more effective and more sensitive to differentially expressed genes. In the simulation study, the proposed method has a much higher recovery rate than the other two methods. The three gene selection methods are also compared on two publicly available real DNA microarray datasets, the colon dataset and the leukemia dataset. Based on three classification performance indexes, i.e., the average number of genes selected, the average number of classification errors in the test set and the misclassification rate, the proposed method gets slightly better classification results than SVM-RFE for the colon dataset at a much lower computational cost. It also achieves better classification results than the SWKC method on both datasets. Finally, we discuss how, in future work, performance could be improved by introducing kernel density estimators and replacing the Bhattacharyya distance with the Hellinger distance as the feature selection criterion. Because kernel density estimation is free of distributional assumptions, the classification results would be more robust than those obtained with the Bhattacharyya distance under the normality assumption.
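
The ranking step described in this abstract can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes each gene's expression in each group is modelled as a univariate normal distribution (the abstract mentions a normal assumption), uses the standard closed-form Bhattacharyya distance between two normals, and leaves out the subsequent support vector machine step. The function names `bhattacharyya_normal` and `rank_genes`, the gene labels, and the toy expression values are all hypothetical.

```python
import math
from statistics import mean, variance


def bhattacharyya_normal(x, y):
    """Bhattacharyya distance between univariate normals fitted to x and y.

    Assumes both samples have at least two values and non-zero variance.
    """
    m1, m2 = mean(x), mean(y)
    v1, v2 = variance(x), variance(y)  # sample variances (n - 1 denominator)
    # Term driven by the difference in group means.
    mean_term = 0.25 * (m1 - m2) ** 2 / (v1 + v2)
    # Term driven by the difference in group variances.
    var_term = 0.5 * math.log((v1 + v2) / (2.0 * math.sqrt(v1 * v2)))
    return mean_term + var_term


def rank_genes(expr, labels):
    """Rank genes by decreasing Bhattacharyya distance between two groups.

    expr: dict mapping a gene name to its expression values, one per sample.
    labels: list of 0/1 group labels, one per sample.
    Returns (ranking, scores).
    """
    scores = {}
    for gene, values in expr.items():
        group0 = [v for v, lab in zip(values, labels) if lab == 0]
        group1 = [v for v, lab in zip(values, labels) if lab == 1]
        scores[gene] = bhattacharyya_normal(group0, group1)
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores


# Toy data (made up for illustration): gene_a differs in mean between groups,
# gene_b differs only slightly in spread.
labels = [0, 0, 0, 1, 1, 1]
expr = {
    "gene_a": [1.0, 1.2, 0.8, 3.0, 3.2, 2.8],
    "gene_b": [1.0, 1.2, 0.8, 1.1, 0.9, 1.0],
}
ranking, scores = rank_genes(expr, labels)
print(ranking)
```

In this toy example the gene whose group means differ ranks first. In a full pipeline, a classifier would then be evaluated on the top-ranked genes to choose the subset size, as the abstract describes.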