Multivariate selection of genetic markers in diagnostic classification
Abstract
Analysis of gene expression data obtained from microarrays presents a new set of challenges to machine learning modeling. In this domain, in which the number of variables far exceeds the number of cases, identifying relevant genes or groups of genes that are good markers for a particular classification is as important as achieving good classification performance. Although several machine learning algorithms have been proposed to address the latter, identification of gene markers has not been systematically pursued. In this article, we investigate several algorithms for selecting gene markers for classification. We test these algorithms using logistic regression, as this is a simple and efficient supervised learning algorithm. We demonstrate, using 10 different data sets, that a conditionally univariate algorithm constitutes a viable choice if a researcher is interested in quickly determining a set of gene expression levels that can serve as markers for disease. We show that the classification performance of logistic regression is not very different from that of more sophisticated algorithms that have been applied in previous studies, and that the gene selection in the logistic regression algorithm is reasonable in both cases. Furthermore, the algorithm is simple, its theoretical basis is well established, and our user-friendly implementation is now freely available on the internet, serving as a benchmarking tool for the development of new algorithms.
Keywords: Microarray, Logistic regression, Variable selection, Classification, Clustering, Principal components analysis
Review history: Received 28 February 2003, Revised 1 April 2003, Accepted 16 January 2004, Available online 14 May 2004.
Paper URL: https://doi.org/10.1016/j.artmed.2004.01.011