The improvement of breast cancer prognosis accuracy from integrated gene expression and clinical data

Authors:

Highlights:

Abstract

Accurately predicting breast cancer prognosis from high-throughput microarray data is a challenging task. Although many statistical methods and machine learning techniques have been applied to predict the prognostic outcome of breast cancer, they suffer from low prediction accuracy (usually below 70%). In this paper, we propose a method that combines a genetic algorithm with a support vector machine (GASVM) to significantly improve the prediction accuracy of breast cancer prognosis from gene expression profiles. To further improve classification performance, we also apply the GASVM model to combined clinical and microarray data. We evaluate the performance of the GASVM model on data from 97 breast cancer patients. Four gene selection schemes are compared: all genes (All), 70 correlation-selected genes (C70), 15 genes selected from the medical literature (R15), and 50 t-test-selected genes (T50). With the optimized parameter values identified by the GASVM model, the average predictive accuracy of our model approaches 95% for T50 and 90% for C70 or R15 across all four kernel functions when using integrated clinical and microarray data. This exceeds the roughly 70% average predictive accuracy of other machine learning methods. The results indicate that the GASVM model has the potential to better assist physicians in the prognosis of breast cancer through the use of both clinical and microarray data.
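To make the t-test-based gene selection (T50) concrete, the following is a minimal sketch of ranking genes by a two-sample t-statistic between outcome groups and keeping the top k. The abstract does not say which t-test variant was used, so Welch's unequal-variance form is an assumption here. The function names `t_statistic` and `select_top_genes`, the 0/1 label encoding, and the toy data are all hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math
import statistics


def t_statistic(a, b):
    """Welch's t-statistic for two independent samples (assumed variant)."""
    var_a = statistics.variance(a)
    var_b = statistics.variance(b)
    se = math.sqrt(var_a / len(a) + var_b / len(b))
    if se == 0.0:
        # Both groups are constant: treat the gene as uninformative.
        return 0.0
    return (statistics.mean(a) - statistics.mean(b)) / se


def select_top_genes(expression, labels, k):
    """Return the indices of the k genes with the largest |t| between classes.

    expression: list of rows, one per patient, each a list of gene values
    labels: list of 0/1 outcome labels, one per patient (hypothetical encoding)
    """
    n_genes = len(expression[0])
    scores = []
    for g in range(n_genes):
        group0 = [row[g] for row, y in zip(expression, labels) if y == 0]
        group1 = [row[g] for row, y in zip(expression, labels) if y == 1]
        scores.append((abs(t_statistic(group0, group1)), g))
    # Sort by descending |t|; break ties by gene index for determinism.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [g for _, g in scores[:k]]


# Toy data (invented): 6 patients, 3 genes. Only gene 0 separates the groups.
toy_expression = [
    [0.1, 1.0, 5.0],
    [0.2, 1.1, 4.0],
    [0.1, 0.9, 6.0],
    [0.9, 1.0, 5.5],
    [1.0, 1.1, 4.5],
    [1.1, 0.9, 5.0],
]
toy_labels = [0, 0, 0, 1, 1, 1]

top_genes = select_top_genes(toy_expression, toy_labels, k=1)
```

With the real data, `k=50` would correspond to the T50 set. In a cross-validated evaluation, the ranking would normally be computed on the training portion only, so that the held-out patients do not influence which genes are selected.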

Keywords: Breast cancer prognosis, Genetic algorithm, Support vector machine, Gene selection, Cancer classification, Gene expression, Clinical data

Review history: Available online 2 October 2011.

Official paper URL: https://doi.org/10.1016/j.eswa.2011.09.144