EKNN: Ensemble classifier incorporating connectivity and density into kNN with application to cancer diagnosis

Abstract

In the microarray-based approach to automated cancer diagnosis, the traditional k-nearest neighbors (kNN) algorithm suffers from several difficulties: the large number of genes (high dimensionality of the feature space), many of them irrelevant (noise), relative to the small number of available samples, and the imbalance in the number of samples across the target classes. This research provides an ensemble classifier, based on decision models derived from kNN, that is applicable to problems characterized by small, imbalanced datasets. The proposed classification method is an ensemble of the traditional kNN algorithm and four novel classification models derived from it. The proposed models exploit the increase in density and connectivity using a K1-nearest neighbors table (KNN-table) created during the training phase. In the density model, an unseen sample u is classified as belonging to a class t if adding u to t yields the highest increase in density, i.e., u can replace more neighbors in the KNN-table for samples of class t than for any other class. In the other three connectivity models, the mean and standard deviation of the distribution of the average, minimum, and maximum distance to the K neighbors of the members of each class are computed in the training phase. The class t whose distribution u is most likely to belong to is chosen, i.e., the class for which adding u to its samples produces the least change to the distribution of the corresponding decision model. Combining the predictions of the four individual models with those of traditional kNN makes the decision space more discriminative. With the help of the KNN-table, which can be updated online in the training phase, improved performance is achieved compared to the traditional kNN algorithm with only a slight increase in classification time.
The proposed ensemble method achieves a significant increase in accuracy compared to the accuracy achieved by any of its base classifiers on the Kentridge, GDS3257, Notterman, Leukemia, and CNS datasets. The method is also compared to several existing ensemble methods and state-of-the-art techniques using different dimensionality reduction techniques on several standard datasets. The results show clear superiority of EKNN over several individual and ensemble classifiers, regardless of the choice of gene selection strategy.
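
The abstract describes the density and connectivity models only in words, so the following is a minimal Python sketch of one possible reading, not the authors' implementation. Several choices are assumptions made for illustration: the KNN-table stores each training sample's distances to its k nearest same-class neighbors, distances are Euclidean, "least change to the distribution" is approximated by a z-score against the class mean and standard deviation, and the five base predictions are combined by simple majority vote. All function names (`build_knn_table`, `density_vote`, `connectivity_vote`, `knn_vote`, `eknn_predict`) are hypothetical, and every class is assumed to contain at least k + 1 samples.

```python
import numpy as np
from collections import Counter

EPS = 1e-12  # guards against a zero standard deviation

def build_knn_table(X, y, k):
    """Assumed KNN-table: sorted distances from each sample to its k nearest same-class neighbours."""
    n = len(X)
    table = np.empty((n, k))
    for i in range(n):
        mates = np.flatnonzero((y == y[i]) & (np.arange(n) != i))
        table[i] = np.sort(np.linalg.norm(X[mates] - X[i], axis=1))[:k]
    return table

def density_vote(X, y, table, u):
    """Pick the class in which u would enter the most neighbour lists."""
    d = np.linalg.norm(X - u, axis=1)
    enters = d < table[:, -1]  # u is closer than the current k-th neighbour
    classes = np.unique(y)
    counts = [np.sum(enters[y == t]) for t in classes]
    return int(classes[int(np.argmax(counts))])

def connectivity_vote(X, y, table, u, stat):
    """Pick the class whose distribution of stat (mean/min/max neighbour distance) fits u best."""
    k = table.shape[1]
    classes = np.unique(y)
    scores = []
    for t in classes:
        member_stat = stat(table[y == t], axis=1)   # one value per class member
        mu, sigma = member_stat.mean(), member_stat.std()
        d_u = np.sort(np.linalg.norm(X[y == t] - u, axis=1))[:k]
        scores.append(abs(stat(d_u) - mu) / (sigma + EPS))  # z-score (assumption)
    return int(classes[int(np.argmin(scores))])

def knn_vote(X, y, u, k):
    """Traditional kNN: majority label among the k nearest training samples."""
    nearest = np.argsort(np.linalg.norm(X - u, axis=1))[:k]
    return Counter(y[nearest].tolist()).most_common(1)[0][0]

def eknn_predict(X, y, table, u, k):
    """Majority vote over kNN, the density model and three connectivity models (assumed rule)."""
    votes = [knn_vote(X, y, u, k), density_vote(X, y, table, u)]
    votes += [connectivity_vote(X, y, table, u, s) for s in (np.mean, np.min, np.max)]
    return Counter(votes).most_common(1)[0][0]

# Toy data: two identical, well-separated clusters (not gene-expression data).
shape = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2.5, 0.5]])
X = np.vstack([shape, shape + 10.0])
y = np.array([0] * 5 + [1] * 5)
table = build_knn_table(X, y, k=2)
print(eknn_predict(X, y, table, np.array([0.5, 0.5]), k=2))
```

In this sketch the table is built once at training time and only read at prediction time, which is consistent with the abstract's remark that classification time increases only slightly; how the table is updated online is not specified in the abstract and is not modeled here.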

Keywords: Cancer diagnosis, Ensemble classification, Gene expression analysis, Nearest neighbors

Review history: Received 14 December 2019, Revised 2 November 2020, Accepted 2 November 2020, Available online 8 November 2020, Version of Record 21 November 2020.

Paper URL: https://doi.org/10.1016/j.artmed.2020.101985