Ensemble learning for protein multiplex subcellular localization prediction based on weighted KNN with different features

作者:Shanping Qiao, Baoqiang Yan, Jing Li

摘要

As an important attribute of proteins, protein subcellular location(s) can provide valuable information about their functions. Determining protein subcellular locations using experimental methods are usually expensive and time-consuming. Over the years, a variety of computational approaches have been developed to predict protein subcellular locations based on knowledge of known protein locations. However, the problem is inherently hard, especially for proteins that can exist at multiple subcellular locations. Further studies are still in great need in this area. In this paper, we propose an ensemble learning framework that utilizes a modified Weighted K-Nearest Neighbors (WKNN) as the basic learning algorithm. Two different types of features are considered and extracted from training data, which are based on protein amino acid compositions (Amphiphilic Pseudo Amino Acid Composition, or AmPseAAC) and protein sequence similarities (Protein Similarity Measure, or PSM), respectively. Two individual classifiers are trained separately based on these two types of features and each assigns a probability distribution over different locations to a query protein. Based on the outputs of the two base classifiers, a novel ensemble strategy named Maximized Probability on Label (MPoL) is proposed. The strategy produces a final set of protein locations for each protein by integrating prediction results of the base classifiers through an optimization procedure. To measure the prediction quality of the proposed approach, two different types of evaluation metrics, example-based metrics and label-based metrics, are used. To evaluate the performance of our approach objectively, we compare its results with those predicted by another popular method named iLoc-Animal on a benchmark dataset through cross-validation. Results show that in terms of absolute true success rate on multi-location prediction, MPoL has achieved much better results than iLoc-Animal. It implies that the proposed method has some potential to solve a diverse set of multi-label learning problems.

论文关键词:Ensemble learning, Feature fusion, Protein subcellular location prediction, Weighted k-nearest neighbors

论文评审过程:

论文官网地址:https://doi.org/10.1007/s10489-017-1029-6