Multi-label classifier for protein sequence using heuristic-based deep convolution neural network
作者:Vikas Chauhan, Aruna Tiwari, Niranjan Joshi, Sahaj Khandelwal
摘要
Deep learning techniques are found very useful to classify sequential data in recent times. The protein sequences belong to the functional classes based on the structure of their sequences. The annotation task of protein sequences into corresponding functional classes is multi-label in nature. The primary structure of protein contains a notable amount of vast data compared to the other secondary, tertiary, and quaternary structures. The clustering-based techniques require expert domain knowledge from the extensive data samples. Traditional methods use the n-gram features of amino acids while ignoring the relationship of motifs and amino acid sequence. This paper proposes an efficient method to classify the proteins into their functional classes using a convolution neural network based on heuristic rules. The proposed approach works on the primary structure of protein sequences which considers the relationship among motifs and amino acids. The proposed approach also takes into account the amino acid locations in the protein sequence. The proposed approach considers the affinity information between amino acids and motifs. Along with achieving high performance in the classification of protein sequences, we propose a heuristic approach to improve the precision and recall of the individual functional classes. The proposed heuristic approach improves the performance and handles the data imbalance problem. The proposed approach is compared with other competitive approaches, and our approach provides better performance metrics in terms of precision, recall, AUC, and subset accuracy. The greatest challenge with multi-label classification is to handle the data imbalance, which appears due to variance in frequencies of the labels in the data. This data imbalance is dealt with weight modulation in the loss function to influence the learning process.
论文关键词:Multi-label classification, Genomics, Heuristic, Primary protein structure
论文评审过程:
论文官网地址:https://doi.org/10.1007/s10489-021-02529-6