Neural networks for full-scale protein sequence classification: Sequence encoding with singular value decomposition

Authors: Cathy Wu, Michael Berry, Sailaja Shivakumar, Jerry McLarty

Abstract

A neural network classification method has been developed as an alternative approach to the search/organization problem of protein sequence databases. The neural networks used are three-layered, feed-forward, back-propagation networks. The protein sequences are encoded into neural input vectors by a hashing method that counts occurrences of n-gram words. A new SVD (singular value decomposition) method, which compresses the long and sparse n-gram input vectors and captures semantics of n-gram words, has improved the generalization capability of the network. A full-scale protein classification system has been implemented on a Cray supercomputer to classify unknown sequences into 3311 PIR (Protein Identification Resource) superfamilies/families at a speed of less than 0.05 CPU seconds per sequence. The sensitivity is close to 90% overall, and approaches 100% for large superfamilies. The system could be used to reduce the database search time and is being used to help organize the PIR protein sequence database.
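
The encoding pipeline described in the abstract (count n-gram occurrences, then compress the long, sparse count vectors with SVD) can be illustrated with a minimal sketch. This is not the authors' implementation: the 20-letter amino acid alphabet, the choice of bigrams (n = 2), the reduced dimension k = 3, the toy sequences, and all function names are assumptions made for illustration, since the abstract does not specify them.

```python
from itertools import product

import numpy as np

# Assumed alphabet: the 20 standard amino acid one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ngram_index(n=2, alphabet=AMINO_ACIDS):
    """Map every possible n-gram over the alphabet to a fixed vector position."""
    return {"".join(p): i for i, p in enumerate(product(alphabet, repeat=n))}


def encode(seq, index, n=2):
    """Count n-gram occurrences in seq; n-grams with unknown letters are skipped."""
    vec = np.zeros(len(index))
    for i in range(len(seq) - n + 1):
        j = index.get(seq[i:i + n])
        if j is not None:
            vec[j] += 1
    return vec


def fit_svd(X, k):
    """Return the top-k right singular vectors of X (rows = training sequences)."""
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[:k].T  # shape: (number of n-grams, k)


def compress(vec, basis):
    """Project a sparse n-gram count vector onto the k-dimensional SVD basis."""
    return vec @ basis


# Hypothetical toy sequences, used only to exercise the code.
train = ["MKVLAAGIVGL", "MKVLSAGIVAL", "GSHMTEYKLVV", "GSHMTEYKLVA"]
index = ngram_index(2)
X = np.array([encode(s, index) for s in train])
basis = fit_svd(X, k=3)
reduced = compress(encode("MKVLAAGIVAL", index), basis)
print(X.shape, reduced.shape)
```

In this sketch the reduced vector, rather than the raw 400-dimensional count vector, would be the input to a feed-forward network. The basis is computed once from training sequences and reused to project unseen sequences.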

Keywords: neural networks, database search, protein classification, sequence analysis, superfamily, singular value decomposition (SVD)

Paper URL: https://doi.org/10.1007/BF00993384