A data mining approach based on machine learning techniques to classify biological sequences

Authors:

Abstract

In molecular biology, biological macromolecules such as deoxyribonucleic acids (DNA) and proteins are coded by strings called ‘primary structures’. For a long time, biologists have gathered these primary structures in large databases. Now they focus on analyzing them in order to extract useful knowledge, and data mining approaches can help reach this goal. In this paper, we present a data mining approach based on machine learning techniques to classify biological sequences. Our approach proceeds in four steps. (1) First, we construct the set of discriminant substrings, called the discriminant descriptor (DD), associated with each family of primary structures. This construction uses an adaptation of the Karp, Miller and Rosenberg (KMR) algorithm. (2) Second, we use the DDs constructed in the first step to code the families of primary structures as a table of examples vs. attributes, called a ‘context’. (3) Third, we extract knowledge from the context constructed in the second step and represent it as production rules, using an incremental production rules approach. (4) Finally, we use the obtained production rules to classify primary structures.
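The four steps in the abstract can be sketched as a toy pipeline. This is only an illustration under strong assumptions and not the authors' method. It enumerates substrings naively instead of using the adapted KMR algorithm. It treats a substring as discriminant only if it occurs in every sequence of its family and in no sequence of any other family. It replaces the incremental production rules approach with trivial one-attribute rules of the form "IF substring present THEN family". All function names, the length bounds, and the example sequences are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple


def substrings(seq: str, min_len: int, max_len: int) -> Set[str]:
    """All substrings of seq whose length lies in [min_len, max_len]."""
    return {seq[i:i + k]
            for k in range(min_len, max_len + 1)
            for i in range(len(seq) - k + 1)}


def discriminant_descriptors(families: Dict[str, List[str]],
                             min_len: int = 2,
                             max_len: int = 3) -> Dict[str, Set[str]]:
    """Step 1 (naive stand-in for KMR): one DD per family.

    Assumption: a substring is discriminant if it occurs in every
    sequence of the family and in no sequence of the other families.
    """
    per_seq = {name: [substrings(s, min_len, max_len) for s in seqs]
               for name, seqs in families.items()}
    dds: Dict[str, Set[str]] = {}
    for name, sets in per_seq.items():
        common = set.intersection(*sets)
        others = set().union(*(s for other, ss in per_seq.items()
                               if other != name for s in ss))
        dds[name] = common - others
    return dds


def build_context(families: Dict[str, List[str]],
                  dds: Dict[str, Set[str]]
                  ) -> Tuple[List[str], List[Tuple[List[int], str]]]:
    """Step 2: binary table of examples (sequences) vs. attributes (substrings)."""
    attributes = sorted(set().union(*dds.values()))
    rows = [([int(a in seq) for a in attributes], name)
            for name, seqs in families.items() for seq in seqs]
    return attributes, rows


def extract_rules(attributes: List[str],
                  rows: List[Tuple[List[int], str]]) -> Dict[str, str]:
    """Step 3 (simplified): keep 'IF attribute present THEN family'
    only when every example having the attribute shares one family."""
    rules: Dict[str, str] = {}
    for j, attr in enumerate(attributes):
        labels = {label for values, label in rows if values[j]}
        if len(labels) == 1:
            rules[attr] = labels.pop()
    return rules


def classify(seq: str, rules: Dict[str, str]) -> Optional[str]:
    """Step 4: majority vote over the rules that fire; None if no rule fires."""
    votes: Dict[str, int] = {}
    for attr, label in rules.items():
        if attr in seq:
            votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get) if votes else None


# Hypothetical toy families of DNA-like strings.
families = {"A": ["ACGTAC", "TTACGA"], "B": ["GGCTGG", "CTGGCA"]}
dds = discriminant_descriptors(families)
attributes, rows = build_context(families, dds)
rules = extract_rules(attributes, rows)
print(classify("GTACGT", rules))
```

A threshold-based notion of discriminance (for example, occurring in most rather than all sequences of a family) would be a natural relaxation, but the abstract does not specify one, so the strict version is used here.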

Keywords: Classification, Data mining, Machine learning, Production rules, Biological sequences, Discriminant substrings, Primary structures

Article history: Received 15 March 2001, Revised 16 April 2001, Accepted 31 May 2001, Available online 23 February 2002.

Paper URL: https://doi.org/10.1016/S0950-7051(01)00143-5