The method of N-grams in large-scale clustering of DNA texts
Abstract
This paper addresses techniques for clustering texts by comparing their vocabularies of N-grams. In contrast to the regular N-gram approach, the proposed method counts imperfect occurrences of N-grams in a text, that is, occurrences that match an N-gram up to a given number of mismatches. We demonstrate that this approach substantially improves the resolving power of the N-gram method for DNA texts. Additionally, we discuss a scheme for jointly using different types of clustering techniques to verify the quality of the partition.
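The idea of counting imperfect occurrences can be illustrated with a minimal sketch. It assumes that a "mismatch" means a differing position between an N-gram and a text window of the same length (Hamming distance), and that two texts are compared by the L1 distance between their normalized count vectors. Both choices, along with the function names `imperfect_count`, `spectrum`, and `l1_distance`, are illustrative assumptions and are not taken from the paper.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def imperfect_count(text, word, max_mismatches):
    """Count windows of `text` that match `word` with at most
    `max_mismatches` differing positions (assumed mismatch model)."""
    n = len(word)
    return sum(
        1
        for i in range(len(text) - n + 1)
        if hamming(text[i:i + n], word) <= max_mismatches
    )


def spectrum(text, words, max_mismatches):
    """Normalized vector of imperfect-occurrence counts over `words`."""
    counts = [imperfect_count(text, w, max_mismatches) for w in words]
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def l1_distance(p, q):
    """L1 distance between two spectra (an assumed dissimilarity measure)."""
    return sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical usage: all 3-grams over the DNA alphabet, one mismatch allowed.
words = ["".join(p) for p in product("ACGT", repeat=3)]
s1 = spectrum("ACGTACCAGGTTACGA", words, max_mismatches=1)
s2 = spectrum("TTGACCATGACCGTAA", words, max_mismatches=1)
dissimilarity = l1_distance(s1, s2)
```

With `max_mismatches=0` the sketch reduces to the regular N-gram count. The pairwise dissimilarities could then be passed to any standard clustering procedure.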
Keywords: N-grams, Strings mismatching, Clustering, Genome comparisons, Compositional spectra
Review history: Received 30 July 2004, Revised 2 May 2005, Accepted 2 May 2005, Available online 11 July 2005.
Paper URL: https://doi.org/10.1016/j.patcog.2005.05.002