Finding peculiar compositions of two frequent strings with background texts
作者:Daisuke Ikeda, Einoshin Suzuki
摘要
We consider mining unusual patterns from a set \(T\) of target texts. A typical method outputs unusual patterns if their observed frequencies are far from their expectation estimated under an assumed probabilistic model. However, it is difficult for the method to deal with the zero frequency and thus it suffers from data sparseness. We employ another set \(B\) of background texts to define a composition \(xy\) to be peculiar if both \(x\) and \(y\) are more frequent in \(B\) than in \(T\) and conversely \(xy\) is more frequent in \(T\). \(xy\) is unusual because \(x\) and \(y\) are infrequent in \(T\) while \(xy\) is unexpectedly frequent compared to \(xy\) in \(B\). To find frequent subpatterns and infrequent patterns simultaneously, we develop a fast algorithm using the suffix tree and show that it scales almost linearly under practical settings of parameters. Experiments using DNA sequences show that found peculiar compositions basically appear in rRNA while patterns found by an existing method seem not to relate to specific biological functions. We also show that discovered patterns have similar lengths at which the distribution of frequencies of fixed length substrings begins to skew. This fact explains why our method can find long peculiar compositions.
论文关键词:Algorithms, Exceptional pattern mining, Text mining, Bioinformatics, Genetic maps, Suffix trees
论文评审过程:
论文官网地址:https://doi.org/10.1007/s10115-013-0688-9