A data mining approach to discover unusual folding regions in genome sequences

Authors:

Abstract

Numerous experiments and analyses of RNA structures have revealed that distinct local structures correlate closely with biological function. In this study, we present a data mining approach to discover such unusual folding regions (UFRs) in genome sequences. Our approach is a three-step procedure. In the first step, the extent to which a local structure in a genomic sequence differs from a random folding is evaluated by two z-scores of the local segment: the significance score (SIGSCR) and the stability score (STBSCR). The two scores are computed by sliding a fixed-size window along the sequence, one base at a time, from the start to the end position. In the second step, based on the theory of the non-central Student's t distribution, we derive a linearly transformed non-central Student's t distribution (LTNSTD) to describe the distribution of the SIGSCR and STBSCR values computed along the sequence. In the third step, we extract the significant UFRs, that is, the segments whose SIGSCR and/or STBSCR are greater or less than a given threshold calculated from the derived LTNSTD. Our data mining approach is successfully applied to the complete genome of Mycoplasma genitalium (M. gen) and discovers these statistical extremes in the genome. Comparison with the two scores computed from randomly shuffled sequences of the entire M. gen genome demonstrates that the UFRs in the M. gen sequence do not occur by chance. These UFRs may imply an important structural role encoded in their sequence information.
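The three-step procedure can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions made only for illustration: `toy_energy` is a hypothetical stand-in for a real folding-energy calculation, the z-score compares each window against shuffles of that same window (the abstract does not give the exact definitions of SIGSCR and STBSCR), and the cutoffs passed to `extract_extremes` are supplied by the caller rather than derived from a fitted LTNSTD.

```python
import random
import statistics

# Hypothetical pairing energies; a real analysis would call an RNA/DNA
# folding program to obtain the energy of each segment.
PAIR_ENERGY = {("G", "C"): -3.0, ("C", "G"): -3.0,
               ("A", "T"): -2.0, ("T", "A"): -2.0}


def toy_energy(segment):
    """Toy stand-in for folding energy: pair base i with its mirror
    position and sum the energies of complementary pairs."""
    n = len(segment)
    return sum(PAIR_ENERGY.get((segment[i], segment[n - 1 - i]), 0.0)
               for i in range(n // 2))


def z_score(segment, rng, n_shuffles=50):
    """z-score of the segment's energy against shuffled copies of itself
    (an assumed definition, used here only to show the idea)."""
    observed = toy_energy(segment)
    bases = list(segment)
    shuffled_energies = []
    for _ in range(n_shuffles):
        rng.shuffle(bases)
        shuffled_energies.append(toy_energy(bases))
    sd = statistics.pstdev(shuffled_energies)
    if sd == 0:
        return 0.0  # every shuffle scored the same; no spread to compare against
    return (observed - statistics.mean(shuffled_energies)) / sd


def scan(sequence, window, rng, n_shuffles=50):
    """Step 1: slide a fixed-size window along the sequence, one base at a time."""
    return [z_score(sequence[i:i + window], rng, n_shuffles)
            for i in range(len(sequence) - window + 1)]


def extract_extremes(scores, low, high):
    """Step 3: keep window start positions whose score lies beyond either cutoff.
    In the paper the cutoffs come from the fitted LTNSTD (step 2); here they
    are plain arguments."""
    return [(i, s) for i, s in enumerate(scores) if s < low or s > high]
```

A call such as `extract_extremes(scan(genome, 100, random.Random(0)), -2.0, 2.0)` would list candidate windows, where `genome`, the window size of 100, and the cutoffs of ±2.0 are placeholder values rather than settings taken from the paper.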

Keywords: Data mining, Statistical model, RNA/DNA folding, UFR

Review history: Received 15 March 2001, Revised 4 May 2001, Accepted 31 May 2001, Available online 23 February 2002.

Official paper URL: https://doi.org/10.1016/S0950-7051(01)00146-0