Indexing genomic sequence libraries

Abstract

This paper describes an extensible, open-source (GPL) data repository and retrieval system that supports fast, efficient, keyword-based retrieval of genomic sequences from multiple libraries; retrieved sequences are post-processed by FASTA, Smith–Waterman, and other analysis software. The application is implemented for Linux and is written in Mumps, C, and C++. Its supporting components include Berkeley DB, the Perl Compatible Regular Expressions (PCRE) library, GLADE, and tools such as FASTA, Smith–Waterman, and modules from EMBOSS. The package can quickly index data sets of up to 256 terabytes using a B-tree-based multi-dimensional data model. An example is presented that indexes the text of the full NCBI GenBank library.

Keywords: Bioinformatics, Sequence retrieval, Genomics, Information retrieval, Mumps

Review history: Received 5 April 2002, Accepted 8 September 2003, Available online 14 October 2003.

Official URL: https://doi.org/10.1016/j.ipm.2003.09.001