Document ranking for variable bit-block compression signatures

Authors:

Highlights:

Abstract

The variable bit-block compression (VBC) signature is extended to support document ranking. Two extensions were evaluated experimentally: the weighted VBC (WVBC) scheme and the aggregate VBC (AVBC) scheme. For both, analytical bounds on the additional storage needed for the term frequencies were derived. The upper and lower bounds for WVBC signatures were better than the corresponding bounds for AVBC signatures. In general, these bounds are functions of the word size (in bits) of the term frequencies, so term frequencies were scaled to reduce the word size. Experiments showed that, for both WVBC and AVBC signatures, the additional storage cost is closer to the lower bound than to the upper bound. In addition, WVBC signatures were better than AVBC signatures in terms of both storage and retrieval speed. Logarithmic scaling was found to be significantly better than linear scaling when the agreement of the document ranking with the ranking obtained without scaling was measured using the Kendall rank-order correlation. If a ranking performance of 75% is acceptable, then the additional storage for the term frequencies is only 3.4% of that of all the indexed documents.
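
The comparison of scaling methods described above can be illustrated with a minimal sketch. It is not the paper's implementation: the two scaling formulas, the function names (`linear_scale`, `log_scale`, `kendall_tau_a`), the sample term frequencies, and the 2-bit word size are all assumptions chosen for illustration, and the tie-ignoring tau-a variant of the Kendall correlation is used because the abstract does not say which variant was applied.

```python
import math
from itertools import combinations

def linear_scale(tf, max_tf, bits):
    """Hypothetical linear scaling of a term frequency into `bits` bits."""
    levels = 2 ** bits - 1  # largest code that fits in the word
    return max(1, round(tf * levels / max_tf))

def log_scale(tf, max_tf, bits):
    """Hypothetical logarithmic scaling of a term frequency into `bits` bits."""
    levels = 2 ** bits - 1
    return max(1, round(math.log(1 + tf) / math.log(1 + max_tf) * levels))

def kendall_tau_a(xs, ys):
    """Kendall tau-a: (concordant - discordant) / total pairs; ties count as 0."""
    n = len(xs)
    total = 0
    for i, j in combinations(range(n), 2):
        product = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if product > 0:
            total += 1   # pair ordered the same way in both lists
        elif product < 0:
            total -= 1   # pair ordered in opposite ways
    return total / (n * (n - 1) / 2)

# Assumed data: one query term, so each document's score is its term frequency.
term_frequencies = [1, 2, 3, 5, 8, 40, 200]
max_tf = max(term_frequencies)
bits = 2  # assumed word size for the stored (scaled) frequencies

linear_codes = [linear_scale(tf, max_tf, bits) for tf in term_frequencies]
log_codes = [log_scale(tf, max_tf, bits) for tf in term_frequencies]

# Agreement of each scaled ranking with the unscaled ranking.
tau_linear = kendall_tau_a(term_frequencies, linear_codes)
tau_log = kendall_tau_a(term_frequencies, log_codes)

print("linear codes:", linear_codes, "tau:", tau_linear)
print("log codes:   ", log_codes, "tau:", tau_log)
```

On this skewed sample, linear scaling collapses all but the largest frequency into one code, whereas logarithmic scaling keeps one more distinction, so its ranking agrees more closely with the unscaled one. This only illustrates the kind of comparison involved and says nothing about the magnitudes reported in the paper.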

Keywords: Information retrieval, Signature file, Compression, Indexing and document ranking

Article history: Received 20 June 1999, Accepted 1 March 2000, Available online 6 December 2000.

Official paper URL: https://doi.org/10.1016/S0306-4573(00)00020-0