Document length normalization

Abstract

In the TREC collection—a large full-text experimental text collection with widely varying document lengths—we observe that the likelihood of a document being judged relevant by a user increases with the document length. We show that a retrieval strategy, such as the vector-space cosine match, that retrieves documents of different lengths with roughly equal chances, will not optimally retrieve useful documents from such a collection. We present a modified technique—pivoted cosine normalization—that attempts to match the likelihood of retrieving documents of all lengths to the likelihood of their relevance, and show that this technique yields significant improvements in retrieval effectiveness.
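
As a rough illustration of the idea, the sketch below computes a cosine normalization factor for three toy documents and then "pivots" it. The abstract does not state the formula, so the form used here, `(1 - slope) * pivot + slope * old_norm`, is an assumption taken from the commonly cited description of pivoted normalization. The function names, the toy term weights, the choice of the collection average as the pivot, and the slope value of 0.7 are all hypothetical and chosen only for illustration.

```python
import math


def cosine_norm(weights):
    """Euclidean length of a document's term-weight vector (the cosine normalization factor)."""
    return math.sqrt(sum(w * w for w in weights))


def pivoted_norm(old_norm, pivot, slope):
    """Tilt the old normalization factor around the pivot (assumed form, see lead-in).

    With slope < 1, documents whose old factor exceeds the pivot get a smaller
    factor (their scores go up), and documents below the pivot get a larger one.
    """
    return (1.0 - slope) * pivot + slope * old_norm


# Toy term-weight vectors: more terms stands in for a longer document.
docs = {
    "short": [1.0] * 2,
    "medium": [1.0] * 8,
    "long": [1.0] * 32,
}

old = {name: cosine_norm(w) for name, w in docs.items()}

# Assumption: use the average old normalization factor as the pivot.
pivot = sum(old.values()) / len(old)
slope = 0.7  # hypothetical value; in practice this would be tuned

new = {name: pivoted_norm(n, pivot, slope) for name, n in old.items()}

for name in docs:
    print(f"{name:>6}: cosine norm = {old[name]:.3f}, pivoted norm = {new[name]:.3f}")
```

Because a document's score is divided by its normalization factor, lowering the factor for long documents raises their chance of retrieval, which is the direction of correction the abstract argues for. A document whose old factor equals the pivot is left unchanged.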

Review history: Received 9 October 1995, Accepted 15 December 1995, Available online 19 February 1999.

Paper URL: https://doi.org/10.1016/0306-4573(96)00008-8