Performance standards and evaluations in IR test collections: Vector-space and other retrieval models

Abstract:

Low performance standards for each query and for the group of queries in 13 traditional and four TREC test collections have been computed. Predicted by the hypergeometric distribution, the standards represent the highest level of retrieval effectiveness attributable to chance. Operational levels of performance for vector-space, ad-hoc-feature-based, probabilistic, and other retrieval models have been compared to the standards. The effectiveness of these techniques in small, traditional test collections can be explained by retrieving a few more relevant documents for most queries than expected by chance; the effectiveness of retrieval techniques in the large TREC test collections can only be explained by retrieving many more relevant documents for most queries than expected by chance. The discrepancy between deviations from chance in traditional and TREC test collections is due to a decrease in performance standards for large test collections, not to an increase in operational performance. Retrieving a few more relevant documents than expected by chance leads to mediocre levels of performance; recall and precision are rarely greater than 0.50 for any retrieval strategy in any test collection. However, marginal improvements over chance expectations may be sufficient to initiate successful interactions between an end-user and the next generation of retrieval systems, in which relevance judgments will be automatically translated into progressively improving estimates of the capacity of terms and other features to discriminate between relevant and non-relevant documents. Realization of such systems would be enhanced by abandoning uninformative performance summaries and focusing on the effectiveness of individual queries and on improvements in that effectiveness.
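
The chance baseline the abstract refers to can be illustrated with a minimal sketch (Python, using scipy.stats.hypergeom; the parameter names N_docs, n_rel, and n_retr are hypothetical, chosen here for illustration). Under random retrieval, the number of relevant documents retrieved follows a hypergeometric distribution, so the expected recall and precision of a random run follow directly; the paper's per-query standards are derived from this distribution, and the sketch shows only the expected values, not the full standards.

    from scipy.stats import hypergeom

    def chance_standard(N_docs, n_rel, n_retr):
        """Expected (recall, precision) of a random retrieval run.
        N_docs: collection size; n_rel: documents judged relevant for
        the query; n_retr: documents retrieved. Names are illustrative."""
        rv = hypergeom(M=N_docs, n=n_rel, N=n_retr)
        expected_relevant = rv.mean()  # = n_retr * n_rel / N_docs
        return expected_relevant / n_rel, expected_relevant / n_retr

    # A small traditional collection versus a TREC-scale collection:
    print(chance_standard(1400, 30, 30))       # ~(0.021, 0.021): chance alone scores measurably
    print(chance_standard(750_000, 200, 200))  # ~(0.00027, 0.00027): chance standard near zero

The drop in the chance standard with collection size is the effect the abstract points to: the same operational performance deviates far more from chance in a TREC-scale collection than in a small traditional one.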

Publication history: Available online 11 June 1998.

DOI: https://doi.org/10.1016/S0306-4573(96)00044-1