Intelligent topic selection for low-cost information retrieval evaluation: A new perspective on deep vs. shallow judging

Highlights:

• We revisited the shallow vs. deep judging trade-off using our intelligent topic selection method, considering a wider range of factors affecting this trade-off than prior work. Our extensive experiments yielded the following findings.

• Shallow judging is preferable to deep judging if topics are selected randomly, confirming the findings of prior work. However, when topics are selected intelligently, deep judging often achieves greater evaluation reliability than shallow judging for the same evaluation budget.

• As the topic generation cost increases, deep judging becomes more cost-effective than shallow judging in optimizing the evaluation budget.

• Assuming that judging speed increases as more documents are judged for the same topic, the increased judging speed has a significant effect on evaluation reliability, suggesting that it should be another parameter to consider in the deep vs. shallow judging trade-off.

• Assuming that short topic generation times reduce the quality of topics, and thereby the consistency of relevance judgments, it is better to invest a portion of the evaluation budget in increasing the quality of topics than to collect more judgments for low-quality topics.

Abstract

• We proposed a novel learning-to-rank based topic selection method to more intelligently design topic sets in test collections. Our method can be used to select the best topics from a topic pool in order to maximize the reliability of evaluation while reducing the needed human judging effort.
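
The budget arithmetic behind the deep vs. shallow trade-off can be illustrated with a minimal sketch. The additive cost model, the function names, and all numbers below are illustrative assumptions, not the model or data used in the paper: each topic is assumed to cost a fixed creation time plus a constant time per judged document. The sketch only shows how the share of a fixed budget consumed by topic creation grows faster under shallow judging as the topic generation cost rises; it says nothing about evaluation reliability itself.

```python
# Hypothetical additive cost model (an assumption for illustration only):
#   cost of one topic = topic_cost_s + depth * judge_cost_s   (all in seconds)

def affordable_topics(budget_s, topic_cost_s, judge_cost_s, depth):
    """Number of whole topics that fit in the budget at a given judging depth."""
    per_topic = topic_cost_s + depth * judge_cost_s
    return budget_s // per_topic


def topic_creation_share(budget_s, topic_cost_s, judge_cost_s, depth):
    """Fraction of the total budget spent on creating topics rather than judging."""
    n_topics = affordable_topics(budget_s, topic_cost_s, judge_cost_s, depth)
    return n_topics * topic_cost_s / budget_s


if __name__ == "__main__":
    BUDGET_S = 360_000         # assumed total budget: 100 hours
    JUDGE_COST_S = 30          # assumed time per relevance judgment
    SHALLOW, DEEP = 20, 200    # assumed judging depths (documents per topic)

    for topic_cost_s in (300, 1800):  # cheap vs. expensive topic creation
        for name, depth in (("shallow", SHALLOW), ("deep", DEEP)):
            n = affordable_topics(BUDGET_S, topic_cost_s, JUDGE_COST_S, depth)
            share = topic_creation_share(BUDGET_S, topic_cost_s, JUDGE_COST_S, depth)
            print(f"topic cost {topic_cost_s:>4}s, {name:<7}: "
                  f"{n:>3} topics, {n * depth:>5} judgments, "
                  f"{share:.0%} of budget on topic creation")
```

Under these assumed numbers, raising the topic creation time sharply reduces how many judgments a shallow design can afford, while a deep design is much less affected. This matches the direction of the finding above on topic generation cost, though the paper's own cost model may differ.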