Knowledge based collection selection for distributed information retrieval

Abstract

Recent years have seen a great deal of work on collection selection. Most collection selection methods use a central sample index (CSI), which consists of documents sampled from each collection, as the collection description. These methods have two limitations: they use ‘flat’ meaning representations that ignore the structure of and relationships among words in the CSI, and they compute a query-collection similarity metric that ignores the semantic distance between query words and indexed words. In this paper, we propose a knowledge based collection selection method (KBCS) to improve both the collection representation and the query-collection similarity metric. KBCS models a collection as a weighted entity set and applies a novel query-collection similarity metric to select the highest-scoring collections. Specifically, for collection representation, context- and structure-based measures are employed to weight the semantic distance between two entities extracted from the sampled documents of a collection. In addition, the novel query-collection similarity metric takes the entity weight, collection size, and other factors into account. To enrich the concepts contained in a query, DBpedia based query expansion is integrated. Finally, extensive experiments were conducted on a large webpage dataset, with DBpedia chosen as the graph knowledge base. Experimental results demonstrate the effectiveness of KBCS.
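
The idea of scoring collections as weighted entity sets can be illustrated with a toy sketch. This is not the KBCS metric itself, since the abstract does not give its formula. The scoring rule (summed weights of matched query entities, damped by the logarithm of collection size), the function names, and the sample data are all assumptions made for illustration only.

```python
import math

# Hypothetical collections: each is a weighted entity set plus a size.
# The entity names, weights, and sizes are invented for illustration.
collections = {
    "sports": {"weights": {"Football": 0.9, "FIFA": 0.6}, "size": 1000},
    "science": {"weights": {"Physics": 0.8, "FIFA": 0.1}, "size": 1000},
}


def score_collection(query_entities, entity_weights, collection_size):
    """Toy score (an assumption, not the paper's formula): sum the weights
    of query entities found in the collection, then damp by log size."""
    matched = sum(entity_weights.get(e, 0.0) for e in query_entities)
    return matched / math.log(2 + collection_size)


def select_collections(query_entities, collections, k=1):
    """Return the names of the k highest-scoring collections."""
    scored = [
        (score_collection(query_entities, c["weights"], c["size"]), name)
        for name, c in collections.items()
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


top = select_collections(["FIFA", "Football"], collections, k=1)
```

In this sketch the query entities are assumed to be already extracted and expanded; a real system would obtain them by entity linking and by query expansion against a knowledge base such as DBpedia.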

Keywords: Collection selection, Distributed information retrieval, Knowledge base, Query expansion

Review history: Received 1 April 2017, Revised 28 August 2017, Accepted 11 October 2017, Available online 23 October 2017, Version of Record 23 October 2017.

Official paper URL: https://doi.org/10.1016/j.ipm.2017.10.002