Optimizing Recursive Information Gathering Plans in EMERAC
作者:Subbarao Kambhampati, Eric Lambrecht, Ullas Nambiar, Zaiqing Nie, Gnanaprakasam Senthil
摘要
In this paper we describe two optimization techniques that are specially tailored for information gathering. The first is a greedy minimization algorithm that minimizes an information gathering plan by removing redundant and overlapping information sources without loss of completeness. We then discuss a set of heuristics that guide the greedy minimization algorithm so as to remove costlier information sources first. In contrast to previous work, our approach can handle recursive query plans that arise commonly in the presence of constrained sources. Second, we present a method for ordering the access to sources to reduce the execution cost. This problem differs significantly from the traditional database query optimization problem as sources on the Internet have a variety of access limitations and the execution cost in information gathering is affected both by network traffic and by the connection setup costs. Furthermore, because of the autonomous and decentralized nature of the Web, very little cost statistics about the sources may be available. In this paper, we propose a heuristic algorithm for ordering source calls that takes these constraints into account. Specifically, our algorithm takes both access costs and traffic costs into account, and is able to operate with very coarse statistics about sources (i.e., without depending on full source statistics). Finally, we will discuss implementation and empirical evaluation of these methods in Emerac, our prototype information gathering system.
论文关键词:data integration, information gathering, Web and databases, query optimization
论文评审过程:
论文官网地址:https://doi.org/10.1023/B:JIIS.0000012467.66268.9e