Deep Web adaptive crawling based on minimum executable pattern

作者:Jun Liu, Lu Jiang, Zhaohui Wu, Qinghua Zheng

摘要

The key to Deep Web Crawling is to submit valid input values to a query form and retrieve Deep Web content efficiently. In the literature, related work focus only on generic text boxes or entire query forms, causing the problem of “data islands” or inferior validity of query submission. This paper proposes the concept of Minimum Executable Pattern (MEP), a minimal combination of elements in a query form that can conduct a successful query, and then presents a MEPGeneration method and a MEP-based Deep Web adaptive crawling method. The query form is parsed and partitioned into MEP set, and then local-optimal queries are generated by choosing a MEP in the MEP set and a keyword vector of the MEP. Furthermore, the crawler can make a decision on its termination to balance the trade-off between high coverage of the content and resource consumption. The adoption of MEP is expected to improve the validity of query submission, and adaptive selection of multiple MEPs shows good effect for overcoming the problem of “data islands”. We present a set of experiments to validate the effectiveness of the proposed method. Experimental results show that our method outperforms the state of art methods in terms of query capability and applicability, and on average, it achieves good coverage by issuing only a few hundred queries.

论文关键词:Deep Web, Minimum executable pattern, Adaptive crawling, Deep Web surfacing

论文评审过程:

论文官网地址:https://doi.org/10.1007/s10844-010-0124-5