Automatic generation of agents for collecting hidden Web pages for data extraction
Abstract
As the Web grows, more and more data is published dynamically, for example in legacy databases accessed through an HTML form (the so-called hidden Web). In such situations, integrating this data relies increasingly on the fast generation of agents that can automatically fetch pages for further processing. As a result, there is a growing need for tools that help users generate such agents. In this paper, we describe a method for automatically generating agents to collect hidden Web pages. This method uses a pre-existing data repository to identify the contents of these pages and takes advantage of patterns found among Web sites to identify the navigation paths to follow. To demonstrate the accuracy of our method, we discuss the results of a number of experiments carried out with sites from different domains.
Keywords: Collecting agents, Hidden Web, Navigation patterns
Publication history: Available online 6 November 2003.
Paper URL: https://doi.org/10.1016/j.datak.2003.10.003