A novel focused crawler combining Web space evolution and domain ontology

Abstract

In many fields, collecting topic-relevant Web resources is crucial. As a vertical search method, the focused crawler has received great attention in recent years. Currently, most focused crawlers consider multiple evaluation factors for hyperlinks and use a weighted sum to compute the priorities of unvisited hyperlinks. However, suitable weighting coefficients are hard to determine, and unsuitable values may even cause the crawler to deviate seriously from the topic. To overcome this issue, this article builds a multi-objective optimization model based on Web text and link structure and designs a crawler framework called Web space evolution (WSE), in which a hyperlink bank with a gradually increasing radius is introduced to extend the crawler's search scope in Web space. To improve the uniformity and diversity of hyperlinks, a nearest and farthest candidate solution method is combined with fast non-dominated sorting to choose Pareto-optimal solutions (hyperlinks). A domain ontology based on formal concept analysis is applied to establish the topic model. By incorporating the WSE and the domain ontology into focused crawling, a novel focused crawler called FCWSEO is proposed to collect topic-relevant webpages. Experimental results on the rainstorm disaster domain show that FCWSEO outperforms other focused crawler strategies in terms of the quantity and quality of retrieved relevant webpages.
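To make the contrast with weighted-sum ranking concrete, the following is a minimal sketch of fast non-dominated sorting over candidate hyperlinks. It is not the FCWSEO implementation: the two objectives (a text-relevance score and a link-structure score, both maximized), the sample values, and all function names are assumptions chosen for illustration, and the nearest and farthest candidate solution step is not modeled.

```python
from typing import List, Tuple

# Hypothetical objective vector per hyperlink: (text relevance, link score).
# Both objectives are assumed to be maximized.
Scores = Tuple[float, float]


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is no worse than b on every objective and better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated_fronts(points: List[Scores]) -> List[List[int]]:
    """Group point indices into Pareto fronts, best front first."""
    n = len(points)
    beaten = [[] for _ in range(n)]  # beaten[i]: indices that point i dominates
    dom_count = [0] * n              # dom_count[i]: how many points dominate i
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if dominates(points[i], points[j]):
                beaten[i].append(j)
            elif dominates(points[j], points[i]):
                dom_count[i] += 1

    fronts: List[List[int]] = []
    current = [i for i in range(n) if dom_count[i] == 0]
    while current:
        fronts.append(current)
        nxt = []
        for i in current:
            for j in beaten[i]:
                # Removing front member i frees j from one dominator.
                dom_count[j] -= 1
                if dom_count[j] == 0:
                    nxt.append(j)
        current = sorted(nxt)
    return fronts


# Illustrative scores for five unvisited hyperlinks (made-up values).
links = [(0.9, 0.2), (0.2, 0.9), (0.5, 0.5), (0.4, 0.4), (0.1, 0.1)]
fronts = non_dominated_fronts(links)
print(fronts)
```

The first front holds the hyperlinks that no other candidate beats on both objectives at once, so no weighting coefficients need to be chosen to rank them.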

Keywords: Focused crawler, Web space evolution, Multi-objective optimization, Pareto optimal, Ontology

Article history: Received 20 June 2021, Revised 19 February 2022, Accepted 22 February 2022, Available online 28 February 2022, Version of Record 15 March 2022.

Official paper URL: https://doi.org/10.1016/j.knosys.2022.108495