Using Web structure and summarisation techniques for Web content mining

作者：

Highlights：

•

摘要

The dynamic nature and size of the Internet can result in difficulty finding relevant information. Most users typically express their information need via short queries to search engines and they often have to physically sift through the search results based on relevance ranking set by the search engines, making the process of relevance judgement time-consuming. In this paper, we describe a novel representation technique which makes use of the Web structure together with summarisation techniques to better represent knowledge in actual Web Documents. We named the proposed technique as Semantic Virtual Document (SVD). We will discuss how the proposed SVD can be used together with a suitable clustering algorithm to achieve an automatic content-based categorization of similar Web Documents. The auto-categorization facility as well as a “Tree-like” Graphical User Interface (GUI) for post-retrieval document browsing enhances the relevance judgement process for Internet users. Furthermore, we will introduce how our cluster-biased automatic query expansion technique can be used to overcome the ambiguity of short queries typically given by users. We will outline our experimental design to evaluate the effectiveness of the proposed SVD for representation and present a prototype called iSEARCH (Intelligent SEarch And Review of Cluster Hierarchy) for Web content mining. Our results confirm, quantify and extend previous research using Web structure and summarisation techniques, introducing novel techniques for knowledge representation to enhance Web content mining.

论文关键词：Knowledge representation of Web Documents,Web structure,Summarisation,Web content mining,Content-based automatic Web Document clustering

论文评审过程：Received 9 July 2003, Accepted 19 August 2004, Available online 7 October 2004.

论文官网地址：https://doi.org/10.1016/j.ipm.2004.08.003