Query directed clustering

Authors: Daniel Crabtree, Xiaoying Gao, Peter Andreae

Abstract

This paper identifies the conditions under which web page clustering algorithms are effective and the problems that cause them to fail. It then presents Query Directed Clustering (QDC), a web page clustering algorithm that produces higher-quality clusterings than other clustering algorithms for easy ambiguous queries, while performing at least as well as other clustering algorithms on queries for which clustering is not well suited. QDC has five key innovations: a new cluster quality guide based on the relationship between clusters and the query; an improved cluster merging method that considers both cluster overlap and cluster description similarity; a new cluster splitting method that addresses the cluster chaining (drifting) problem; an improved heuristic for selecting good clusters; and a new method that improves the clusters by ranking the pages in each cluster. Our experiments evaluate QDC both quantitatively and qualitatively and show that QDC significantly improves clustering performance, while being substantially more efficient than existing approaches.

Keywords: Web page clustering, Data mining, Clustering

Official paper URL: https://doi.org/10.1007/s10115-012-0564-z