Combining preference- and content-based approaches for improving document clustering effectiveness

Abstract

E-commerce and knowledge management applications generate and consume tremendous amounts of online information that is typically available as textual documents. To facilitate subsequent access to and use of these documents, the efficient and effective management of the ever-increasing volume of documents is essential to both organizations and individuals. Document management practices suggest the popularity of using categories (e.g., folders) for organizing, archiving, and accessing documents. Document clustering represents an appealing approach to enable organizations or individuals to create and maintain document categories automatically. Existing document clustering techniques usually group documents on the basis of their textual content similarity. However, such content-based approaches operate at the lexical level and suffer greatly from the word mismatch problem, in which related documents share few or no terms. Therefore, this study aims to address this problem by exploiting users’ document grouping preferences, as exhibited in those individuals’ folder sets, to support document clustering. Specifically, we propose a hybrid document clustering technique that combines preference- and content-based approaches. Using a traditional content-based and a preference/content switching document clustering technique as performance benchmarks, our empirical evaluation results show that the proposed hybrid technique improves the clustering effectiveness measured by both cluster precision and cluster recall.
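
The abstract does not say how the preference and content signals are combined, so the following Python sketch is only an illustration of the general idea, not the paper's method. It assumes a linear weighting (the hypothetical parameter `alpha`), a Jaccard overlap of folder membership as the preference signal, and average-link hierarchical agglomerative clustering; all function names are made up for this example.

```python
import math
from collections import Counter
from itertools import combinations


def content_similarity(a, b):
    """Cosine similarity between the term-frequency vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def preference_similarity(i, j, folders):
    """Assumed preference signal: Jaccard overlap of the folders holding i and j."""
    fi = {k for k, f in enumerate(folders) if i in f}
    fj = {k for k, f in enumerate(folders) if j in f}
    union = fi | fj
    return len(fi & fj) / len(union) if union else 0.0


def hybrid_similarity(i, j, docs, folders, alpha=0.5):
    """Assumed combination: alpha * content + (1 - alpha) * preference."""
    return alpha * content_similarity(docs[i], docs[j]) + (
        1 - alpha
    ) * preference_similarity(i, j, folders)


def cluster(docs, folders, k, alpha=0.5):
    """Average-link agglomerative clustering on the hybrid similarity."""
    sim = {
        (i, j): hybrid_similarity(i, j, docs, folders, alpha)
        for i, j in combinations(range(len(docs)), 2)
    }
    clusters = [[i] for i in range(len(docs))]

    def link(a, b):
        # Mean pairwise similarity between the members of two clusters.
        return sum(sim[(min(x, y), max(x, y))] for x in a for y in b) / (
            len(a) * len(b)
        )

    while len(clusters) > k:
        # Merge the most similar pair of clusters (p < q by construction).
        p, q = max(
            combinations(range(len(clusters)), 2),
            key=lambda pq: link(clusters[pq[0]], clusters[pq[1]]),
        )
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(c) for c in clusters]


# Toy data: documents 0 and 1 are related but share no words (word mismatch).
docs = [
    "car engine repair guide",
    "automobile motor maintenance manual",
    "stock market price report",
    "share market trading report",
]
# Each set is one user's folder, holding document indices.
folders = [{0, 1}, {2, 3}, {0, 1}]

print(content_similarity(docs[0], docs[1]))
print(cluster(docs, folders, k=2))
```

In this toy data the first two documents have no terms in common, so a purely lexical measure cannot relate them, while the folder evidence can. The weighting, the preference measure, and the linkage criterion are all free choices here and may differ from what the paper evaluates.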

Keywords: Document clustering, Hierarchical agglomerative clustering, Preference-based document clustering, Document management, Digital library

Article history: Received 3 May 2004, Accepted 16 June 2005, Available online 24 August 2005.

Official paper URL: https://doi.org/10.1016/j.ipm.2005.06.008