Labeling clusters from both linguistic and statistical perspectives: A hybrid approach

Abstract

Document clustering automatically groups similar documents together. Cluster labels, which are usually edited manually, help users quickly grasp the main topic of the grouped documents, so high-quality labels are desirable in many user-facing applications. However, assigning labels manually is time-consuming and tedious. This paper proposes a hybrid approach to automate the labeling process. First, linguistic knowledge is used to ensure the readability and informativeness of candidate labels by exploring the dependencies between words. Second, a statistical generative model is proposed to select representative labels: it scores a label with respect to a cluster by estimating how likely the cluster is to have been generated by the label. The approach is evaluated on two data sets in English and Chinese. Experimental results show that it produces high-quality labels and outperforms existing state-of-the-art methods in both manual and automatic evaluations.
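The abstract does not specify the generative model, so the following is only a minimal sketch of the general idea of scoring a label by how likely it is to generate a cluster. It assumes a smoothed unigram model estimated from the documents that contain the label's words. The names `label_model` and `score`, the add-alpha smoothing, and the toy corpus are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def label_model(label, corpus, vocab, alpha=1.0):
    """Unigram model for a label (an assumption, not the paper's model).

    Counts are collected from every corpus document that contains all of
    the label's words, then smoothed with add-alpha over the vocabulary.
    """
    words = label.split()
    counts = Counter()
    for doc in corpus:
        if all(w in doc for w in words):
            counts.update(doc)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def score(label, cluster, corpus, vocab):
    """Average log-likelihood of the cluster's tokens under the label's model."""
    model = label_model(label, corpus, vocab)
    tokens = [t for doc in cluster for t in doc]
    return sum(math.log(model[t]) for t in tokens) / len(tokens)


# Toy data: four tokenized documents, the first two forming the cluster.
corpus = [
    ["stock", "market", "falls", "sharply"],
    ["stock", "market", "rally", "continues"],
    ["football", "match", "ends", "draw"],
    ["football", "league", "match", "tonight"],
]
cluster = corpus[:2]
vocab = sorted({t for doc in corpus for t in doc})
labels = ["stock market", "football match"]

# Rank candidate labels by how well each one "generates" the cluster.
ranked = sorted(
    labels,
    key=lambda lab: score(lab, cluster, corpus, vocab),
    reverse=True,
)
print(ranked[0])
```

In this toy setting the candidate whose contexts overlap the cluster's vocabulary receives the higher score. In the approach described above, candidate labels would come from the linguistic (dependency-based) step rather than from a hand-written list.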

Keywords: Cluster labeling, Dependency parsing, Context sensitive scoring, Rule learning, Phrase extraction

Article history: Received 17 February 2014, Revised 14 December 2014, Accepted 16 December 2014, Available online 26 December 2014.

Official paper URL: https://doi.org/10.1016/j.knosys.2014.12.019