Semantic smoothing for text clustering

作者:

Highlights:

摘要

In this paper we present a new semantic smoothing vector space kernel (S-VSM) for text documents clustering. In the suggested approach semantic relatedness between words is used to smooth the similarity and the representation of text documents. The basic hypothesis examined is that considering semantic relatedness between two text documents may improve the performance of the text document clustering task. For our experimental evaluation we analyze the performance of several semantic relatedness measures when embedded in the proposed (S-VSM) and present results with respect to different experimental conditions, such as: (i) the datasets used, (ii) the underlying knowledge sources of the utilized measures, and (iii) the clustering algorithms employed. To the best of our knowledge, the current study is the first to systematically compare, analyze and evaluate the impact of semantic smoothing in text clustering based on ‘wisdom of linguists’, e.g., WordNets, ‘wisdom of crowds’, e.g., Wikipedia, and ‘wisdom of corpora’, e.g., large text corpora represented with the traditional Bag of Words (BoW) model. Three semantic relatedness measures for text are considered; two knowledge-based (Omiotis [1] that uses WordNet, and WLM [2] that uses Wikipedia), and one corpus-based (PMI [3] trained on a semantically tagged SemCor version). For the comparison of different experimental conditions we use the BCubed F-Measure evaluation metric which satisfies all formal constraints of good quality cluster. The experimental results show that the clustering performance based on the S-VSM is better compared to the traditional VSM model and compares favorably against the standard GVSM kernel which uses word co-occurrences to compute the latent similarities between document terms.

论文关键词:Text clustering,Semantic smoothing kernels,WordNet,Wikipedia,Generalized vector space model kernel

论文评审过程:Received 26 February 2013, Revised 5 September 2013, Accepted 7 September 2013, Available online 24 September 2013.

论文官网地址:https://doi.org/10.1016/j.knosys.2013.09.012