Long distance bigram models applied to word clustering

Abstract

Two novel word clustering techniques are proposed that employ long distance bigram language models. The first technique is built on a hierarchical clustering algorithm and, at each cluster merger, minimizes the sum of the Mahalanobis distances of all words from the centroid of the class created by the merger. The second technique relies on probabilistic latent semantic analysis (PLSA). Interpolated long distance bigrams are then considered in the context of both clustering techniques. Experiments conducted on the English Gigaword corpus (second edition) demonstrate that: (1) long distance bigrams, when employed in the two clustering techniques under study, yield word clusters of better quality than the baseline bigrams; (2) interpolated long distance bigrams outperform long distance bigrams in the same respect; (3) long distance bigrams perform better than bigrams that incorporate trigger-pairs selected at various distances; and (4) the best word clustering is achieved by PLSA employing interpolated long distance bigrams. Both proposed techniques outperform spectral clustering based on k-means. To assess the quality of the created clusters objectively, relative cluster validity indices are estimated, and the average cluster sense precision, the average cluster sense recall, and the F-measure are computed using ground truth extracted from WordNet.
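
As an illustration of the underlying idea, the following minimal Python sketch estimates distance-d bigram probabilities (a word conditioned on the word d positions before it) and mixes several distances by linear interpolation. It is not the authors' implementation: the function names, the maximum-likelihood estimates without smoothing, and the uniform interpolation weights are all assumptions made for this example, and the abstract does not say how the weights are actually chosen.

```python
from collections import Counter, defaultdict

def distance_bigram_counts(tokens, d):
    """Count pairs (tokens[i - d], tokens[i]), i.e. words d positions apart."""
    return Counter(zip(tokens, tokens[d:]))

def distance_bigram_probs(tokens, d):
    """Unsmoothed maximum-likelihood estimate of P_d(w | h),
    where h occurs d positions before w."""
    pair_counts = distance_bigram_counts(tokens, d)
    history_counts = Counter()
    for (h, _w), c in pair_counts.items():
        history_counts[h] += c
    return {(h, w): c / history_counts[h] for (h, w), c in pair_counts.items()}

def interpolated_probs(tokens, max_distance, weights=None):
    """Linear interpolation of distance-1..max_distance bigram estimates.
    Uniform weights are an assumption of this sketch."""
    if weights is None:
        weights = [1.0 / max_distance] * max_distance
    mixed = defaultdict(float)
    for d, lam in zip(range(1, max_distance + 1), weights):
        for pair, p in distance_bigram_probs(tokens, d).items():
            mixed[pair] += lam * p
    return dict(mixed)

# Toy corpus, purely illustrative.
tokens = "the cat sat on the mat".split()
p1 = distance_bigram_probs(tokens, 1)   # ordinary (baseline) bigrams
p2 = distance_bigram_probs(tokens, 2)   # distance-2 bigrams
mix = interpolated_probs(tokens, 2)     # interpolation over d = 1, 2
```

With d = 1 the estimate reduces to the baseline bigram model; larger d captures co-occurrence at longer range, which is the information the clustering techniques above operate on.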

Keywords: Word clustering, Language modeling, Distance bigrams, Probabilistic latent semantic analysis, Relative cluster validity indices, Trigger-pairs, Spectral clustering, Cluster dispersion, Cluster sense precision, Cluster sense recall, WordNet

Article history: Received 2 October 2009, Revised 23 February 2010, Accepted 2 July 2010, Available online 16 July 2010.

Official paper URL: https://doi.org/10.1016/j.patcog.2010.07.006