How many clusters are best? - An experiment

Authors:

Highlights:

Abstract

This paper reports the results of a Monte Carlo study on the relative effectiveness of two internal indices in estimating the true number of clusters in multivariate data. The two indices are the Davies and Bouldin (1) index and a new modification of the Hubert Γ statistic (2). Data in d dimensions are clustered to create sequences of partitions. Estimates are based on plots of the indices as functions of the number of recovered clusters. Neither index uses a priori information. The effects of sample size, dimensionality, cluster spread, number of true clusters, and sampling window are examined. Clustered data are generated to assure a given number of distinct clusters. The degree of clustering in the data is verified by a separate Monte Carlo study based on the Jaccard and corrected Rand indices, which demonstrates the importance of correcting external indices for chance. The modified Hubert index, proposed here for the first time, is shown to perform better than the Davies-Bouldin index under all experimental conditions. Recovery of the true number of clusters improves as the number of true clusters decreases and as the number of dimensions increases. The sampling window has no effect. The complete link clustering method and a square error clustering method recognize the true number of clusters consistently better than the single link method. This study demonstrates the difficulty inherent in estimating the number of clusters.
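The idea of scoring a sequence of partitions with an internal index and reading off the number of clusters can be illustrated with a minimal Python sketch. It is not the paper's procedure: it assumes the common form of the Davies-Bouldin index with mean Euclidean scatter and Euclidean centroid distance, and it assumes a simple "pick the global minimum" rule in place of inspecting a plot. The function names `davies_bouldin` and `estimate_num_clusters`, the toy data, and the hand-made partitions are all hypothetical.

```python
from math import dist


def davies_bouldin(points, labels):
    """Davies-Bouldin index, assuming mean Euclidean scatter and
    Euclidean distance between centroids. Lower is better."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the index needs at least two clusters")

    # Centroid and within-cluster scatter for each cluster.
    centroids, scatter = {}, {}
    for label, members in clusters.items():
        centroid = tuple(sum(axis) / len(members) for axis in zip(*members))
        centroids[label] = centroid
        scatter[label] = sum(dist(p, centroid) for p in members) / len(members)

    # For each cluster, keep its largest scatter-to-separation ratio.
    # Assumes no two centroids coincide (otherwise division by zero).
    worst = []
    for i in clusters:
        ratios = [
            (scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        ]
        worst.append(max(ratios))
    return sum(worst) / len(worst)


def estimate_num_clusters(points, partitions):
    """Score each partition (dict: k -> labels) and return the k with the
    smallest index value, together with all scores (assumed stopping rule)."""
    scores = {k: davies_bouldin(points, labels) for k, labels in partitions.items()}
    return min(scores, key=scores.get), scores


# Toy data: two well-separated square-shaped groups in d = 2 dimensions.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 0), (10, 1), (11, 0), (11, 1)]

# Hand-made partitions standing in for the output of a clustering method.
partitions = {
    2: [0, 0, 0, 0, 1, 1, 1, 1],
    3: [0, 0, 0, 0, 1, 1, 2, 2],  # second group split in two
}

best_k, scores = estimate_num_clusters(points, partitions)
print(best_k, scores)
```

On this toy data the two-cluster partition scores lower than the three-cluster one, because splitting a compact group produces two clusters whose centroids are close relative to their scatter. Real data rarely gives such a clear minimum, which is consistent with the difficulty the abstract reports.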

Keywords: Cluster, Cluster analysis, Cluster validity, Internal index, External index, Stopping rule, Monte Carlo analysis

Article history: Received 8 December 1986; available online 19 May 2003.

Paper URL: https://doi.org/10.1016/0031-3203(87)90034-3