Linear manifold clustering in high dimensional spaces by stochastic search

作者:

Highlights:

摘要

Classical clustering algorithms are based on the concept that a cluster center is a single point. Clusters which are not compact around a single point are not candidates for classical clustering approaches. In this paper we present a new clustering paradigm in which the cluster center is a linear manifold. Clusters are groups of points compact around a linear manifold. A linear manifold of dimension 0 is a point. So clustering around a center point is a special case of linear manifold clustering. Linear manifold clustering (LMCLUS) identifies subsets of the data which are embedded in arbitrary oriented lower dimensional linear manifolds. Minimal subsets of points are repeatedly sampled to construct trial linear manifolds of various dimensions. Histograms of the distances of the points to each trial manifold are computed. The sampling corresponding to the histogram having the best separation between a mode near zero and the rest is selected and the data points are partitioned on the basis of the best separation. The repeated sampling then continues recursively on each block of the partitioned data. A broad evaluation of some 100 experiments over real and synthetic data sets demonstrates the general superiority of this algorithm over any of the competing algorithms in terms of accuracy and computation time. Its expected computational time is linearly proportional to the data set dimension and data set size. Its accuracy ranges from near 0.90 to 0.99 depending on the experiment and is generally much higher than the accuracy of the competing clustering algorithms.

论文关键词:Clustering,Linear manifold,Subspace,Histogram thresholding,Data exploration,Random projections

论文评审过程:Received 15 December 2005, Revised 18 December 2006, Accepted 4 January 2007, Available online 2 February 2007.

论文官网地址:https://doi.org/10.1016/j.patcog.2007.01.020