SpectralCAT: Categorical spectral clustering of numerical and nominal data

作者：

Highlights：

•

摘要

Data clustering is a common technique for data analysis, which is used in many fields, including machine learning, data mining, customer segmentation, trend analysis, pattern recognition and image analysis. Although many clustering algorithms have been proposed, most of them deal with clustering of one data type (numerical or nominal) or with mix data type (numerical and nominal) and only few of them provide a generic method that clusters all types of data. It is required for most real-world applications data to handle both feature types and their mix. In this paper, we propose an automated technique, called SpectralCAT, for unsupervised clustering of high-dimensional data that contains numerical or nominal or mix of attributes. We suggest to automatically transform the high-dimensional input data into categorical values. This is done by discovering the optimal transformation according to the Calinski–Harabasz index for each feature and attribute in the dataset. Then, a method for spectral clustering via dimensionality reduction of the transformed data is applied. This is achieved by automatic non-linear transformations, which identify geometric patterns in the data, and find the connections among them while projecting them onto low-dimensional spaces. We compare our method to several clustering algorithms using 16 public datasets from different domains and types. The experiments demonstrate that our method outperforms in most cases these algorithms.

论文关键词：Spectral clustering,Diffusion Maps,Categorical data clustering,Dimensionality reduction

论文评审过程：Received 11 October 2010, Revised 2 July 2011, Accepted 3 July 2011, Available online 18 July 2011.

论文官网地址：https://doi.org/10.1016/j.patcog.2011.07.006