Sampling and Subsampling for Cluster Analysis in Data Mining: With Applications to Sky Survey Data

Authors: David M. Rocke, Jian Dai

Abstract

This paper describes a clustering method for unsupervised classification of objects in large data sets. The new methodology combines the mixture likelihood approach with a sampling and subsampling strategy in order to cluster large data sets efficiently. This sampling strategy can be applied to a large variety of data mining methods to allow them to be used on very large data sets. The method is applied to the problem of automated star/galaxy classification for digital sky data and is tested using a sample from the Digitized Palomar Sky Survey (DPOSS) data. The method is quick and reliable and produces classifications comparable to previous work on these data using supervised clustering.
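
As a rough illustration of the general idea (fit a mixture model on a random sample, then classify the full data set with the fitted model), here is a minimal sketch. It is not the authors' algorithm: the abstract does not say how subsampling is used, so that step is omitted. The one-dimensional two-component Gaussian mixture, the crude initialisation, and all function names (`fit_mixture_em`, `classify`, `cluster_by_sampling`) are assumptions made for this example.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_mixture_em(sample, n_iter=50):
    """Fit a two-component 1-D Gaussian mixture to `sample` by EM (illustrative)."""
    n = len(sample)
    overall_mean = sum(sample) / n
    overall_var = max(sum((x - overall_mean) ** 2 for x in sample) / n, 1e-6)
    # Crude initialisation (an assumption): component means at the sample extremes.
    means = [min(sample), max(sample)]
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in sample:
            p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = max(p[0] + p[1], 1e-300)  # guard against underflow
            resp.append((p[0] / total, p[1] / total))
        # M-step: re-estimate weights, means, and variances.
        for k in range(2):
            nk = max(sum(r[k] for r in resp), 1e-300)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, sample)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, sample)) / nk
            variances[k] = max(var, 1e-6)  # keep variances strictly positive
    return weights, means, variances


def classify(x, params):
    """Assign x to the component with the larger weighted density."""
    weights, means, variances = params
    scores = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
    return 0 if scores[0] >= scores[1] else 1


def cluster_by_sampling(data, sample_size, seed=0):
    """Fit the mixture on a random sample only, then label every point in `data`."""
    rng = random.Random(seed)
    sample = rng.sample(data, min(sample_size, len(data)))
    params = fit_mixture_em(sample)
    labels = [classify(x, params) for x in data]
    return params, labels
```

The costly iterative fit touches only `sample_size` points, while the full data set is visited once for classification; that is the efficiency argument the abstract makes, shown here in its simplest form.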

Keywords: clustering algorithm, mixture likelihood, sampling, star/galaxy classification

Paper URL: https://doi.org/10.1023/A:1022497517599