Sampling and Subsampling for Cluster Analysis in Data Mining: With Applications to Sky Survey Data
Authors: David M. Rocke, Jian Dai
Abstract
This paper describes a clustering method for unsupervised classification of objects in large data sets. The new methodology combines the mixture likelihood approach with a sampling and subsampling strategy in order to cluster large data sets efficiently. This sampling strategy can be applied to a large variety of data mining methods to allow them to be used on very large data sets. The method is applied to the problem of automated star/galaxy classification for digital sky data and is tested using a sample from the Digitized Palomar Sky Survey (DPOSS) data. The method is quick and reliable and produces classifications comparable to previous work on these data using supervised clustering.
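The abstract does not spell out the sampling and subsampling procedure, so the following is only a minimal sketch of the general idea it names: fit a mixture model by maximum likelihood on a random sample, then use the fitted model to classify every object in the full data set. Everything here is an illustrative assumption rather than the authors' algorithm: the one-dimensional Gaussian components, the quantile-based initialisation, the fixed number of EM iterations, and the hypothetical function names `fit_mixture_em`, `classify`, and `cluster_by_sampling`.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a 1-D normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_mixture_em(sample, k=2, iters=50):
    """Fit a k-component 1-D Gaussian mixture to `sample` by EM (illustrative)."""
    srt = sorted(sample)
    n = len(srt)
    # Assumed initialisation: means at evenly spaced quantiles, shared variance.
    means = [srt[int((j + 0.5) * n / k)] for j in range(k)]
    grand_mean = sum(srt) / n
    var0 = max(sum((x - grand_mean) ** 2 for x in srt) / n, 1e-6)
    variances = [var0] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in sample:
            p = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p) or 1e-300  # guard against underflow to zero
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an empty component unchanged
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, sample)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, sample)) / nj
            variances[j] = max(var_j, 1e-6)  # floor to avoid degenerate components
    return weights, means, variances


def classify(data, params):
    """Assign each point to the component with the largest weighted density."""
    weights, means, variances = params
    labels = []
    for x in data:
        p = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        labels.append(max(range(len(p)), key=p.__getitem__))
    return labels


def cluster_by_sampling(data, sample_size, k=2, seed=0):
    """Fit the mixture on a random sample only, then label the full data set."""
    rng = random.Random(seed)
    sample = rng.sample(data, min(sample_size, len(data)))
    params = fit_mixture_em(sample, k=k)
    return params, classify(data, params)
```

The design point is that the iterative EM fit touches only `sample_size` points, while the full data set is scanned once in `classify`. Real sky survey objects are described by several attributes, so a practical version would use multivariate components.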
Keywords: clustering algorithm, mixture likelihood, sampling, star/galaxy classification
Paper URL: https://doi.org/10.1023/A:1022497517599