Efficiently tracing clusters over high-dimensional on-line data streams

Authors:

Highlights:

Abstract

A good clustering method should scale flexibly with both the number of dimensions and the size of a data set. This paper proposes a method of efficiently tracing the clusters of a high-dimensional on-line data stream. While the one-dimensional clusters of each dimension are traced independently, a technique similar to frequent itemset mining is employed to find the set of multi-dimensional clusters. By finding a set of one-dimensional clusters that frequently co-occur, it is possible to trace a multi-dimensional rectangular space whose range is defined collectively by those one-dimensional clusters. To trace such candidates over a multi-dimensional on-line data stream, a cluster-statistics tree (CS-tree) is proposed in this paper. A node at depth k (k ⩽ d) in the CS-tree corresponds to a k-dimensional rectangular space. Each node keeps track of the density of data elements in its corresponding rectangular space. Only a node corresponding to a dense rectangular space is allowed to have a child node. The scalability with respect to the number of dimensions is greatly enhanced, at the cost of a slight loss in the accuracy of the identified clusters.
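The tree structure described in the abstract can be illustrated with a small sketch. This is not the paper's algorithm. It is a simplified stand-in under stated assumptions: each dimension is cut into fixed-width intervals that play the role of one-dimensional clusters, density is the plain fraction of all elements seen so far, and there is no decay of old data. The names `CSTreeSketch`, `width`, and `min_density` are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One rectangular space; count is the number of elements seen inside it."""
    count: int = 0
    children: dict = field(default_factory=dict)  # (dim, interval) -> Node


class CSTreeSketch:
    """Illustrative only: fixed-width intervals stand in for 1-D clusters."""

    def __init__(self, n_dims, width=1.0, min_density=0.2):
        self.n_dims = n_dims
        self.width = width              # assumed interval width per dimension
        self.min_density = min_density  # assumed density threshold
        self.total = 0                  # elements seen so far
        self.root = Node()

    def _is_dense(self, node):
        return node.count / self.total >= self.min_density

    def insert(self, point):
        if len(point) != self.n_dims:
            raise ValueError("point has the wrong number of dimensions")
        self.total += 1
        # Map each coordinate to the interval it falls into.
        cells = [(d, int(x // self.width)) for d, x in enumerate(point)]
        self._update(self.root, cells, 0)

    def _update(self, node, cells, start):
        # Extend the current space by one later dimension at a time,
        # in the style of itemset enumeration.
        for i in range(start, len(cells)):
            child = node.children.setdefault(cells[i], Node())
            child.count += 1
            # Only a dense space may grow children. A child created late
            # misses earlier elements, so its count is an underestimate.
            if self._is_dense(child):
                self._update(child, cells, i + 1)

    def dense_regions(self):
        """Return (path, count) for every dense space reachable via dense parents."""
        found = []

        def walk(node, path):
            for cell, child in node.children.items():
                if self._is_dense(child):
                    found.append((path + (cell,), child.count))
                    walk(child, path + (cell,))

        walk(self.root, ())
        return found


if __name__ == "__main__":
    tree = CSTreeSketch(n_dims=2, width=1.0, min_density=0.5)
    for p in [(0.5, 0.5)] * 8 + [(5.5, 7.5)] * 2:
        tree.insert(p)
    for path, count in tree.dense_regions():
        print(path, count)
```

In this sketch a node at depth k is identified by a path of k (dimension, interval) pairs, so it corresponds to a k-dimensional rectangle. Sparse spaces never get children, which is what keeps the tree from growing with every combination of dimensions.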

Keywords: Data stream, Clustering, High-dimensional data, Grid-based clustering, CS-tree

Review history: Received 29 March 2007, Revised 11 July 2008, Accepted 14 November 2008, Available online 1 January 2009.

Official paper URL: https://doi.org/10.1016/j.datak.2008.11.004