Online clustering of parallel data streams

Authors:

Abstract

In recent years, the management and processing of data streams has become a topic of active research in several fields of computer science, including distributed systems, database systems, and data mining. A data stream can roughly be thought of as a transient, continuously growing sequence of time-stamped data. In this paper, we consider the problem of clustering parallel streams of real-valued data, that is, continuously evolving time series. In other words, we are interested in grouping data streams whose evolution over time is similar in a specific sense. To maintain an up-to-date clustering structure, the incoming data must be analyzed online, tolerating no more than a constant time delay. For this purpose, we develop an efficient online version of the classical K-means clustering algorithm. The method's efficiency is mainly due to a scalable online transformation of the original data, which allows fast computation of approximate distances between streams.
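The abstract does not name the online transformation it relies on. As a rough illustration of the general idea only, the following sketch assumes a sliding-window discrete Fourier transform (DFT): each stream keeps the first `m` DFT coefficients of its last `w` values, updates them in O(m) time per new value, and distances are approximated from those coefficients. The names `SlidingDFT`, `approx_sq_dist`, `kmeans_step`, `w`, and `m` are hypothetical and not taken from the paper.

```python
import cmath
import math
from collections import deque


class SlidingDFT:
    """Keeps the first m DFT coefficients of the last w values of one stream.

    Assumption: a sliding-window DFT stands in for the paper's unnamed
    online transformation.
    """

    def __init__(self, w, m):
        assert 1 <= m <= w // 2, "keep only low frequencies below w/2"
        self.w, self.m = w, m
        self.window = deque([0.0] * w, maxlen=w)  # window starts as all zeros
        self.coeffs = [0j] * m
        # Rotation factors e^{2*pi*i*k/w}, one per retained frequency k.
        self.twiddle = [cmath.exp(2j * math.pi * k / w) for k in range(m)]

    def push(self, x_new):
        """Slide the window by one value in O(m) time."""
        x_old = self.window[0]
        self.window.append(x_new)  # maxlen drops x_old automatically
        self.coeffs = [t * (c - x_old + x_new)
                       for t, c in zip(self.twiddle, self.coeffs)]


def approx_sq_dist(a, b, w):
    """Lower bound on the squared Euclidean distance between two windows,
    computed from their first m DFT coefficients (Parseval's theorem)."""
    total = abs(a[0] - b[0]) ** 2
    # For real-valued data, frequencies k and w-k contribute equally.
    total += 2 * sum(abs(p - q) ** 2 for p, q in zip(a[1:], b[1:]))
    return total / w


def kmeans_step(points, centers, w):
    """One K-means iteration in coefficient space: assign, then re-average."""
    labels = [min(range(len(centers)),
                  key=lambda j: approx_sq_dist(p, centers[j], w))
              for p in points]
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            new_centers.append(old)  # keep an empty cluster's center unchanged
            continue
        new_centers.append([sum(col) / len(members) for col in zip(*members)])
    return labels, new_centers
```

In such a setup, every incoming value triggers one `push` per stream, and `kmeans_step` can be rerun periodically, starting from the previous centers, so that the clustering tracks the streams as they evolve. Because the distance is computed from only `m` coefficients, it underestimates the exact window distance, which is the price paid for constant-time updates.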

Keywords: Data mining, Clustering, Data streams, Fuzzy sets

Review history: Received 10 March 2005, Accepted 25 May 2005, Available online 24 June 2005.

Paper URL: https://doi.org/10.1016/j.datak.2005.05.009