Histogram-based clustering of multiple data streams
Authors: Antonio Balzanella, Rosanna Verde
Abstract
This paper introduces a strategy for online clustering of multiple data streams. We assume that several sources record, over time, data about some physical phenomenon. Each source provides repeated measurements at such a high frequency that the full data cannot be stored on easy-to-access media; instead, the data are available only in batches. Our aim is to partition the sources (e.g. sensors) into homogeneous clusters by analysing the incoming data streams. The proposed strategy processes the incoming data batches independently: each batch is first summarized by histograms, and a local clustering of the histograms then provides a further summarization. To keep track of the proximities among the data streams over time, we use the local clustering outputs to update a proximity matrix. The final partition of the streams is obtained by clustering based on this proximity matrix. Through an application to real and simulated data, we show the effectiveness of our strategy in finding homogeneous groups of data stream sources.
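The pipeline described in the abstract (batch histograms, local clustering, proximity matrix update, final partition) can be illustrated with a minimal sketch. This is not the authors' implementation. Every function name and parameter here is hypothetical, and several choices are assumptions made for illustration only: equal-width normalized histograms, squared Euclidean distance between them, a plain k-means with farthest-point initialization as the local clustering, a proximity matrix that counts how often two streams fall into different local clusters, and a thresholded connected-components grouping as the final step.

```python
import random

def histogram(values, lo, hi, bins):
    """Equal-width histogram of one batch, as relative frequencies."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(max(int((v - lo) / width), 0), bins - 1)  # clamp outliers
        counts[idx] += 1
    return [c / len(values) for c in counts]

def sq_dist(a, b):
    """Squared Euclidean distance (assumed dissimilarity between histograms)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def local_clustering(hists, k, iters=10):
    """Plain k-means on one batch of histograms; returns one label per stream."""
    centers = [hists[0]]
    while len(centers) < k:  # farthest-point initialization (deterministic)
        centers.append(max(hists, key=lambda h: min(sq_dist(h, c) for c in centers)))
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(h, centers[c])) for h in hists]
        for c in range(k):
            members = [h for h, l in zip(hists, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def update_proximity(prox, labels):
    """Add 1 to prox[i][j] when streams i and j land in different local clusters."""
    for i in range(len(labels)):
        for j in range(len(labels)):
            if labels[i] != labels[j]:
                prox[i][j] += 1

def final_partition(prox, n_batches, threshold=0.5):
    """Group streams that were separated in fewer than `threshold` of the batches."""
    n = len(prox)
    labels = [-1] * n
    cluster = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = cluster
        stack = [start]
        while stack:  # connected components of the thresholded proximity graph
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and prox[i][j] / n_batches < threshold:
                    labels[j] = cluster
                    stack.append(j)
        cluster += 1
    return labels

def demo(n_batches=20, batch_size=200, seed=0):
    """Six simulated streams: three centred at 0 and three centred at 5."""
    rng = random.Random(seed)
    means = [0.0, 0.0, 0.0, 5.0, 5.0, 5.0]
    prox = [[0] * len(means) for _ in means]
    for _ in range(n_batches):
        # Each batch is summarized and clustered on its own, then discarded.
        hists = [histogram([rng.gauss(m, 1.0) for _ in range(batch_size)], -5.0, 10.0, 15)
                 for m in means]
        update_proximity(prox, local_clustering(hists, k=2))
    return final_partition(prox, n_batches)

if __name__ == "__main__":
    print(demo())
```

Only the proximity matrix persists between batches in this sketch, which mirrors the storage constraint stated in the abstract: raw measurements are never kept once their batch has been summarized. On the simulated streams in `demo`, the first three and last three streams should end up in two separate groups.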
Keywords: Data stream mining, Histogram data, Clustering, Sensor data streams
Paper URL: https://doi.org/10.1007/s10115-019-01350-5