Facing the reality of data stream classification: coping with scarcity of labeled data

Authors: Mohammad M. Masud, Clay Woolam, Jing Gao, Latifur Khan, Jiawei Han, Kevin W. Hamlen, Nikunj C. Oza

Abstract

Recent approaches for classifying data streams are mostly based on supervised learning algorithms, which can only be trained with labeled data. Manual labeling of data is both costly and time-consuming. Therefore, in a real streaming environment where large volumes of data arrive at high speed, only a small fraction of the data can be labeled. Thus, only a limited number of instances will be available for training and updating the classification models, leading to poorly trained classifiers. We apply a novel technique to overcome this problem by utilizing both unlabeled and labeled instances to train and update the classification model. Each classification model is built as a collection of micro-clusters using semi-supervised clustering, and an ensemble of these models is used to classify unlabeled data. Empirical evaluation on both synthetic and real data reveals that our approach outperforms state-of-the-art stream classification algorithms that use ten times more labeled data than our approach.
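
To make the idea in the abstract concrete, the following is a minimal sketch, not the authors' algorithm. It assumes plain k-means as a stand-in for the semi-supervised clustering step, summarizes each cluster as a centroid plus counts of whatever labels its members carry, and classifies by majority vote over the ensemble. The names `build_model` and `classify`, the deterministic initialization, and the toy data are all hypothetical choices made for illustration.

```python
from collections import Counter
from math import dist


def build_model(points, labels, k, iters=10):
    """Cluster one data chunk and summarize it as micro-clusters.

    points: list of numeric tuples.
    labels: one entry per point, a class label or None if unlabeled.
    Assumption: plain k-means replaces the paper's semi-supervised clustering.
    """
    # Deterministic initialization: k evenly spaced points from the chunk.
    step = max(1, len(points) // k)
    centroids = [points[i] for i in range(0, len(points), step)][:k]
    dims = len(points[0])
    for _ in range(iters):
        # Assign every point (labeled or not) to its nearest centroid.
        groups = [[] for _ in centroids]
        for idx, p in enumerate(points):
            nearest = min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            groups[nearest].append(idx)
        # Move each centroid to the mean of its members; keep it if empty.
        centroids = [
            tuple(sum(points[i][d] for i in g) / len(g) for d in range(dims)) if g else c
            for g, c in zip(groups, centroids)
        ]
    # A micro-cluster here is (centroid, label counts of its labeled members).
    model = []
    for g, c in zip(groups, centroids):
        if g:
            counts = Counter(labels[i] for i in g if labels[i] is not None)
            model.append((c, counts))
    return model


def classify(ensemble, x):
    """Majority vote: each model votes with its nearest labeled micro-cluster."""
    votes = Counter()
    for model in ensemble:
        labeled = [(c, counts) for c, counts in model if counts]
        if not labeled:
            continue  # this model saw no labeled data, so it cannot vote
        _, counts = min(labeled, key=lambda mc: dist(x, mc[0]))
        votes[counts.most_common(1)[0][0]] += 1
    return votes.most_common(1)[0][0] if votes else None


# Two toy chunks; only one point per class is labeled in each.
chunk1 = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels1 = ["a", None, None, None, "b", None, None, None]
chunk2 = [(0.5, 0.5), (1, 2), (2, 1), (9, 9), (10, 9), (9, 10)]
labels2 = [None, "a", None, None, None, "b"]

ensemble = [build_model(chunk1, labels1, k=2), build_model(chunk2, labels2, k=2)]
print(classify(ensemble, (0.2, 0.3)))
print(classify(ensemble, (10, 10.5)))
```

In this sketch the unlabeled points still shape the centroids, which is how unlabeled data contributes to the model even though only a few instances carry labels. Handling concept drift, for example by replacing older models as new chunks arrive, is left out.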

Keywords: Data stream classification, Semi-supervised clustering, Ensemble classification, Concept drift

Paper URL: https://doi.org/10.1007/s10115-011-0447-8