Bridging Local and Global Data Cleansing: Identifying Class Noise in Large, Distributed Data Datasets

Authors: XINGQUAN ZHU, XINDONG WU, QIJUN CHEN

Abstract

To cleanse mislabeled examples from a training dataset for efficient and effective induction, most existing approaches adopt a major-set-oriented scheme: the training dataset is separated into two parts (a major set and a minor set), and the classifiers learned from the major set are used to identify noise in the minor set. The obvious drawbacks of such a scheme are twofold: (1) when the underlying data volume keeps growing, it becomes either physically impossible or time-consuming to load the major set into memory for inductive learning; and (2) for multiple or distributed datasets, it can be either technically infeasible or factitiously forbidden to download data from other sites (for security or privacy reasons). Therefore, these approaches have severe limitations in conducting effective global data cleansing from large, distributed datasets.
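The major-set-oriented scheme described above can be illustrated with a minimal sketch. It is not the authors' method: the nearest-centroid learner is only a stand-in for whatever inductive learner is used, and the names `train_centroids`, `predict`, `flag_suspected_noise`, and the toy `data` are hypothetical. Each fold takes a turn as the minor set, a classifier is learned from the remaining folds (the major set), and minor-set examples whose labels disagree with the classifier are flagged as suspected class noise.

```python
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], str]  # (feature vector, class label)


def train_centroids(examples: List[Example]) -> dict:
    """Learn one mean vector per class (a stand-in for any inductive learner)."""
    sums, counts = {}, {}
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def predict(centroids: dict, features: Sequence[float]) -> str:
    """Return the class whose centroid is closest in squared Euclidean distance."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(centroids[c], features))
    return min(centroids, key=dist)


def flag_suspected_noise(data: List[Example], n_folds: int = 3) -> List[int]:
    """Treat each fold in turn as the minor set; the other folds form the major set."""
    suspects = []
    for fold in range(n_folds):
        minor_idx = [i for i in range(len(data)) if i % n_folds == fold]
        major = [data[i] for i in range(len(data)) if i % n_folds != fold]
        model = train_centroids(major)  # requires the whole major set in memory
        for i in minor_idx:
            features, label = data[i]
            if predict(model, features) != label:
                suspects.append(i)
    return sorted(suspects)


# Toy data (assumed for illustration): index 6 lies in the "a" cluster but is labeled "b".
data = [
    ([0.0, 0.1], "a"), ([0.2, 0.0], "a"), ([0.1, 0.2], "a"),
    ([5.0, 5.1], "b"), ([5.2, 4.9], "b"), ([4.9, 5.0], "b"),
    ([0.1, 0.1], "b"),
    ([0.0, 0.2], "a"), ([5.1, 5.0], "b"),
]
print(flag_suspected_noise(data))
```

Note that `train_centroids` needs every major-set example available at one site and in memory, which is exactly the requirement the abstract identifies as impractical for large or distributed data.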

Keywords: data cleansing, class noise, machine learning

Paper URL: https://doi.org/10.1007/s10618-005-0012-8