SW: A weighted space division framework for imbalanced problems with label noise
Abstract
Imbalanced data learning is a ubiquitous challenge in data mining and machine learning. Noise, which is pervasive and hard to avoid, can further aggravate the resulting performance degradation. The synthetic minority oversampling technique (SMOTE) and its variants have been proposed to address class imbalance. These variants typically either emphasize specific regions of the sample space or combine SMOTE with different noise filters, but they introduce additional parameters that are difficult to optimize or rely on specific noise filters. Furthermore, SMOTE-based methods randomly select nearest-neighbor samples and interpolate randomly to synthesize new samples, without considering how chaotic the surrounding sample space is. This study proposes a framework called SW, which performs weighted sampling based on the computed chaos of the sample space. It is a general, robust, and adaptive framework that copes with noisy imbalanced datasets and can be combined with various oversampling algorithms to improve their performance. In the SW framework, a complete random forest (CRF) is introduced to divide the sample space and adaptively assign weights, which are used to distinguish and filter noisy and outlier samples. When synthesizing a new sample, the SW framework selects the seed sample's neighbors and calculates an informed position using the derived weights, bringing the new sample closer to the safe area. Experimental results on 16 benchmark datasets and eight classic classifiers with eight pairs of representative oversampling algorithms demonstrate the SW framework's effectiveness. The SW framework yields significant improvements in high-noise situations. In particular, SW-kmeans-SMOTE improved by approximately 5% on average across all metrics. Code and framework are available at https://github.com/dream-lm/SW_framework.
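To illustrate the contrast the abstract draws between random interpolation and weight-informed placement, here is a minimal sketch. It is not the SW framework's actual formula, which the abstract does not give. The function names, the per-sample weights `w_seed` and `w_neighbor`, and the rule "interpolation gap = neighbor weight / total weight" are all assumptions chosen only to show how weights could pull a synthetic sample toward the safer endpoint.

```python
import random


def smote_interpolate(seed, neighbor, rng):
    """Classic SMOTE step: a uniformly random point on the segment seed -> neighbor."""
    gap = rng.random()  # random gap in [0, 1)
    return [s + gap * (n - s) for s, n in zip(seed, neighbor)]


def weighted_interpolate(seed, neighbor, w_seed, w_neighbor):
    """Hypothetical weighted step (an assumption, not the paper's rule).

    The gap is fixed by the weights, so the new point lands nearer the
    endpoint with the larger weight (treated here as the 'safer' sample).
    """
    total = w_seed + w_neighbor
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    gap = w_neighbor / total
    return [s + gap * (n - s) for s, n in zip(seed, neighbor)]


rng = random.Random(0)
seed = [0.0, 0.0]
neighbor = [1.0, 2.0]

# Random placement anywhere along the segment.
random_point = smote_interpolate(seed, neighbor, rng)

# The seed has the larger (illustrative) weight, so the point stays near the seed.
weighted_point = weighted_interpolate(seed, neighbor, w_seed=3.0, w_neighbor=1.0)
```

In this sketch the weights are supplied by hand. In the framework described above they would come from the CRF-based division of the sample space.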
Keywords: Label noise, Weight, Oversampling, Imbalance, Classification
Review history: Received 16 October 2021, Revised 6 June 2022, Accepted 7 June 2022, Available online 13 June 2022, Version of Record 24 June 2022.
Official paper URL: https://doi.org/10.1016/j.knosys.2022.109233