A novel progressively undersampling method based on the density peaks sequence for imbalanced data

Authors:

Highlights:

Abstract

Undersampling is a widely used resampling technique for imbalanced data. Traditional undersampling techniques typically reduce the majority classes to the same scale as the minority classes and therefore tend to discard valuable information, so many strategies, such as clustering, have been developed to mitigate this. However, two essential problems remain open: which instances should be extracted during undersampling, and how many. To alleviate these two problems, we propose a novel undersampling method for imbalanced data. It exploits a sequence of density peaks to progressively extract instances from the majority classes. Specifically, two factors are introduced to measure the importance of each majority-class instance. With these two factors, we generate a sampling sequence that orders instances by their importance for classification. Furthermore, the optimal undersampling size of the majority classes is determined automatically by progressively extracting the most important instances from the sequence. To evaluate the effectiveness of the proposed method, a series of experiments comparing it with six popular undersampling methods was conducted on 40 public benchmark datasets. The experimental results show that the proposed method outperforms these state-of-the-art undersampling methods.
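The abstract does not state the two importance factors or the stopping rule, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes the two factors are the local density and the distance to the nearest higher-density point from classic density-peaks clustering, ranks majority instances by their product, and picks the prefix length that maximizes a user-supplied score. The names `density_peaks_sequence`, `progressive_undersample`, `dc`, and `evaluate` are hypothetical and are not taken from the paper.

```python
import numpy as np

def density_peaks_sequence(X, dc):
    """Order rows of X by rho * delta, highest first.

    ASSUMPTION: rho (local density) and delta (distance to the nearest
    point of higher density) stand in for the paper's two factors.
    """
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))          # pairwise distances
    # Gaussian-kernel density; subtract 1 to drop each point's self-term.
    rho = np.exp(-(dist / dc) ** 2).sum(axis=1) - 1.0
    delta = np.empty(len(X))
    for i in range(len(X)):
        higher = rho > rho[i]
        # The densest point has no denser neighbour: use its max distance.
        delta[i] = dist[i, higher].min() if higher.any() else dist[i].max()
    return np.argsort(-(rho * delta), kind="stable")

def progressive_undersample(order, evaluate, step=1):
    """Grow a prefix of `order` and keep the best-scoring prefix.

    ASSUMPTION: `evaluate` is a hypothetical callback that maps the
    selected majority indices to a score (higher is better), e.g. a
    validation metric of a classifier trained on the reduced data.
    """
    best_k, best_score = 0, -np.inf
    for k in range(step, len(order) + 1, step):
        score = evaluate(order[:k])
        if score > best_score:
            best_k, best_score = k, score
    return order[:best_k]

# Toy majority class: two tight clusters centred at indices 0 and 5.
offsets = np.array([[0, 0], [0.1, 0], [0, 0.1], [-0.1, 0], [0, -0.1]])
X_maj = np.vstack([offsets, offsets + 5.0])
order = density_peaks_sequence(X_maj, dc=0.5)

# Stand-in score that peaks at three retained instances.
toy_score = lambda idx: -abs(len(idx) - 3)
kept = progressive_undersample(order, toy_score)
```

In this toy setting the two cluster centres lead the sequence, so even a very small sample covers both dense regions of the majority class. In practice `toy_score` would be replaced by a real validation measure suited to imbalanced data.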

Keywords: Progressive undersampling, Density peaks sequence, Importance degree, Optimal undersampling size, Imbalanced data

Review history: Received 3 August 2020, Revised 13 December 2020, Accepted 15 December 2020, Available online 27 December 2020, Version of Record 27 December 2020.

Official paper URL: https://doi.org/10.1016/j.knosys.2020.106689