Fuzzy entropy and fuzzy support-based boosting random forests for imbalanced data

Authors: Mingxue Jiang, Youlong Yang, Haiquan Qiu

Abstract

Datasets with skewed class distributions pose difficulties for pattern classification algorithms. Most undersampling methods consider only the imbalance ratio and rarely account for the distribution of the original dataset. Moreover, many algorithms separate the resampling of imbalanced data from classifier training, which can discard important information and degrade classifier performance. To address these problems, this paper proposes a boosting random forest based on fuzzy entropy and fuzzy support (FESBoost). The proposed algorithm consists of two main parts: static undersampling and training of the ensemble classifier. First, an attenuation function and the shared k-nearest-neighbor algorithm are used to construct a global class entropy, based on which the region occupied by the majority-class samples is divided into a safe area and a boundary area. Second, the density peak clustering algorithm (DPCA) is used to select representative samples from the safe area; this step constitutes the static undersampling. Finally, the ensemble classifier is trained under the boosting framework. Because the dataset is still not balanced after static undersampling, the data are undersampled again before each boosting iteration, based on the global class entropy and the average class support; the number of samples removed depends on the iteration index and the imbalance ratio. FESBoost therefore combines static and dynamic resampling: static resampling reduces the imbalance ratio, the overlap between classes, and the cost of classifier training, while dynamic resampling updates the selected majority samples according to the data distribution and the likelihood of misclassification. The proposed algorithm is evaluated on 9 synthetic datasets and 34 KEEL datasets and compared with seven existing algorithms; the results show that it achieves better generalization performance than the compared methods.
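To make the two-stage workflow concrete, the sketch below outlines a simplified FESBoost-style pipeline in Python. It is an illustration under loose assumptions, not the authors' implementation: the per-sample entropy is approximated from the class mix of plain k-nearest neighbors (standing in for the paper's attenuation-weighted shared-kNN global class entropy), the DPCA-based selection of safe-area representatives is replaced by random subsampling, and the fuzzy-support-driven dynamic undersampling is approximated by sampling majority points in proportion to their boosting weights. All function names, thresholds, and the per-round sample-count schedule are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.tree import DecisionTreeClassifier

def knn_class_entropy(X, y, k=5):
    """Per-sample class entropy estimated from the k nearest neighbors.
    A rough stand-in for the paper's shared-kNN global class entropy."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    ent = np.zeros(len(X))
    for i, neigh in enumerate(idx[:, 1:]):        # drop the point itself
        p = np.mean(y[neigh] == y[i])             # fraction of same-class neighbors
        for q in (p, 1.0 - p):
            if q > 0:
                ent[i] -= q * np.log2(q)
    return ent

def fesboost_sketch(X, y, n_rounds=10, k=5, entropy_threshold=0.5, seed=0):
    """Simplified FESBoost-style pipeline for binary labels in {0, 1}:
    static undersampling of 'safe' majority samples, then a boosting loop
    that re-undersamples the retained majority pool each round."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=int)
    rng = np.random.default_rng(seed)
    maj_label = np.bincount(y).argmax()
    maj, mino = np.where(y == maj_label)[0], np.where(y != maj_label)[0]

    # Static step: split majority region into boundary (high entropy) and
    # safe (low entropy); keep all boundary points plus a random subset of
    # safe points (random sampling replaces DPCA representative selection).
    ent = knn_class_entropy(X, y, k)
    boundary = maj[ent[maj] >= entropy_threshold]
    safe = maj[ent[maj] < entropy_threshold]
    n_safe_keep = min(len(safe), 2 * len(mino))
    keep_safe = rng.choice(safe, size=n_safe_keep, replace=False)
    maj_pool = np.concatenate([boundary, keep_safe])

    # Dynamic step: AdaBoost-style loop; each round draws majority samples
    # from the pool with probability proportional to their current weights
    # (a proxy for "hard to classify" samples flagged by fuzzy support).
    models, alphas = [], []
    weights = np.ones(len(X)) / len(X)
    for t in range(n_rounds):
        n_take = min(len(maj_pool), len(mino) + t)     # illustrative schedule
        p = weights[maj_pool] / weights[maj_pool].sum()
        take = rng.choice(maj_pool, size=n_take, replace=False, p=p)
        train = np.concatenate([take, mino])
        clf = DecisionTreeClassifier(max_depth=3).fit(
            X[train], y[train], sample_weight=weights[train])
        pred = clf.predict(X)
        err = np.clip(weights[pred != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        weights *= np.exp(alpha * np.where(pred == y, -1.0, 1.0))
        weights /= weights.sum()
        models.append(clf)
        alphas.append(alpha)
    return models, alphas
```

In this sketch the static stage shrinks the majority class once before training, while the weighted draw inside the loop plays the role of the paper's entropy- and support-driven dynamic undersampling, concentrating later rounds on majority samples that remain hard to separate from the minority class.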

Keywords: Fuzzy entropy, Fuzzy support, Boosting, Imbalanced data, Undersampling

Paper URL: https://doi.org/10.1007/s10489-021-02620-y