LR-SMOTE — An improved unbalanced data set oversampling based on K-means and SVM

Authors:

Abstract

Machine learning classification algorithms are now widely used, and one of the main problems they face is unbalanced data sets. Classification algorithms are not sensitive to class imbalance, which makes unbalanced data sets difficult to classify. The same problem arises in loose particle detection for sealed electronic components: the signals generated by internal components always outnumber the signals generated by loose particles, which easily leads to misclassification. To classify unbalanced data sets more accurately, this paper proposes the LR-SMOTE algorithm, which is based on the traditional SMOTE oversampling algorithm. LR-SMOTE places newly generated samples close to the sample center, which avoids generating outlier samples or changing the distribution of the data set. Experiments were carried out on four UCI public data sets and six self-built data sets. The unmodified data sets and the data sets balanced by the LR-SMOTE and SMOTE algorithms were classified with the random forest algorithm and the support vector machine algorithm. The experimental results show that LR-SMOTE performs better than SMOTE in terms of G-means value, F-measure value, and AUC.
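For readers unfamiliar with the evaluation metrics named in the abstract, the following is a minimal sketch (not taken from the paper) of how the G-mean (called "G-means" above) and the F-measure are commonly computed from binary confusion-matrix counts, with the minority class treated as positive. The function names and the example counts are hypothetical and chosen only for illustration.

```python
import math


def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity (minority recall) and specificity."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def f_measure(tp, fn, fp):
    """Harmonic mean of precision and recall for the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 10 minority samples, 90 majority samples.
# A classifier that always predicts the majority class is 90% accurate
# but never finds a minority sample.
always_majority = dict(tp=0, fn=10, tn=90, fp=0)
accuracy = (always_majority["tp"] + always_majority["tn"]) / 100
print(accuracy, g_mean(**always_majority))
```

Because the G-mean multiplies the per-class recalls, a classifier that ignores the minority class scores zero regardless of its accuracy, which is presumably why metrics of this kind, rather than plain accuracy, are reported for unbalanced data sets.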

Keywords: Unbalanced data sets, SMOTE, Loose particles signal, LR-SMOTE algorithm

Review history: Received 24 October 2019, Revised 1 March 2020, Accepted 29 March 2020, Available online 2 April 2020, Version of Record 16 April 2020.

Official paper URL: https://doi.org/10.1016/j.knosys.2020.105845