Sampling scheme-based classification rule mining method using decision tree in big data environment

Abstract

Obtaining comprehensible classification rules is important in many real applications, such as data-driven decision-making and classification tasks. Decision-tree methods are powerful and popular tools for acquiring classification rules. However, they perform poorly in big-data scenarios, where the underlying data-processing methods also lack strong theoretical support. This study introduces sampling schemes, both with and without replacement, into the implementation of decision-tree methods. The resulting method, called sampling-based classification rule mining (SCRM), is designed to improve the adaptation and generalization ability of classification rules in a big-data environment. Sampling without replacement is used to refine classification rules through the concepts of conflict rules and coverage rules, while sampling with replacement is used to determine rule reliability; the reliability approximation property of classification rules is proved using the law of large numbers. The effectiveness of SCRM was evaluated and verified on seven UCI datasets. Theoretical analysis and experimental results show that SCRM is generic, has good classification ability, and improves the classification accuracy of the rules. Its main advantage is that it provides theoretical and methodological support for classification rule mining on big data, which makes it applicable to a wide range of tasks.
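The role of sampling with replacement in estimating rule reliability can be illustrated with a small sketch. This is not the authors' SCRM implementation: the toy data, the single threshold rule, and the function names `rule_reliability` and `bootstrap_reliability` are all assumptions made for illustration. Reliability is taken here to mean the fraction of records covered by a rule whose class matches the rule's predicted label, which is one plausible reading of the abstract.

```python
import random

def rule_reliability(rule, label, data):
    """Fraction of records covered by `rule` whose class equals `label`.

    Returns None when the rule covers no record in `data`.
    """
    covered = [y for x, y in data if rule(x)]
    if not covered:
        return None
    return sum(1 for y in covered if y == label) / len(covered)

def bootstrap_reliability(rule, label, data, n_rounds=200, sample_size=500, seed=0):
    """Average the rule's reliability over samples drawn WITH replacement."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_rounds):
        sample = rng.choices(data, k=sample_size)  # sampling with replacement
        r = rule_reliability(rule, label, sample)
        if r is not None:
            estimates.append(r)
    return sum(estimates) / len(estimates)

def make_toy_data(n=10000, seed=1):
    """Hypothetical data: class 1 is likely when x > 0.5, unlikely otherwise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random()
        p = 0.9 if x > 0.5 else 0.1
        y = 1 if rng.random() < p else 0
        data.append((x, y))
    return data

data = make_toy_data()
rule = lambda x: x > 0.5  # hypothetical rule: "if x > 0.5 then class 1"

full = rule_reliability(rule, 1, data)         # reliability on the full data
approx = bootstrap_reliability(rule, 1, data)  # estimate from small samples

# Sampling WITHOUT replacement yields a subset of distinct records, the kind
# of sample on which a tree could be trained and its rules refined.
subset = random.Random(2).sample(data, k=1000)

print(f"full-data reliability: {full:.3f}, sampled estimate: {approx:.3f}")
```

As the number of rounds grows, the averaged estimate settles close to the full-data value, which is the intuition behind invoking the law of large numbers; the paper's formal statement and proof may differ from this simplified picture.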

Keywords: Classification rules, Decision tree, Sampling, Reliability, Big data

Article history: Received 3 September 2021, Revised 25 February 2022, Accepted 1 March 2022, Available online 9 March 2022, Version of Record 19 March 2022.

Official paper URL: https://doi.org/10.1016/j.knosys.2022.108522