Effects of data set features on the performances of classification algorithms

Authors:

Abstract

As the need to analyze big data sets grows, classification algorithms play an increasingly important role in data mining. Big data analysis must take more data set characteristics into account, such as data structure, variety of sources, and update frequency. In this paper, we evaluate scenarios to identify which data set characteristics most affect the performance of classification algorithms. Determining how strong or weak a given algorithm is on a given data set remains a complex problem. Thus, our research experimentally examines how data set characteristics affect algorithm performance, in terms of both accuracy and elapsed time. To do so, we use multiple regression to evaluate the causal relationship between data set characteristics, as independent variables, and performance metrics, as dependent variables. We also examine the moderating role that the classification algorithm plays in this relationship. We use all benchmark data sets in the UCI database that are suitable for running the classification algorithms. Based on the experimental results, we discuss what legacy classification algorithms require in order to handle big data analysis in a new era of business intelligence.
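The regression design described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `fit_moderated_regression`, the two example characteristics, the two algorithm labels, and all numeric values are assumptions chosen for illustration, and the data are synthetic. The sketch fits ordinary least squares with data set characteristics as independent variables, a performance metric as the dependent variable, and characteristic-by-algorithm interaction terms standing in for the moderating effect.

```python
import numpy as np


def fit_moderated_regression(features, algo_ids, performance):
    """Fit OLS with main effects plus feature-by-algorithm interaction terms.

    Requires at least two distinct algorithm labels. The first label (in
    sorted order) is the reference level for the dummy coding.
    """
    features = np.asarray(features, dtype=float)
    algo_ids = np.asarray(algo_ids)
    performance = np.asarray(performance, dtype=float)

    algos = np.unique(algo_ids)
    # Dummy-code the algorithm, skipping the reference level.
    dummies = np.column_stack(
        [(algo_ids == a).astype(float) for a in algos[1:]]
    )
    # Moderation terms: each characteristic multiplied by each dummy.
    interactions = np.column_stack(
        [features[:, j] * dummies[:, k]
         for k in range(dummies.shape[1])
         for j in range(features.shape[1])]
    )
    design = np.column_stack(
        [np.ones(len(performance)), features, dummies, interactions]
    )
    coef, *_ = np.linalg.lstsq(design, performance, rcond=None)
    return coef


# Synthetic, noise-free illustration (all values are made up).
rng = np.random.default_rng(0)
n_rows = 40
# Two hypothetical characteristics, e.g. scaled instance and attribute counts.
x = rng.uniform(0.0, 1.0, size=(n_rows, 2))
# Two hypothetical algorithms, alternating across rows.
algo = np.array(["algo_a", "algo_b"] * (n_rows // 2))
is_b = (algo == "algo_b").astype(float)

# Order: intercept, x1, x2, algo_b, x1*algo_b, x2*algo_b
true_coef = np.array([0.9, -0.1, 0.05, 0.02, 0.03, -0.04])
accuracy = (
    true_coef[0]
    + true_coef[1] * x[:, 0]
    + true_coef[2] * x[:, 1]
    + true_coef[3] * is_b
    + true_coef[4] * x[:, 0] * is_b
    + true_coef[5] * x[:, 1] * is_b
)

coef = fit_moderated_regression(x, algo, accuracy)
print(np.round(coef, 4))
```

In this setup, a nonzero interaction coefficient means that the effect of a characteristic on the performance metric differs between algorithms, which is one common way to express moderation in a regression model.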

Keywords: Big data, Classification algorithms, Performance evaluation, Data mining

Publication history: Available online 27 September 2012.

Official paper URL: https://doi.org/10.1016/j.eswa.2012.09.017