An automatic extraction method of the domains of competence for learning classifiers using data complexity measures
作者:Julián Luengo, Francisco Herrera
摘要
The constant appearance of algorithms and problems in data mining makes impossible to know in advance whether the model will perform well or poorly until it is applied, which can be costly. It would be useful to have a procedure that indicates, prior to the application of the learning algorithm and without needing a comparison with other methods, whether the outcome will be good or bad using the information available in the data. In this work, we present an automatic extraction method to determine the domains of competence of a classifier using a set of data complexity measures proposed for the task of classification. These domains codify the characteristics of the problems that are suitable or not for it, relating the concepts of data geometrical structures that may be difficult and the final accuracy obtained by any classifier. In order to do so, this proposal uses 12 metrics of data complexity acting over a large benchmark of datasets in order to analyze the behavior patterns of the method, obtaining intervals of data complexity measures with good or bad performance. As a representative for classifiers to analyze the proposal, three classical but different algorithms are used: C4.5, SVM and K-NN. From these intervals, two simple rules that describe the good or bad behaviors of the classifiers mentioned each are obtained, allowing the user to characterize the response quality of the methods from a dataset’s complexity. These two rules have been validated using fresh problems, showing that they are general and accurate. Thus, it can be established when the classifier will perform well or poorly prior to its application.
论文关键词:Classification, Data complexity, Domains of competence, C4.5, Support vector machines, K-nearest neighbor
论文评审过程:
论文官网地址:https://doi.org/10.1007/s10115-013-0700-4