Classifier performance as a function of distributional complexity

Authors:

Highlights:

Abstract

When choosing a classification rule, it is important to take into account the amount of sample data available. This paper examines the performance of classifiers of differing complexity in relation to the complexity of feature-label distributions in the case of small samples. We define the distributional complexity of a feature-label distribution to be the minimal number of hyperplanes necessary to achieve the Bayes classifier if the Bayes classifier is achievable by a finite number of hyperplanes, and infinity otherwise. Our approach is to choose a model and compare classifier efficiencies for various sample sizes and distributional complexities. Simulation results are obtained by generating data based on the model and the distributional complexities. A linear support vector machine (SVM) is considered, along with several nonlinear classifiers. For the most part, we see little improvement when a complex classifier is used instead of a linear SVM. For higher levels of distributional complexity, the linear classifier degrades, but so do the more complex classifiers, owing to insufficient training data. Hence, if one were to obtain a good result with a more complex classifier, it is most likely that the distributional complexity is low and there is no gain over using a linear classifier. Thus, under the model, it is generally impossible to claim that use of the nonlinear classifier is beneficial. In essence, the sample sizes are too small to take advantage of the added complexity. An exception to this observation is the behavior of the three-nearest-neighbor (3NN) classifier in the case of two variables (but not three) when there is very little overlap between the label distributions and the sample size is not too small. With a sample size of 60, the 3NN classifier performs close to the Bayes classifier, even for high levels of distributional complexity. Consequently, if one uses the 3NN classifier with two variables and obtains a low error, then the distributional complexity might be large and, if such is the case, there is a significant gain over using a linear classifier.
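The contrast described above can be illustrated with a hedged sketch. The paper's actual model is not specified here, so the example below uses a hypothetical stand-in: an XOR-like arrangement of four Gaussian clusters in two variables, where the Bayes classifier requires more than one hyperplane (distributional complexity greater than 1), the label distributions have little overlap, and the training sample size is 60.

```python
# Hypothetical illustration (not the paper's actual model or data):
# compare a linear SVM and the 3NN classifier on a small sample from a
# two-variable distribution whose Bayes classifier needs two hyperplanes.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def sample(n):
    # Four Gaussian clusters in an XOR layout: opposite corners share a
    # label, so no single hyperplane separates the classes, but the small
    # cluster spread keeps overlap between the label distributions low.
    centers = np.array([[0.0, 0.0], [2.0, 2.0], [0.0, 2.0], [2.0, 0.0]])
    labels = np.array([0, 0, 1, 1])
    idx = rng.integers(0, 4, size=n)
    X = centers[idx] + 0.3 * rng.standard_normal((n, 2))
    return X, labels[idx]

X_train, y_train = sample(60)    # small training sample, as in the paper
X_test, y_test = sample(5000)    # large test set to estimate true error

svm = LinearSVC().fit(X_train, y_train)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

err_svm = float(np.mean(svm.predict(X_test) != y_test))
err_3nn = float(np.mean(knn.predict(X_test) != y_test))
print(f"linear SVM error: {err_svm:.3f}, 3NN error: {err_3nn:.3f}")
```

Under these assumptions the linear SVM's error stays near chance, since no hyperplane fits the XOR geometry, while 3NN approaches the Bayes error, consistent with the exceptional 3NN behavior the abstract reports for two variables with little distributional overlap.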

Keywords: Classifier design, Classifier dimension, Distributional complexity, Small samples

Article history: Received 18 September 2003, Revised 15 October 2003, Available online 20 February 2004.

DOI: https://doi.org/10.1016/j.patcog.2003.10.013