Optimal convex error estimators for classification

摘要

A cross-validation error estimator is obtained by repeatedly leaving out some data points, deriving classifiers on the remaining points, computing errors for these classifiers on the left-out points, and then averaging these errors. The 0.632 bootstrap estimator is obtained by averaging the errors of classifiers designed from points drawn with replacement and then taking a convex combination of this “zero bootstrap” error with the resubstitution error for the designed classifier. This gives a convex combination of the low-biased resubstitution and the high-biased zero bootstrap. Another convex error estimator suggested in the literature is the unweighted average of resubstitution and cross-validation. This paper treats the following question: Given a feature-label distribution and classification rule, what is the optimal convex combination of two error estimators, i.e. what are the optimal weights for the convex combination. This problem is considered by finding the weights to minimize the MSE of a convex estimator. It also considers optimality under the constraint that the resulting estimator be unbiased. Owing to the large amount of results coming from the various feature-label models and error estimators, a portion of the results are presented herein and the main body of results appears on a companion website. In the tabulated results, each table treats the classification rules considered for the model, various Bayes errors, and various sample sizes. Each table includes the optimal weights, mean errors and standard deviations for the relevant error measures, and the MSE and MAE for the optimal convex estimator. Many observations can be made by considering the full set of experiments. Some general trends are outlined in the paper. The general conclusion is that optimizing the weights of a convex estimator can provide substantial improvement, depending on the classification rule, data model, sample size and component estimators. Optimal convex bootstrap estimators are applied to feature-set ranking to illustrate their potential advantage over non-optimized convex estimators.