Direct estimation of class membership probabilities for multiclass classification using multiple scores

Authors: Kazuko Takahashi, Hiroya Takamura, Manabu Okumura

Abstract

Accurate estimation of class membership probabilities is needed for many applications in data mining and decision-making, to which multiclass classification is often applied. Existing methods for estimating class membership probabilities are designed for binary classification, in which only a single score output by a classifier can be used. Applying them to multiclass classification therefore requires both decomposing the multiclass classifier into binary classifiers and combining the estimates obtained from each binary classifier into a target estimate. We propose a simple and general method for directly estimating the class membership probability for any class in multiclass classification without decomposition and combination, using multiple scores not only for the predicted class but also for other proper classes. To make it possible to use multiple scores, we modify or extend representative existing methods. As a non-parametric method, building on the binning method proposed by Zadrozny et al., we create an “accuracy table” by a different method. Moreover, we smooth the accuracies in the table with methods such as the moving average to yield reliable probabilities (accuracies). As a parametric method, we extend Platt’s method to apply multiple logistic regression. On two different datasets (open-ended data from Japanese social surveys and 20 Newsgroups), with both Support Vector Machines and naive Bayes classifiers, we show empirically that the use of multiple scores is effective for estimating class membership probabilities in multiclass classification in terms of cross entropy, the reliability diagram, the ROC curve, and AUC (area under the ROC curve), and that the proposed smoothing method for the accuracy table works quite well.
Finally, we show empirically that, in terms of MSE (mean squared error), our best proposed method is superior to a multiclass extension of the PAV method proposed by Zadrozny et al. on both the 20 Newsgroups dataset and the Pendigits dataset, but is slightly worse on the Pendigits dataset than the state-of-the-art method, a multiclass extension of a combination of boosting and a PAV method.
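
To make the idea of an accuracy table built from multiple scores more concrete, the following is a minimal sketch and not the authors' exact procedure. It assumes that two scores are used per sample (the score of the predicted class and the second-highest score), that bins are equal-width, and that smoothing is a count-weighted moving average over neighboring cells. All function and parameter names are hypothetical.

```python
import numpy as np


def build_accuracy_table(s1, s2, correct, n_bins=5):
    """Bin samples by two scores and count hits and totals per cell.

    s1, s2  : 1-D arrays holding the top and second classifier scores.
    correct : 1-D array of 0/1 flags (1 if the predicted class was right).
    Equal-width bins over the observed score range are an assumption.
    """
    e1 = np.linspace(s1.min(), s1.max(), n_bins + 1)
    e2 = np.linspace(s2.min(), s2.max(), n_bins + 1)
    # Using only the interior edges maps every score to a bin in 0..n_bins-1.
    i = np.digitize(s1, e1[1:-1])
    j = np.digitize(s2, e2[1:-1])
    hits = np.zeros((n_bins, n_bins))
    counts = np.zeros((n_bins, n_bins))
    np.add.at(hits, (i, j), correct)  # number of correct predictions per cell
    np.add.at(counts, (i, j), 1)      # number of samples per cell
    return hits, counts, e1, e2


def smooth_table(hits, counts, radius=1):
    """Count-weighted moving average over a (2*radius+1)^2 window.

    Cells whose whole window is empty stay NaN. With radius=0 this
    reduces to the raw per-cell accuracy hits / counts.
    """
    n, m = counts.shape
    table = np.full((n, m), np.nan)
    for a in range(n):
        for b in range(m):
            r0, r1 = max(0, a - radius), min(n, a + radius + 1)
            c0, c1 = max(0, b - radius), min(m, b + radius + 1)
            total = counts[r0:r1, c0:c1].sum()
            if total > 0:
                table[a, b] = hits[r0:r1, c0:c1].sum() / total
    return table


def lookup_probability(table, e1, e2, s1, s2):
    """Read the estimated probability for new score pairs from the table."""
    i = np.digitize(s1, e1[1:-1])
    j = np.digitize(s2, e2[1:-1])
    return table[i, j]
```

In this sketch, `build_accuracy_table` would be run on held-out predictions, `smooth_table` would turn the counts into smoothed accuracies, and `lookup_probability` would map the scores of a new sample to an estimated probability that its predicted class is correct. Pooling hits and counts within the window, rather than averaging per-cell accuracies, is a design choice made here so that sparsely populated cells do not dominate the average.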

Keywords: Multiclass classification, Class membership probabilities, Accuracy table, Logistic regression, Direct estimation, Multiple classification scores

Paper URL: https://doi.org/10.1007/s10115-008-0165-z