Improving the performance of Naive Bayes multinomial in e-mail foldering by introducing distribution-based balance of datasets

Authors:

Highlights:

Abstract

E-mail foldering, i.e. classifying e-mails into user-defined folders, can be viewed as a text classification/categorization problem. However, it has some intrinsic properties that make it more difficult to deal with: the large cardinality of the class variable (i.e. the number of folders), the uneven number of e-mails per class, and the fact that this is a dynamic problem, in the sense that e-mails arrive in the mail folders along a timeline. Perhaps because of these problems, standard text-oriented classifiers such as Naive Bayes Multinomial do not achieve good accuracy when applied to e-mail corpora. In this paper, we identify the imbalance among classes/folders as the main problem and propose a new balancing method based on learning and sampling probability distributions. Our experiments on a standard corpus (ENRON) with seven datasets (e-mail users) show that the results obtained by Naive Bayes Multinomial improve significantly when the balancing algorithm is applied first. For the sake of completeness, our experimental study also compares this approach with another standard balancing method (SMOTE) and with other classifiers.
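
The abstract does not spell out the balancing algorithm, so the following is only a minimal sketch of the general idea of balancing by learning and sampling a distribution, not the authors' method. It assumes a bag-of-words representation, a smoothed per-folder word distribution, and oversampling of every folder up to the size of the largest one; the function name `balance_by_sampling` and the parameters `alpha` and `seed` are hypothetical.

```python
import random
from collections import Counter


def balance_by_sampling(docs, labels, alpha=1.0, seed=0):
    """Oversample small folders with synthetic bag-of-words documents.

    Assumption (not from the paper): each folder is modelled by a single
    Laplace-smoothed word distribution, and synthetic documents are drawn
    from it until every folder is as large as the largest one.
    """
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})

    # Group the documents by folder (class label).
    by_class = {}
    for doc, label in zip(docs, labels):
        by_class.setdefault(label, []).append(doc)
    target = max(len(group) for group in by_class.values())

    new_docs, new_labels = list(docs), list(labels)
    for label, group in by_class.items():
        # "Learn" the folder's distribution: smoothed word counts.
        counts = Counter(word for doc in group for word in doc)
        weights = [counts[word] + alpha for word in vocab]
        lengths = [len(doc) for doc in group]
        # "Sample" synthetic documents until the folder reaches the target.
        for _ in range(target - len(group)):
            n_words = rng.choice(lengths)
            new_docs.append(rng.choices(vocab, weights=weights, k=n_words))
            new_labels.append(label)
    return new_docs, new_labels


# Toy example: three "work" e-mails and one "billing" e-mail.
docs = [["meeting", "agenda"], ["meeting", "room"],
        ["meeting", "notes"], ["invoice", "due"]]
labels = ["work", "work", "work", "billing"]
balanced_docs, balanced_labels = balance_by_sampling(docs, labels)
print(Counter(balanced_labels))
```

The balanced documents could then be fed to any multinomial Naive Bayes implementation in place of the original training set; the test set should be left unbalanced so that accuracy reflects the real folder distribution.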

Keywords: E-mail foldering, Text categorization, Imbalanced data, Naive Bayes multinomial, Classification

Publication history: Available online 5 August 2010.

Paper URL: https://doi.org/10.1016/j.eswa.2010.07.146