Effective spam filtering: A single-class learning and ensemble approach

Abstract

Spam email increasingly plagues both individuals and organizations. In response, most prior research treats spam filtering as a classical text categorization task, in which training examples must include both spam (positive examples) and legitimate emails (negative examples). However, in many spam filtering scenarios, obtaining legitimate emails for training purposes can be more difficult than collecting spam and unclassified emails. Hence, it is more appropriate to construct a classification model for spam filtering that uses only positive training examples (i.e., spam) and unlabeled instances, and does not require legitimate emails as negative training examples. Several single-class learning techniques, such as PNB and PEBL, have been proposed in the literature. However, they have inherent limitations with regard to spam filtering. In this study, we propose and develop an ensemble approach, referred to as E2, to address these limitations. Specifically, we follow the two-stage framework of PEBL but extend each stage with an ensemble strategy. The empirical evaluation results from two spam filtering corpora suggest that our proposed E2 technique generally outperforms the benchmark techniques (i.e., PNB and PEBL) and exhibits more stable performance than its counterparts.
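To make the idea of learning from positive and unlabeled examples concrete, the following is a minimal sketch of a generic two-stage procedure in the spirit of PEBL as it is commonly described: the first stage picks "strong negatives" out of the unlabeled pool, and the second stage repeatedly retrains a classifier and moves newly rejected unlabeled emails into the negative set. It is not the E2 technique; the ensemble extension of each stage is not modeled, because the abstract does not specify it. The function names, the toy word-weight scorer (a stand-in for a real classifier such as an SVM), and the sample emails are all assumptions made for illustration.

```python
from collections import Counter


def doc_freq(docs):
    """Fraction of documents in which each word appears."""
    counts = Counter(word for doc in docs for word in set(doc))
    return {word: count / len(docs) for word, count in counts.items()}


def positive_features(positives, unlabeled):
    """Words that occur in a larger share of positive than unlabeled documents."""
    pos_df, unl_df = doc_freq(positives), doc_freq(unlabeled)
    return {w for w, f in pos_df.items() if f > unl_df.get(w, 0.0)}


def train_scorer(positives, negatives):
    """Toy stand-in classifier (assumption): a word's weight is the
    difference between its positive and negative document frequency."""
    pos_df, neg_df = doc_freq(positives), doc_freq(negatives)
    vocab = set(pos_df) | set(neg_df)
    return {w: pos_df.get(w, 0.0) - neg_df.get(w, 0.0) for w in vocab}


def is_spam(weights, doc):
    """Classify a tokenized email as spam if its summed word weights are positive."""
    return sum(weights.get(w, 0.0) for w in set(doc)) > 0


def two_stage_pu(positives, unlabeled, max_rounds=10):
    # Stage 1: unlabeled documents containing no positive feature
    # are treated as strong negatives.
    feats = positive_features(positives, unlabeled)
    negatives = [d for d in unlabeled if not feats & set(d)]
    remaining = [d for d in unlabeled if feats & set(d)]
    if not negatives:
        raise ValueError("no strong negatives found in the unlabeled set")

    # Stage 2: retrain, move newly rejected documents into the negative
    # set, and stop once no remaining document is rejected.
    weights = train_scorer(positives, negatives)
    for _ in range(max_rounds):
        rejected = [d for d in remaining if not is_spam(weights, d)]
        if not rejected:
            break
        negatives = negatives + rejected
        remaining = [d for d in remaining if is_spam(weights, d)]
        weights = train_scorer(positives, negatives)
    return weights, negatives


# Hypothetical toy data: known spam plus a mixed, unlabeled inbox.
spam = [["free", "money", "now"], ["win", "free", "prize"], ["cheap", "money", "offer"]]
inbox = [
    ["meeting", "at", "noon"],
    ["free", "prize", "inside"],
    ["project", "report", "attached"],
    ["lunch", "meeting", "noon", "money"],
    ["win", "cheap", "offer"],
]
model, inferred_legitimate = two_stage_pu(spam, inbox)
```

In this toy run, the first stage selects the two inbox emails that share no word with the spam set, and the second stage additionally rejects the "lunch meeting" email even though it contains the word "money". No legitimate email is ever labeled by hand, which is the point of the single-class setting.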

Keywords: Spam filtering, Text categorization, Single-class learning, Ensemble approach, Learning from positive and unlabeled examples, Partially supervised classification

Review history: Available online 23 June 2007.

Official paper URL: https://doi.org/10.1016/j.dss.2007.06.010