Novel feature selection method based on harmony search for email classification
Abstract
Feature selection is often used in email classification to reduce the dimensionality of the feature space. In this study, a new feature selection method that combines document frequency and term frequency (DTFS) is proposed to improve the performance of email classification. First, an existing optimal document frequency based feature selection method (ODFFS) and a predetermined threshold are applied to select the most discriminative features. Second, an existing optimal term frequency based feature selection method (OTFFS) and another predetermined threshold are applied to select more discriminative features. Finally, ODFFS and OTFFS are combined to select the remaining features. To improve the convergence rate of parameter optimization, a metaheuristic method, namely global best harmony oriented harmony search (GBHS), is proposed to search for the optimal values of these predetermined thresholds. Experiments with fuzzy Support Vector Machine (FSVM) and Naïve Bayesian (NB) classifiers are conducted on six corpora: PU2, CSDMC2010, PU3, Lingspam, Enron-spam and Trec2007. The experimental results show that DTFS outperforms other methods, such as Chi-square, comprehensively measure feature selection, t-test based feature selection, term frequency based information gain, two-step based hybrid feature selection and improved term frequency inverse document frequency, on the six corpora.
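The threshold search mentioned in the abstract can be illustrated with a minimal sketch of basic harmony search over two thresholds. This is the standard algorithm, not the GBHS variant proposed in the paper, whose update rule the abstract does not specify. The function names, parameter values, and the toy objective (a stand-in for classifier accuracy as a function of the two thresholds) are all assumptions made for illustration.

```python
import random


def harmony_search(objective, bounds, memory_size=10, hmcr=0.9, par=0.3,
                   bandwidth=0.05, iterations=500, seed=0):
    """Maximize `objective` inside box `bounds` with basic harmony search.

    Hypothetical helper: hmcr is the memory consideration rate, par the
    pitch adjustment rate, bandwidth the relative size of a pitch step.
    """
    rng = random.Random(seed)
    # Harmony memory: random candidate solutions and their scores.
    memory = [[rng.uniform(lo, hi) for lo, hi in bounds]
              for _ in range(memory_size)]
    scores = [objective(h) for h in memory]

    for _ in range(iterations):
        new = []
        for d, (lo, hi) in enumerate(bounds):
            if rng.random() < hmcr:
                # Memory consideration: reuse a value from a stored harmony.
                value = rng.choice(memory)[d]
                if rng.random() < par:
                    # Pitch adjustment: small random perturbation.
                    value += rng.uniform(-bandwidth, bandwidth) * (hi - lo)
            else:
                # Random selection from the whole range.
                value = rng.uniform(lo, hi)
            new.append(min(hi, max(lo, value)))  # keep inside the bounds

        # Replace the worst stored harmony if the new one scores higher.
        new_score = objective(new)
        worst = min(range(memory_size), key=scores.__getitem__)
        if new_score > scores[worst]:
            memory[worst], scores[worst] = new, new_score

    best = max(range(memory_size), key=scores.__getitem__)
    return memory[best], scores[best]


def toy_objective(thresholds):
    """Assumed stand-in for validation accuracy; peaks at (0.3, 0.7)."""
    df_threshold, tf_threshold = thresholds
    return -((df_threshold - 0.3) ** 2) - ((tf_threshold - 0.7) ** 2)


if __name__ == "__main__":
    best, score = harmony_search(toy_objective, [(0.0, 1.0), (0.0, 1.0)])
    print(best, score)
```

In a real experiment, the toy objective would be replaced by a function that applies the two thresholds during feature selection, trains a classifier, and returns its validation score.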
Keywords: Feature selection, Document frequency, Term frequency, Parameter optimization, Harmony search
Review history: Received 22 February 2014, Revised 14 October 2014, Accepted 15 October 2014, Available online 23 October 2014.
Official paper URL: https://doi.org/10.1016/j.knosys.2014.10.013