Trigonometric comparison measure: A feature selection method for text categorization

Authors:

Highlights:

Abstract

Text data represented with the vector space model is high-dimensional, since the number of words can easily grow to tens of thousands even for a moderately sized dataset. Such data may contain many redundant or irrelevant features that degrade the performance of a classifier for text categorization. Feature selection addresses this problem by reducing dimensionality, with the aim of finding a set of highly distinguishing features. Most filter feature selection methods for text categorization are based on document frequencies in the positive and negative classes. Considering only document frequencies favors terms that are frequently used in the larger class and ignores relative document frequencies within the classes. In this paper, we present a new filter feature selection method, named Trigonometric Comparison Measure (TCM), that considers relative document frequencies. The proposed method uses the true positive rate (tpr) and false positive rate (fpr) to determine a better subset of features for text categorization, and it prefers terms that appear, with high probability, only in documents of one class. To assign a higher rank to terms that are frequently used in one class and rarely appear in the other, TCM calculates the off-axis angles of a vector represented as (tpr, fpr) and, using the sin and cos functions, gives a larger score to terms with a small angle. The proposed method is compared with eight well-known filter feature selection methods, including balanced accuracy measure (ACC2), information gain (IG), chi-squared (CHI), odds ratio (OR), Gini index (Gini), deviation from a Poisson distribution (DP), distinguishing feature selector (DFS), and normalized difference measure (NDM), on ten datasets using multinomial naïve Bayes and support vector machines. The experimental results show that TCM achieves significantly better performance for text categorization.
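
The abstract does not state the TCM formula, so the following is only a minimal sketch of the geometric idea it describes: convert per-class document frequencies to rates, measure how far the vector (tpr, fpr) lies from its nearest axis, and reward small angles. The function names and the scoring rule `illustrative_score` are assumptions made for illustration and should not be read as the measure defined in the paper; the class sizes and term counts are invented toy values.

```python
import math


def rates(df_pos, df_neg, n_pos, n_neg):
    """Return (tpr, fpr) from a term's document frequency in each class."""
    return df_pos / n_pos, df_neg / n_neg


def off_axis_angle(tpr, fpr):
    """Angle in radians between the vector (tpr, fpr) and its nearest axis."""
    if tpr == 0 and fpr == 0:
        # Term occurs in no document: treat it as maximally off-axis.
        return math.pi / 4
    return math.atan2(min(tpr, fpr), max(tpr, fpr))


def illustrative_score(tpr, fpr):
    """Hypothetical score, NOT the paper's TCM formula.

    It is largest when the vector is long and lies on an axis (the term is
    frequent in one class and absent from the other) and falls to about zero
    on the diagonal tpr == fpr.
    """
    theta = off_axis_angle(tpr, fpr)
    return max(tpr, fpr) * math.cos(2 * theta)


# Toy imbalanced dataset (invented numbers): 100 positive, 10,000 negative docs.
n_pos, n_neg = 100, 10_000
terms = {
    "x": (90, 0),       # in 90 of 100 positive documents, in no negative ones
    "y": (0, 500),      # in 500 of 10,000 negative documents only
    "z": (50, 5_000),   # in half of the documents of each class
}

scores = {
    name: illustrative_score(*rates(df_pos, df_neg, n_pos, n_neg))
    for name, (df_pos, df_neg) in terms.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

By raw document frequency, term `z` (5,050 documents) and term `y` (500 documents) both outrank term `x` (90 documents), simply because the negative class is larger. Working with rates and the off-axis angle instead places `x` first, which matches the preference stated in the abstract for terms that are frequent in one class and rare in the other.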

Keywords: Feature selection, Text categorization, Text classification, Dimension reduction

Review history: Received 3 March 2018, Revised 12 October 2018, Accepted 28 October 2018, Available online 12 November 2018, Version of Record 27 February 2019.

Paper URL: https://doi.org/10.1016/j.datak.2018.10.003