The use of bigrams to enhance text categorization

Authors:

Abstract

In this paper, we present an efficient text categorization algorithm that generates bigrams selectively, looking for those that have an especially good chance of being useful. The algorithm uses the information gain metric, combined with various frequency thresholds. The selected bigrams, along with unigrams, are then given as features to two different classifiers: Naïve Bayes and maximum entropy. The experimental results suggest that the bigrams can substantially raise the quality of feature sets, showing increases in the break-even points and F1 measures. The McNemar test shows that in most categories the increases are very significant. Upon close examination of the algorithm, we concluded that it is most successful at correctly classifying more positive documents, but may cause more negative documents to be classified incorrectly.
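
The abstract names the two ingredients of bigram selection (information gain and frequency thresholds) but not the exact procedure or threshold values. The following is a minimal sketch, not the authors' implementation. It assumes one binary category, a single document-frequency threshold, and a single information gain threshold. The function names (`information_gain`, `select_bigrams`) and the parameters `min_doc_freq` and `min_gain` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def entropy(pos, neg):
    """Entropy in bits of a set with `pos` positive and `neg` negative documents."""
    total = pos + neg
    if total == 0:
        return 0.0
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(present, labels):
    """Information gain of a binary feature with respect to a binary category.

    present[i] is True if the feature occurs in document i;
    labels[i] is True if document i belongs to the category.
    """
    n = len(labels)
    pos = sum(labels)
    with_pos = sum(1 for w, y in zip(present, labels) if w and y)
    with_neg = sum(1 for w, y in zip(present, labels) if w and not y)
    without_pos = pos - with_pos
    without_neg = (n - pos) - with_neg
    n_with = with_pos + with_neg
    n_without = n - n_with
    return (entropy(pos, n - pos)
            - (n_with / n) * entropy(with_pos, with_neg)
            - (n_without / n) * entropy(without_pos, without_neg))


def bigrams(tokens):
    """Adjacent word pairs of a tokenized document."""
    return list(zip(tokens, tokens[1:]))


def select_bigrams(docs, labels, min_doc_freq=2, min_gain=0.1):
    """Keep bigrams that are frequent enough and informative enough.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    doc_bigrams = [set(bigrams(d)) for d in docs]
    doc_freq = Counter(b for s in doc_bigrams for b in s)
    selected = {}
    for bigram, df in doc_freq.items():
        if df < min_doc_freq:  # frequency threshold
            continue
        present = [bigram in s for s in doc_bigrams]
        gain = information_gain(present, labels)
        if gain >= min_gain:  # information gain threshold
            selected[bigram] = gain
    return selected


# Toy corpus: two documents in the category, two outside it.
docs = [
    "interest rate rises".split(),
    "bank cuts interest rate".split(),
    "team wins home game".split(),
    "home game sold out".split(),
]
labels = [True, True, False, False]
selected = select_bigrams(docs, labels)
```

On this toy corpus only the two bigrams that occur in at least two documents survive the frequency filter, and each separates the category perfectly. In a full pipeline the selected bigrams would be added to the unigram feature set before training the classifiers.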

Keywords: Information retrieval, Text categorization, Text classification, Machine learning

Article history: Received 19 March 2001, Accepted 6 August 2001, Available online 14 March 2002.

Paper URL: https://doi.org/10.1016/S0306-4573(01)00045-0