Contextual feature selection for text classification

Authors:

Highlights:

Abstract

We present a simple approach for the classification of “noisy” documents using bigrams and named entities. The approach combines conventional feature selection with a contextual approach to filter out passages around selected features. Originally designed for call-for-tender documents, the method can be useful for other web collections that also contain non-topical content. Experiments are conducted on our in-house collection as well as on the 4-Universities data set, Reuters-21578 and 20 Newsgroups. We find a significant improvement on our collection and the 4-Universities data set (10.9% and 4.1%, respectively). Although the best results are obtained by combining bigrams and named entities, the impact of the latter is not found to be significant.
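
As a rough illustration of the two-stage idea described in the abstract, the sketch below first selects class-indicative bigrams and then keeps only the tokens in a window around their occurrences. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not give the selection criterion or the passage size, so the document-frequency difference score, the `window` parameter, and all function names here are hypothetical. It also reads "filter out passages around selected features" as retaining those passages and discarding the rest, which is an interpretation. Named entities are left out for brevity.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bigrams(tokens):
    """Return the list of adjacent token pairs."""
    return list(zip(tokens, tokens[1:]))


def select_bigrams(docs, labels, target, top_k=2):
    """Pick the top_k bigrams for the target class.

    Assumption: bigrams are scored by document frequency inside the
    target class minus document frequency outside it. The abstract does
    not name the criterion actually used; this is only a stand-in.
    """
    pos, neg = Counter(), Counter()
    for text, label in zip(docs, labels):
        seen = set(bigrams(tokenize(text)))  # count each bigram once per document
        (pos if label == target else neg).update(seen)
    scores = {bg: pos[bg] - neg[bg] for bg in pos}
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda bg: (-scores[bg], bg))
    return set(ranked[:top_k])


def contextual_filter(text, selected, window=3):
    """Keep only tokens within `window` positions of a selected bigram.

    Assumption: a "passage" is a fixed-size token window; the window size
    is a hypothetical parameter, not a value taken from the paper.
    """
    tokens = tokenize(text)
    keep = [False] * len(tokens)
    for i, bg in enumerate(bigrams(tokens)):
        if bg in selected:
            start = max(0, i - window)
            end = min(len(tokens), i + 2 + window)  # the bigram spans i and i + 1
            for j in range(start, end):
                keep[j] = True
    return [tok for tok, flag in zip(tokens, keep) if flag]
```

In this sketch, a document containing none of the selected bigrams is reduced to an empty token list, so navigation text, copyright notices and similar non-topical content contribute nothing to the classifier's input. The filtered tokens would then be passed to any conventional text classifier.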

Keywords: Classification, Named entities, Feature selection, Text filtering

Article history: Received 28 May 2006, Accepted 25 July 2006, Available online 24 October 2006.

Official paper URL: https://doi.org/10.1016/j.ipm.2006.07.006