A preprocess algorithm of filtering irrelevant information based on the minimum class difference
Authors:
Abstract
Whether a word (or feature) should be included or excluded during text classification can depend on a number of factors, such as the amount of information it carries, its frequency of occurrence, and its meaning. The application context is another important factor. A word may characterize a document in one application context but fail to reflect its nature in another. This paper reports on an investigation into feature selection for classification that takes into account the application context of the documents to be processed. A new feature selection algorithm for text classification, the PBMCD algorithm, is proposed. The algorithm has been implemented and tested on three different data sets. The experimental results show that it can not only filter out irrelevant features before classification but also increase classification accuracy. For comparison, experimental results obtained with other methods are also presented.
Keywords: Classification, Text categorization, Feature selection, Preprocess
Review history: Received 11 January 2005, Accepted 24 March 2006, Available online 28 June 2006.
Official paper URL: https://doi.org/10.1016/j.knosys.2006.03.005