An improved supervised term weighting scheme for text representation and classification

Authors:

Highlights:

Abstract

Term weighting schemes have a significant effect on text classification performance, because the term weighting scheme determines how texts are represented in the vector space model. Term frequency-inverse document frequency (TF-IDF) is currently the most widely used term weighting scheme, but it does not use the category information available in the training texts. Taking this category information (or category factor) into account, this study develops an improved supervised term weighting method for text representation that combines a new information measure, cumulative residual entropy, with the proportional distortion function. To verify the classification performance of the proposed scheme, we conducted an extensive experimental comparison with existing schemes on two corpora with different characteristics, Reuters-21578 and 20 Newsgroups. The results show that the proposed scheme achieves significantly better text classification performance than the other schemes. Specifically, with a linear support vector machine classifier, micro-F1 improved to 0.972 on Reuters-21578 and 0.833 on 20 Newsgroups.
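To make the limitation of term frequency-inverse document frequency (TF-IDF) concrete, the following is a minimal sketch of the unsupervised baseline only, not of the scheme proposed in the paper. The toy documents, the labels, the function name `tfidf`, and the plain `tf * log(N / df)` variant are all assumptions chosen for illustration; the paper may use a different normalization.

```python
import math
from collections import Counter

def tfidf(docs):
    """Hypothetical helper: docs is a list of token lists.

    Returns one {term: weight} dict per document, using the plain
    tf * log(N / df) variant (an assumption, other variants exist).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weighted.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weighted

# Toy corpus (invented for illustration).
docs = [
    ["wheat", "price", "rise"],
    ["wheat", "export", "fall"],
    ["team", "win", "match"],
]
labels = ["grain", "grain", "sport"]  # never consulted by tfidf()

weights = tfidf(docs)
```

In this toy corpus, "wheat" occurs in both "grain" documents and nowhere else, so it is a strong category indicator, yet it receives a lower weight than "price" simply because it appears in more documents. The labels play no part in the computation, which is the gap that supervised term weighting schemes aim to close.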

Keywords: Supervised term weighting, Text representation, Text classification, Cumulative residual entropy, Proportional distortion function

Review history: Received 29 April 2020, Revised 31 August 2021, Accepted 26 September 2021, Available online 1 October 2021, Version of Record 3 November 2021.

Official paper URL: https://doi.org/10.1016/j.eswa.2021.115985