Several alternative term weighting methods for text representation and classification

Authors:

Highlights:

Abstract

Text representation is a hot topic underpinning text classification (TC) tasks and has a substantial impact on TC performance. Although the well-known TF–IDF was originally designed for information retrieval rather than TC, it is highly useful in TC as a term weighting method for representing text content. Inspired by the IDF part of TF–IDF, which is defined as a logarithmic transformation, we propose several alternative methods in this study to generate unsupervised term weighting schemes that can offset the drawbacks of TF–IDF. Moreover, because TC tasks differ from information retrieval, representing test texts as vectors in an appropriate way is also essential for TC, especially for supervised term weighting approaches (e.g., TF–RF), mainly because these methods use category information when weighting terms. However, most current schemes do not clearly explain how test texts should be represented under their weighting. To explore this problem and seek a reasonable solution, we analyze a classic unsupervised term weighting method and three typical supervised term weighting methods in depth to illustrate how to represent test texts. To investigate the effectiveness of our work, three sets of experiments were designed to compare performance. The comparisons show that our proposed methods can indeed enhance TC performance, sometimes even outperforming existing supervised term weighting methods.
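To make the two families of schemes concrete, the sketch below implements the classic unsupervised TF–IDF and the supervised TF–RF (relevance frequency) weightings named in the abstract, using the standard formulas tf·log(N/df) and tf·log2(2 + a/max(1, c)), where a and c count positive- and negative-class documents containing the term. The `represent_test_doc` helper illustrates one plausible answer to the test-text question the abstract raises — taking each term's maximum rf over all one-vs-rest categories — which is an assumption for illustration, not the paper's prescribed solution.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Unsupervised TF-IDF: w(t, d) = tf(t, d) * log(N / df(t))."""
    N = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency: one count per doc
    return [{t: cnt * math.log(N / df[t]) for t, cnt in Counter(doc).items()}
            for doc in docs]

def tf_rf(docs, labels, category):
    """Supervised TF-RF, one-vs-rest for `category`:
    rf(t) = log2(2 + a / max(1, c)), with a / c the number of
    positive- / negative-class documents containing term t."""
    a, c = Counter(), Counter()
    for doc, label in zip(docs, labels):
        (a if label == category else c).update(set(doc))
    rf = {t: math.log2(2 + a[t] / max(1, c[t])) for t in set(a) | set(c)}
    weighted = [{t: cnt * rf[t] for t, cnt in Counter(doc).items()}
                for doc in docs]
    return weighted, rf

def represent_test_doc(doc, rf_by_category):
    """One way (an assumption, not the paper's method) to weight an
    unlabeled test text under a supervised scheme: since the true
    category is unknown, take each term's maximum rf over all
    one-vs-rest weightings; unseen terms default to 1.0."""
    return {t: cnt * max(rf.get(t, 1.0) for rf in rf_by_category.values())
            for t, cnt in Counter(doc).items()}
```

For example, with training texts `[["cat", "pet"], ["dog", "pet"], ["stock", "bond"]]` labeled `["animal", "animal", "finance"]`, the term "pet" gets a low IDF (it is frequent overall) but a high rf of log2(2 + 2/1) = 2.0 for the "animal" category, which is exactly the kind of category-discriminative signal supervised schemes exploit.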

Keywords: Unsupervised term weighting, Supervised term weighting, Text representation, Text classification, Nonlinear transformation

Article history: Received 29 April 2020, Revised 5 August 2020, Accepted 8 August 2020, Available online 14 August 2020, Version of Record 20 August 2020.

DOI: https://doi.org/10.1016/j.knosys.2020.106399