TESC: An approach to TExt classification using Semi-supervised Clustering

Authors:

Highlights:

Abstract

This paper proposes an approach called TESC (TExt classification using Semi-supervised Clustering) to improve text classification. The basic idea is to regard each category of texts as consisting of one or more components, and to use clustering to identify these components in a text collection. During clustering, TESC uses labeled texts to capture the silhouettes of text clusters and unlabeled texts to adapt their centroids. Each text cluster is assigned the label of the texts it contains. When a new unlabeled text arrives, we measure its similarity to the text clusters and assign it the label of the nearest cluster. Experiments on the Reuters-21578 and TanCorp V1.0 text collections demonstrate that, in text classification, TESC outperforms Support Vector Machines (SVMs) and the back propagation neural network (BPNN), and achieves performance comparable to naïve Bayes with EM (Expectation Maximization) but with lower computational complexity.
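
The idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract does not specify the similarity measure, the clustering procedure, or any parameters, so cosine similarity, the single-pass assignment, the `threshold` value, and the names `build_clusters`, `absorb`, and `classify` are all assumptions made for illustration. It only shows the general pattern of labeled texts defining clusters (possibly several per category), unlabeled texts shifting centroids, and a new text taking the label of its nearest cluster.

```python
import math


def cosine(u, v):
    """Cosine similarity of two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def absorb(cluster, vec):
    """Add vec to a cluster by updating its centroid as a running mean."""
    n = cluster["size"]
    cluster["centroid"] = [
        (c * n + x) / (n + 1) for c, x in zip(cluster["centroid"], vec)
    ]
    cluster["size"] = n + 1


def build_clusters(labeled, unlabeled, threshold=0.5):
    """Hypothetical one-pass simplification, not the published procedure.

    labeled:   list of (vector, label) pairs
    unlabeled: list of vectors
    Returns a list of dicts with keys 'centroid', 'label', 'size'.
    """
    clusters = []
    for vec, label in labeled:
        # Join the most similar same-label cluster if it is similar enough;
        # otherwise start a new component for this category.
        best, best_sim = None, threshold
        for c in clusters:
            if c["label"] == label:
                sim = cosine(vec, c["centroid"])
                if sim >= best_sim:
                    best, best_sim = c, sim
        if best is None:
            clusters.append({"centroid": list(vec), "label": label, "size": 1})
        else:
            absorb(best, vec)
    for vec in unlabeled:
        # Unlabeled texts never create clusters; they only move centroids.
        if not clusters:
            break
        nearest = max(clusters, key=lambda c: cosine(vec, c["centroid"]))
        if cosine(vec, nearest["centroid"]) >= threshold:
            absorb(nearest, vec)
    return clusters


def classify(vec, clusters):
    """Return the label of the cluster whose centroid is most similar to vec."""
    return max(clusters, key=lambda c: cosine(vec, c["centroid"]))["label"]
```

In this sketch a category such as "sports" may end up with several clusters when its labeled texts are dissimilar to each other, which mirrors the abstract's notion of a category made up of more than one component.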

Keywords: Text classification, Semi-supervised clustering, Unlabeled data, Support vector machines, Expectation maximization

Article history: Received 17 February 2014, Revised 24 November 2014, Accepted 25 November 2014, Available online 8 December 2014.

Official paper URL: https://doi.org/10.1016/j.knosys.2014.11.028