Projected-prototype based classifier for text categorization

Authors:

Abstract

The explosive growth of data is driving a greater demand for text categorization. Existing prototype-based classifiers, including k-NN, kNNModel and the Centroid classifier, have attracted wide interest from the text mining community because of their simplicity and efficiency. However, they usually perform less effectively on document data sets because of the high dimensionality and complex class structures that these sets involve. In most cases a single document category actually contains multiple subtopics, which means that the documents in the same class may comprise multiple subclasses, each associated with its own term subspace. In this paper, a novel projected-prototype based classifier is proposed for text categorization, in which a document category is represented by a set of prototypes, each combining a representative for the documents in a subclass with the corresponding term subspace. In the classifier's training process, the number of prototypes and the prototypes themselves are learned using a newly developed feature-weighting algorithm, so that the documents belonging to different subclasses are separated as much as possible when projected onto their own subspaces. Then, in the testing process, each test document is classified according to its weighted distances from the different prototypes. Experimental results on the Reuters-21578 and 20-Newsgroups corpora show that the proposed classifier, which is based on the multi-representative-dependent projection method, can achieve higher classification accuracy at a lower computational cost than conventional prototype-based classifiers, especially for data sets that include overlapping document categories.
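
The testing step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each prototype is a hypothetical triple of a class label, a representative term vector, and a per-term weight vector standing in for the term subspace, and that the weighted distance is a weighted squared Euclidean distance. The abstract specifies neither the distance form nor the training procedure, so the weights below are hand-picked toy values rather than learned ones.

```python
# Sketch of classification by weighted distance to class prototypes.
# Assumptions (not from the paper): the prototype layout, the weighted
# squared Euclidean distance, and all names and numbers used below.

def weighted_distance(doc, center, weights):
    """Weighted squared Euclidean distance between a document and a prototype."""
    return sum(w * (x - c) ** 2 for x, c, w in zip(doc, center, weights))


def classify(doc, prototypes):
    """Return the label of the prototype nearest to `doc`.

    `prototypes` is a list of (label, center, weights) triples; a class
    may contribute several prototypes, one per assumed subclass.
    """
    best_label, best_dist = None, float("inf")
    for label, center, weights in prototypes:
        dist = weighted_distance(doc, center, weights)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Toy data: three terms, two classes, and two subclasses for "sports".
prototypes = [
    ("sports", [1.0, 0.0, 0.0], [0.8, 0.1, 0.1]),
    ("sports", [0.0, 1.0, 0.0], [0.1, 0.8, 0.1]),
    ("finance", [0.0, 0.0, 1.0], [0.1, 0.1, 0.8]),
]

doc = [0.1, 0.9, 0.2]
print(classify(doc, prototypes))
```

Because a class may own several prototypes, a document only needs to lie close to one of its class's subclass representatives, measured in that prototype's own weighted terms, to receive the class label.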

Keywords: Text categorization, Projection, Multi-representative, Prototype, Feature-weighting

Review history: Received 3 August 2012, Revised 6 May 2013, Accepted 23 May 2013, Available online 6 June 2013.

Paper URL: https://doi.org/10.1016/j.knosys.2013.05.013