An active learning framework for semi-supervised document clustering with language modeling

Abstract

This paper investigates a framework that actively selects informative document pairs on which to obtain user feedback for semi-supervised document clustering. We design a gain-directed document pair selection method that measures how much can be learned by revealing the judgments of the selected document pairs. Estimates of term co-occurrence probabilities serve as a clue for finding informative document pairs. Term co-occurrence probabilities are also incorporated into the semi-supervised clustering process to capture term-to-term dependence relationships, and each cluster is represented by a language model. We have conducted extensive experiments on several real-world corpora, and the results demonstrate that the proposed framework is effective.
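The abstract does not give the gain function or the clustering equations, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes each cluster is an additively smoothed unigram language model, and it substitutes the binary entropy of the estimated "same cluster" probability for the paper's gain measure. Term co-occurrence probabilities are not modeled here. All function names, the smoothing constant `alpha`, and the toy documents are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def cluster_lm(docs, vocab, alpha=0.1):
    """Additively smoothed unigram language model for one cluster (assumed form)."""
    counts = Counter(term for doc in docs for term in doc)
    total = sum(counts.values()) + alpha * len(vocab)
    return {term: (counts[term] + alpha) / total for term in vocab}

def posterior(doc, lms):
    """P(cluster | doc) under a uniform cluster prior, computed in log space."""
    logs = [sum(math.log(lm[term]) for term in doc) for lm in lms]
    top = max(logs)
    weights = [math.exp(x - top) for x in logs]
    norm = sum(weights)
    return [w / norm for w in weights]

def binary_entropy(p):
    """Entropy in bits of a yes/no judgment whose 'yes' probability is p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def select_pair(docs, lms):
    """Return the pair whose same-cluster judgment is most uncertain.

    The entropy score is a stand-in for the paper's gain measure.
    """
    posts = [posterior(doc, lms) for doc in docs]
    best_pair, best_score = None, -1.0
    for i, j in combinations(range(len(docs)), 2):
        # Probability that both documents fall in the same cluster.
        same = sum(a * b for a, b in zip(posts[i], posts[j]))
        score = binary_entropy(same)
        if score > best_score:
            best_pair, best_score = (i, j), score
    return best_pair, best_score

# Toy corpus: two sports documents, two finance documents, one ambiguous.
docs = [
    ["ball", "goal", "team"],
    ["team", "goal", "match"],
    ["stock", "market", "price"],
    ["market", "price", "trade"],
    ["team", "market"],
]
vocab = sorted({term for doc in docs for term in doc})
lms = [cluster_lm(docs[:2], vocab), cluster_lm(docs[2:4], vocab)]
pair, score = select_pair(docs, lms)
print(pair, round(score, 3))
```

In this toy setup the ambiguous document (index 4) is equally likely under both cluster models, so any pair containing it is maximally uncertain and one such pair is selected. A user judgment on that pair ("same cluster" or "different cluster") would then become a constraint for the next clustering round.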

Keywords: Document clustering, Semi-supervised, Active learning, Language modeling

Review history: Received 11 April 2007, Revised 15 July 2008, Accepted 15 August 2008, Available online 16 September 2008.

Paper URL: https://doi.org/10.1016/j.datak.2008.08.008