Twin labeled LDA: a supervised topic model for document classification

Authors: Wei Wang, Bing Guo, Yan Shen, Han Yang, Yaosen Chen, Xinhua Suo

Abstract

Statistical topic modeling approaches such as Latent Dirichlet Allocation (LDA) have recently been widely applied to document classification. However, standard LDA is a completely unsupervised algorithm, and there is growing interest in incorporating prior information into the topic modeling procedure. Effective approaches have been developed to model different kinds of prior information, for example observed labels, hidden labels, the correlation among labels, and label frequencies; however, these methods often require heavy computation because of model complexity. In this paper, we propose a new supervised topic model for document classification, Twin Labeled LDA (TL-LDA), which has two parallel topic modeling processes: one incorporates the prior label information through hierarchical Dirichlet distributions, and the other models the grouping tags, which carry prior knowledge about the label correlation. The two processes are independent of each other, so TL-LDA can be trained efficiently with multi-threaded parallel computing. Quantitative experiments against state-of-the-art approaches show that our model achieves the best scores on both rank-based and binary prediction metrics in single-label classification, and the best scores on three metrics (One Error, Micro-F1, and Macro-F1) in multi-label classification, on both non-power-law and power-law datasets. The results show that, by fully modeling prior knowledge, our model achieves outstanding performance and generalizability on document classification. Further comparisons with recent works also indicate that the proposed model is competitive with state-of-the-art approaches.
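
For readers unfamiliar with the three multi-label metrics named in the abstract, the following is a minimal sketch of their standard definitions. It is not taken from the paper: the function names, the dictionary-of-scores input format, and the toy data in the usage note are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Set, Tuple


def one_error(scores: List[Dict[str, float]], truth: List[Set[str]]) -> float:
    """Fraction of documents whose top-scored label is not a true label."""
    misses = 0
    for doc_scores, true_labels in zip(scores, truth):
        top_label = max(doc_scores, key=doc_scores.get)
        if top_label not in true_labels:
            misses += 1
    return misses / len(truth)


def _f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to count."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(
    pred: List[Set[str]], truth: List[Set[str]], labels: Sequence[str]
) -> Tuple[float, float]:
    """Return (Micro-F1, Macro-F1) over binary label predictions."""
    # counts[label] = [true positives, false positives, false negatives]
    counts = {label: [0, 0, 0] for label in labels}
    for predicted, actual in zip(pred, truth):
        for label in labels:
            if label in predicted and label in actual:
                counts[label][0] += 1
            elif label in predicted:
                counts[label][1] += 1
            elif label in actual:
                counts[label][2] += 1
    # Micro-F1 pools the counts over all labels before computing F1.
    micro = _f1(
        sum(c[0] for c in counts.values()),
        sum(c[1] for c in counts.values()),
        sum(c[2] for c in counts.values()),
    )
    # Macro-F1 averages the per-label F1 scores, weighting labels equally.
    macro = sum(_f1(*c) for c in counts.values()) / len(labels)
    return micro, macro
```

One Error is rank-based and needs only the per-label scores, while Micro-F1 and Macro-F1 need binary predictions. Macro-F1 weights rare labels as heavily as frequent ones, which is why it is informative on power-law label distributions.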

Keywords: Supervised, Topic modeling, Document classification, Hierarchical Dirichlet distributions

Official paper URL: https://doi.org/10.1007/s10489-020-01798-x