TopicBank: Collection of coherent topics using multiple model training with their further use for topic model validation

Abstract:

Probabilistic topic modeling of a text collection is a tool for unsupervised learning of the collection's inherent thematic structure. Given only the text of documents as input, a topic model aims to reveal latent topics as probability distributions over words. Topic models have two shortcomings: they are unstable, in the sense that the topics may depend on the random initialization, and incomplete, in the sense that each new run of the model on the same collection may discover new topics. As a result, data exploration with topic modeling usually requires many experiments, examining numerous topic models and tuning their parameters in search of the model that best describes the data. To deal with the instability and incompleteness of topic models, we propose to gradually accumulate interpretable topics in a "topic bank" using multiple model training. To add topics to the bank, we learn a child level in a hierarchical topic model, then analyze the coherence of the child subtopics and their relationships with the parent bank topics, excluding irrelevant and duplicate subtopics instead of adding them to the bank. We then introduce a new approach to topic model evaluation that compares the topics found by a model with those collected beforehand in the bank. Our experiments with several datasets and topic models show that the proposed method does help in finding a model with more interpretable topics.
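The bank-update step described above can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: the coherence function, the cosine-similarity duplicate test, and both thresholds (`min_coherence`, `max_similarity`) are assumptions chosen for clarity, and topics are represented as sparse word-probability dicts.

```python
import math

def cosine(p, q):
    """Cosine similarity between two sparse word distributions (dicts)."""
    dot = sum(p[w] * q.get(w, 0.0) for w in p)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def update_bank(bank, candidates, coherence_fn,
                min_coherence=0.5, max_similarity=0.9):
    """Add each candidate subtopic to the bank unless it is judged
    incoherent or nearly duplicates an existing bank topic
    (illustrative criteria only; the paper's actual tests differ)."""
    for topic in candidates:
        if coherence_fn(topic) < min_coherence:
            continue  # skip uninterpretable subtopics
        if any(cosine(topic, b) >= max_similarity for b in bank):
            continue  # skip near-duplicates of parent bank topics
        bank.append(topic)
    return bank

# Toy usage: a near-duplicate is rejected, a genuinely new topic is kept.
bank = [{"cat": 0.6, "dog": 0.4}]
candidates = [
    {"cat": 0.61, "dog": 0.39},     # near-duplicate of the bank topic
    {"stock": 0.5, "market": 0.5},  # new thematic direction
]
bank = update_bank(bank, candidates, coherence_fn=lambda t: 1.0)
# bank now holds two topics: the original one plus the stock/market topic
```

Model evaluation then reduces to matching a model's topics against the accumulated bank with the same kind of similarity measure, rewarding models that recover many distinct bank topics.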

Keywords: Topic modeling, Multiple model training, Topic coherence, Stability, Regularization

Article history: Received 29 September 2020, Revised 5 April 2021, Accepted 19 August 2021, Available online 30 August 2021, Version of Record 20 September 2021.

DOI: https://doi.org/10.1016/j.datak.2021.101921