Topic model with incremental vocabulary based on Belief Propagation
Abstract:
Most LDA algorithms share the same limiting assumption of a fixed vocabulary. When these algorithms process data streams in real time, words absent from the vocabulary are discarded: unexpected words appearing in the stream cannot be processed, because the atoms of the Dirichlet distribution are fixed. To address these drawbacks, ivLDA is proposed, in which the topic–word distributions stem from a Dirichlet process with infinitely many atoms instead of a Dirichlet distribution. ivLDA maintains an incremental vocabulary that enables the topic model to process data streams. In addition, two methods are presented to manage word indices: ivLDA-Perp and ivLDA-PMI. ivLDA-Perp achieves high accuracy, while ivLDA-PMI identifies the most informative words for representing each topic. Experiments indicate that ivLDA-Perp and ivLDA-PMI outperform infvoc-LDA and other state-of-the-art fixed-vocabulary algorithms.
Keywords: Topic model, Belief Propagation, Stick-breaking process, Online algorithm
Article history: Received 26 October 2018, Revised 24 June 2019, Accepted 24 June 2019, Available online 31 July 2019, Version of Record 9 September 2019.
DOI: https://doi.org/10.1016/j.knosys.2019.06.020