Improving neural topic modeling via Sinkhorn divergence

作者:

Highlights:

摘要

Textual data have been a major form to convey internet users’ content. How to effectively and efficiently discover latent topics among them has essential theoretical and practical value. Recently, neural topic models(NTMs), especially Variational Auto-encoder-based NTMs, proved to be a successful approach for mining meaningful and interpretable topics. However, they usually suffer from two major issues:(1)Posterior collapse: KL divergence will rapidly reach zeros resulting in low-quality representation in latent distribution; (2)Unconstrained topic generative models: Topic generative models are always unconstrained, which potentially leads to discovering redundant topics. To address these issues, we propose Autoencoding Sinkhorn Topic Model based on Sinkhorn Auto-encoder(SAE) and Sinkhorn divergence. SAE utilizes Sinkhorn divergence rather than problematic KL divergence to optimize the difference between posterior and prior, which is free of posterior collapse. Then, to reduce topic redundancy, Sinkhorn Topic Diversity Regularization(STDR) is presented. STDR leverages the proposed Salient Topic Layer and Sinkhorn divergence for measuring distance between salient topic features and serves as a penalty term in loss function facilitating discovering diversified topics in training. Several experiments have been conducted on 2 popular datasets to verify our contribution. Experiment results demonstrate the effectiveness of the proposed model.

论文关键词:00-01,99-00,Deep learning,Topic model,Sinkhorn divergence,Auto-encoder

论文评审过程:Received 30 July 2021, Revised 8 December 2021, Accepted 30 December 2021, Available online 16 February 2022, Version of Record 16 February 2022.

论文官网地址:https://doi.org/10.1016/j.ipm.2021.102864