Topic discovery in massive text corpora based on Min-Hashing

Authors:

Highlights:

• A new topic discovery approach based on Min-Hashing.

• The approach can handle massive text corpora and large vocabularies with modest computer resources.

• The number of topics does not need to be provided beforehand.

• The topics are as coherent as those discovered by Online LDA.

• The proposed approach is considerably faster than Online LDA.


Keywords: Topic discovery, Sampled Min-Hashing, Beyond-pairwise, Co-occurring words, Large-scale

Review history: Received 15 July 2018, Revised 12 June 2019, Accepted 12 June 2019, Available online 13 June 2019, Version of Record 21 June 2019.

Paper URL: https://doi.org/10.1016/j.eswa.2019.06.024