An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit

Authors:

Highlights:

• Document clustering with document embedding representations combined with k-means clustering delivered the best performance.

• The number of epochs required for optimal training of document embeddings is, in general, inversely proportional to document length.

• Document clusters can be interpreted by top terms extracted from combining TF-IDF scores with word embedding similarities.

• The Adjusted Rand Index and Adjusted Mutual Information are the most appropriate extrinsic evaluation measures for clustering.
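To make the last highlight concrete, the following is a minimal sketch of how the Adjusted Rand Index compares predicted cluster assignments against ground-truth labels. It is an illustration of the standard pair-counting formula, not the authors' evaluation code; the function name `adjusted_rand_index` and the toy label lists are assumptions introduced here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting Adjusted Rand Index (hypothetical helper, standard formula)."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")

    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two documents: trivially identical partitions

    # Contingency counts: how many documents share (true class, predicted cluster).
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Number of document pairs that fall together in each grouping.
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    # Expected agreement under random labelling, and the maximum attainable.
    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2

    if max_index == expected:
        return 1.0  # both partitions are all-singletons or a single cluster

    return (sum_cells - expected) / (max_index - expected)


# Toy example: cluster IDs are arbitrary, so a pure relabelling scores 1.0.
same_up_to_renaming = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
partial_agreement = adjusted_rand_index([0, 0, 1, 2], [0, 0, 1, 1])
```

Because the score is corrected for chance, a clustering that agrees with the labels no better than a random assignment scores around zero, and one that agrees less than chance scores below zero.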


Keywords: Document clustering, Topic modelling, Topic discovery, Embedding models, Online social networks

Review history: Received 22 September 2018, Revised 11 April 2019, Accepted 11 April 2019, Available online 17 April 2019, Version of Record 13 January 2020.

Official URL: https://doi.org/10.1016/j.ipm.2019.04.002