Learning document representation via topic-enhanced LSTM model
Authors:
Abstract
Document representation plays an important role in the fields of text mining, natural language processing, and information retrieval. Traditional approaches to document representation may suffer from disregarding the correlations or order of words in a document, due to the unrealistic assumption of word independence or exchangeability. Recently, long short-term memory (LSTM) based recurrent neural networks have been shown to be effective at preserving local contextual sequential patterns of words in a document, but using the LSTM model alone may not be adequate to capture the global topical semantics needed for learning document representation. In this work, we propose a new topic-enhanced LSTM model to deal with the document representation problem. We first employ an attention-based LSTM model to generate hidden representations of the word sequence in a given document. Then, we introduce a latent topic modeling layer with a similarity constraint on the local hidden representations, and build a tree-structured LSTM on top of the topic layer to generate a semantic representation of the document. We evaluate our model on typical text mining applications, i.e., document classification, topic detection, information retrieval, and document clustering. Experimental results on real-world datasets show the benefit of our innovations over state-of-the-art baseline methods.
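The attention step described above, which pools word-level LSTM hidden states into a single document-level vector, can be sketched as follows. This is a minimal numpy illustration of generic attention pooling, not the authors' implementation; the scoring vector `w` and all variable names are assumptions:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(H, w):
    """Pool hidden states H (T x d) into one document vector.

    scores = H @ w, alpha = softmax(scores),
    doc_vec = sum_t alpha[t] * H[t]  (a convex combination of rows of H).
    """
    alpha = softmax(H @ w)
    return alpha @ H, alpha

# Toy example: 5 time steps, hidden size 4 (random stand-ins for LSTM states).
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))
w = rng.normal(size=4)
doc_vec, alpha = attention_pool(H, w)
```

The attention weights `alpha` are nonnegative and sum to one, so `doc_vec` stays in the convex hull of the hidden states; in the full model, this pooled vector would feed the latent topic layer rather than serve directly as the document representation.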
Keywords: Document representation, Deep learning, Long short-term memory, Topic modeling
Article history: Received 4 June 2018, Revised 4 March 2019, Accepted 10 March 2019, Available online 14 March 2019, Version of Record 18 April 2019.
Official article page: https://doi.org/10.1016/j.knosys.2019.03.007