Reducing explicit semantic representation vectors using Latent Dirichlet Allocation
Abstract:
Explicit Semantic Analysis (ESA) is a knowledge-based method that builds semantic representations of words from the textual descriptions of concepts in a given knowledge source. Owing to its simplicity and success, ESA has received wide attention from researchers in computational linguistics and information retrieval. However, the representation vectors produced by ESA are generally very large, high-dimensional, and may contain many redundant concepts. In this paper, we introduce a reduced semantic representation method that constructs the semantic interpretation of words as vectors over latent topics derived from the original ESA representation vectors. To model the latent topics, Latent Dirichlet Allocation (LDA) is adapted to the ESA vectors, extracting topics as probability distributions over concepts rather than over words as in the traditional model. The proposed method is applied to two knowledge sources widely used in computational semantic analysis: WordNet and Wikipedia. For evaluation, we use the proposed method in two natural language processing tasks: measuring semantic relatedness between words/texts and text clustering. The experimental results indicate that the proposed method overcomes the limitations of the ESA representation.
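The core idea in the abstract — fitting LDA over ESA concept vectors so that each latent topic becomes a distribution over concepts instead of words — can be sketched as follows. This is a minimal illustration with random toy data, not the authors' implementation: the matrix shape, sparsity threshold, and topic count are all assumptions, and real ESA vectors would come from WordNet or Wikipedia concept descriptions.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy ESA matrix: rows = words, columns = knowledge-source concepts.
# Entries stand in for word-concept association scores; in the paper
# these would be built from WordNet/Wikipedia textual descriptions.
rng = np.random.default_rng(0)
esa = rng.random((20, 50))          # 20 words x 50 concepts (hypothetical sizes)
esa[esa < 0.7] = 0.0                # ESA vectors are typically sparse

# Fit LDA treating concepts (not words) as the "vocabulary", so each
# latent topic is a probability distribution over concepts.
lda = LatentDirichletAllocation(n_components=5, random_state=0)
reduced = lda.fit_transform(esa)    # words x topics: the reduced vectors

print(reduced.shape)                # (20, 5)

# Topic-over-concept distributions (normalize rows of components_):
topics = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
print(topics.shape)                 # (5, 50)
```

The reduced word-by-topic vectors can then be compared with cosine similarity for the semantic-relatedness task, or fed to a clustering algorithm for the text-clustering task, replacing the much larger raw ESA vectors.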
Keywords: Semantic representation, Explicit Semantic Analysis, Topic modeling, Knowledge-based method
Article history: Received 18 February 2015, Revised 3 March 2016, Accepted 4 March 2016, Available online 10 March 2016, Version of Record 2 April 2016.
DOI: https://doi.org/10.1016/j.knosys.2016.03.002