Nasari: Integrating explicit knowledge and corpus statistics for a multilingual representation of concepts and entities

Authors:

Abstract

Owing to the need for a deep understanding of linguistic items, semantic representation is considered to be one of the fundamental components of several applications in Natural Language Processing and Artificial Intelligence. As a result, semantic representation has been one of the prominent research areas in lexical semantics over the past decades. However, due mainly to the lack of large sense-annotated corpora, most existing representation techniques are limited to the lexical level and thus cannot be effectively applied to individual word senses. In this paper we put forward a novel multilingual vector representation, called Nasari, which not only enables accurate representation of word senses in different languages, but also provides two main advantages over existing approaches: (1) high coverage, including both concepts and named entities, and (2) comparability across languages and linguistic levels (i.e., words, senses and concepts), thanks to the representation of linguistic items in a single unified semantic space and in a joint embedded space, respectively. Moreover, our representations are flexible, can be applied to multiple applications and are freely available at http://lcl.uniroma1.it/nasari/. As an evaluation benchmark, we opted for four different tasks, namely word similarity, sense clustering, domain labeling, and Word Sense Disambiguation, for each of which we report state-of-the-art performance on several standard datasets across different languages.

Keywords: Semantic representation, Lexical semantics, Word Sense Disambiguation, Semantic similarity, Sense clustering, Domain labeling

Article history: Received 23 December 2015, Revised 14 July 2016, Accepted 25 July 2016, Available online 16 August 2016, Version of Record 26 August 2016.

Paper URL: https://doi.org/10.1016/j.artint.2016.07.005