Taxonomic data integration from multilingual Wikipedia editions

作者:Gerard de Melo, Gerhard Weikum

摘要

Information systems are increasingly making use of taxonomic knowledge about words and entities. A taxonomic knowledge base may reveal that the Lago di Garda is a lake and that lakes as well as ponds, reservoirs, and marshes are all bodies of water. As the number of available taxonomic knowledge sources grows, there is a need for techniques to integrate such data into combined, unified taxonomies. In particular, the Wikipedia encyclopedia has been used by a number of projects, but its multilingual nature has largely been neglected. This paper investigates how entities from all editions of Wikipedia as well as WordNet can be integrated into a single coherent taxonomic class hierarchy. We rely on linking heuristics to discover potential taxonomic relationships, graph partitioning to form consistent equivalence classes of entities, and a Markov chain-based ranking approach to construct the final taxonomy. This results in MENTA (Multilingual Entity Taxonomy), a resource that describes 5.4 million entities and is one of the largest multilingual lexical knowledge bases currently available.

论文关键词:Taxonomy induction, Multilingual, Graph, Ranking

论文评审过程:

论文官网地址:https://doi.org/10.1007/s10115-012-0597-3