HESML: A scalable ontology-based semantic similarity measures library with a set of reproducible experiments and a replication dataset

Authors:

Highlights:

• This work is a detailed companion reproducibility paper for the methods and experiments proposed in three previous works by Lastra-Díaz and García-Serrano. It introduces a set of reproducible word-similarity experiments, based on HESML and ReproZip, that aim to reproduce exactly the experimental surveys in those works.

• This work introduces a new representation model for taxonomies called PosetHERep, and a Java software library called Half-Edge Semantic Measures Library (HESML) based on it, which implements most ontology-based semantic similarity measures and Information Content (IC) models based on WordNet reported in the literature.

• PosetHERep is an adaptation of the half-edge data structure commonly used to represent discrete manifolds and planar graphs in computational geometry. It provides a memory-efficient representation for taxonomies that scales linearly with the size of the taxonomy, together with an efficient implementation of a large set of topological queries and graph-based algorithms.

• This work also introduces a replication framework and dataset, called WNSimRep v1, which is provided as supplementary material and whose aim is to assist the exact replication of most similarity measures and IC models reported in the literature.

• Finally, this work introduces an experimental survey on the performance and scalability of the most recent state-of-the-art semantic measures libraries. The survey confirms that HESML outperforms these libraries by a statistically significant margin in both performance and scalability, and that PosetHERep makes it possible to improve the performance and scalability of semantic measures libraries significantly without caching.
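
To make the half-edge idea behind PosetHERep more concrete, the following is a minimal sketch of a half-edge-style taxonomy in Java. It is an illustration under stated assumptions, not HESML's actual API or the PosetHERep model itself: the class name `HalfEdgeTaxonomySketch`, its methods, and the integer vertex ids are all hypothetical. Each "is-a" edge is stored as two oppositely oriented half-edges in flat arrays, so memory grows linearly with the number of vertices and edges, and parent, child, and ancestor queries are answered by walking the half-edges that leave a vertex.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical, simplified half-edge-style taxonomy; not HESML's actual API. */
class HalfEdgeTaxonomySketch {
    private final int[] firstOut;   // per vertex: one outgoing half-edge, or -1
    private final int[] target;     // per half-edge: the vertex it points to
    private final int[] nextOut;    // per half-edge: next half-edge leaving the same vertex, or -1
    private final boolean[] upward; // per half-edge: true if oriented child -> parent
    private int halfEdgeCount = 0;

    HalfEdgeTaxonomySketch(int vertexCount, int edgeCount) {
        firstOut = new int[vertexCount];
        Arrays.fill(firstOut, -1);
        target = new int[2 * edgeCount];
        nextOut = new int[2 * edgeCount];
        upward = new boolean[2 * edgeCount];
    }

    /** Adds one "is-a" edge as a pair of twin half-edges (indices 2k and 2k + 1). */
    void addIsA(int child, int parent) {
        pushHalfEdge(child, parent, true);
        pushHalfEdge(parent, child, false);
    }

    private void pushHalfEdge(int source, int dest, boolean up) {
        int h = halfEdgeCount++;
        target[h] = dest;
        upward[h] = up;
        nextOut[h] = firstOut[source]; // prepend to the source vertex's list
        firstOut[source] = h;
    }

    /** Twin of half-edge h: the oppositely oriented half of the same edge. */
    static int twin(int h) {
        return h ^ 1;
    }

    private List<Integer> neighbours(int v, boolean up) {
        List<Integer> result = new ArrayList<>();
        for (int h = firstOut[v]; h != -1; h = nextOut[h]) {
            if (upward[h] == up) {
                result.add(target[h]);
            }
        }
        return result;
    }

    List<Integer> parents(int v) {
        return neighbours(v, true);
    }

    List<Integer> children(int v) {
        return neighbours(v, false);
    }

    /** All ancestors of v (excluding v itself), in ascending id order. */
    TreeSet<Integer> ancestors(int v) {
        TreeSet<Integer> seen = new TreeSet<>();
        ArrayDeque<Integer> pending = new ArrayDeque<>(parents(v));
        while (!pending.isEmpty()) {
            int u = pending.pop();
            if (seen.add(u)) {
                pending.addAll(parents(u));
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // Toy taxonomy: 0 = entity, 1 = animal, 2 = pet, 3 = dog
        HalfEdgeTaxonomySketch t = new HalfEdgeTaxonomySketch(4, 4);
        t.addIsA(1, 0);
        t.addIsA(2, 0);
        t.addIsA(3, 1);
        t.addIsA(3, 2);
        System.out.println("parents(dog) = " + t.parents(3));
        System.out.println("children(entity) = " + t.children(0));
        System.out.println("ancestors(dog) = " + t.ancestors(3));
    }
}
```

Flat integer arrays are used here so that storage stays proportional to the taxonomy size and no per-edge objects are allocated. A full half-edge structure would also keep explicit ordering around each vertex, which this sketch omits.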

Keywords: HESML, PosetHERep, Semantic measures library, Ontology-based semantic similarity measures, Intrinsic and corpus-based Information Content models, Reproducible experiments on word similarity, WNSimRep v1 dataset, ReproZip, WordNet-based semantic similarity measures

Review history: Received 26 July 2016, Revised 6 February 2017, Accepted 10 February 2017, Available online 21 February 2017, Version of Record 22 March 2017.

Official paper URL: https://doi.org/10.1016/j.is.2017.02.002