Framework for syntactic string similarity measures
Authors:
Highlights:
• Token-level measures outperform character-level measures when the order of the words varies.
• Q-grams provide a good compromise between token- and character-level measures.
• Token-level measures are significantly outperformed by their soft variants.
• Soft measures based on set-matching methods perform best when using q-grams at the character level.
• The performance of similarity measures varies depending on the type of dataset.
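The distinctions drawn in the highlights can be illustrated with a minimal Python sketch. It is not the framework from the paper: all function names are hypothetical, `difflib.SequenceMatcher.ratio` is used as a stand-in for a character-level measure, Jaccard is assumed as the set-based measure, and the soft variant uses a simple greedy matching with an arbitrarily chosen threshold of 0.8.

```python
from difflib import SequenceMatcher


def char_similarity(a: str, b: str) -> float:
    """Character-level similarity (difflib ratio, a stand-in for edit-distance measures)."""
    return SequenceMatcher(None, a, b).ratio()


def jaccard(x: set, y: set) -> float:
    """Jaccard coefficient of two sets; two empty sets count as identical."""
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)


def tokens(s: str) -> set:
    """Whitespace-separated, lowercased tokens."""
    return set(s.lower().split())


def qgrams(s: str, q: int = 2) -> set:
    """Set of overlapping character q-grams; strings shorter than q yield themselves."""
    s = s.lower()
    if len(s) < q:
        return {s}
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def token_jaccard(a: str, b: str) -> float:
    """Token-level measure: exact token matches only."""
    return jaccard(tokens(a), tokens(b))


def qgram_jaccard(a: str, b: str, q: int = 2) -> float:
    """Q-gram measure: Jaccard over character q-grams of the whole string."""
    return jaccard(qgrams(a, q), qgrams(b, q))


def soft_token_jaccard(a: str, b: str, threshold: float = 0.8) -> float:
    """Soft token-level measure: tokens match when their character-level
    similarity reaches the (assumed) threshold. Greedy one-to-one matching."""
    ta, tb = sorted(tokens(a)), sorted(tokens(b))
    unmatched = list(tb)
    matches = 0
    for t in ta:
        if not unmatched:
            break
        best = max(unmatched, key=lambda u: char_similarity(t, u))
        if char_similarity(t, best) >= threshold:
            matches += 1
            unmatched.remove(best)
    union = len(ta) + len(tb) - matches
    return matches / union if union else 1.0


if __name__ == "__main__":
    reordered = ("john smith", "smith john")
    misspelled = ("john smith", "jon smith")
    for name, fn in [("char", char_similarity), ("token", token_jaccard),
                     ("q-gram", qgram_jaccard), ("soft token", soft_token_jaccard)]:
        print(f"{name:10s} reordered={fn(*reordered):.3f} misspelled={fn(*misspelled):.3f}")
```

On the reordered pair, the token-level score is unaffected by word order while the character-level score drops, with the q-gram score falling in between. On the misspelled pair, the exact token-level score is penalized for the single typo, whereas the soft variant still matches "john" with "jon".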
Keywords: Similarity measure, String similarity, Information retrieval, Text processing
Review history: Received 17 September 2018, Revised 4 March 2019, Accepted 27 March 2019, Available online 2 April 2019, Version of Record 10 April 2019.
Official paper page: https://doi.org/10.1016/j.eswa.2019.03.048