Is neural always better? SMT versus NMT for Dutch text normalization
Authors:
Highlights:
• The Statistical Machine Translation approach achieved the best results.
• Data augmentations increased the number of errors for the neural approach.
• Byte-pair encoding is not viable in low-resource settings.
• The CopyNet algorithm drastically reduces the number of overnormalizations.
• The CopyNet algorithm is capable of correcting around 87% of the required edits.
Keywords: Normalization, Low-resource, SMT, NMT, Over-normalization, CopyNet
Review history: Received 20 August 2020, Revised 14 November 2020, Accepted 16 December 2020, Available online 24 December 2020, Version of Record 15 January 2021.
Official paper URL: https://doi.org/10.1016/j.eswa.2020.114500