Is neural always better? SMT versus NMT for Dutch text normalization

Authors:

Highlights:

• The Statistical Machine Translation approach achieved the best results.

• Data augmentations increased the number of errors for the neural approach.

• Byte-pair encoding is not viable in low-resource settings.

• The CopyNet algorithm drastically reduces the number of over-normalizations.

• The CopyNet algorithm is capable of correcting around 87% of the required edits.

Keywords: Normalization, Low-resource, SMT, NMT, Over-normalization, CopyNet

Review history: Received 20 August 2020, Revised 14 November 2020, Accepted 16 December 2020, Available online 24 December 2020, Version of Record 15 January 2021.

Official paper URL: https://doi.org/10.1016/j.eswa.2020.114500