Towards accurate predictors of word quality for Machine Translation: Lessons learned on French–English and English–Spanish systems

Abstract

This paper proposes ways to build effective estimators that predict the quality of individual words in Machine Translation (MT) output. We introduce a number of novel features of various types (system-based, lexical, syntactic and semantic) and integrate them with the conventional, previously used feature set to train our baseline classifier. The classifiers are built on two bilingual corpora: French–English (fr–en) and English–Spanish (en–es). After experimenting with all features, we apply a feature selection strategy to retain the best-performing ones. We then combine multiple “weak” classifiers into a strong “composite” classifier that takes advantage of their complementarity, which yields a significant improvement in F-score for both the fr–en and en–es systems. Finally, we use word confidence scores to improve the quality estimation system at sentence level.
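
To make the idea of combining complementary “weak” classifiers concrete, here is a minimal sketch using discrete AdaBoost over a fixed pool of word-level classifiers. It is an illustration under stated assumptions, not the authors' implementation: the abstract only lists Boosting as a keyword, so the choice of AdaBoost, the function names `adaboost_combine` and `predict`, the label encoding (+1 for a good word, -1 for a bad word) and the toy data are all hypothetical.

```python
import math


def adaboost_combine(weak_preds, labels, rounds=10):
    """Weight a fixed pool of weak classifiers with discrete AdaBoost.

    weak_preds[j][i] is classifier j's prediction (+1 or -1) for word i.
    labels[i] is the reference label (+1 or -1) for word i.
    Returns a list of (alpha, classifier_index) pairs.
    """
    n = len(labels)
    weights = [1.0 / n] * n  # start with uniform word weights
    ensemble = []
    for _ in range(rounds):
        # Weighted error of every weak classifier under the current weights.
        errors = [
            sum(w for w, p, y in zip(weights, preds, labels) if p != y)
            for preds in weak_preds
        ]
        best = min(range(len(weak_preds)), key=errors.__getitem__)
        eps = errors[best]
        if eps >= 0.5:  # no classifier is better than chance: stop
            break
        eps = max(eps, 1e-10)  # avoid division by zero for a perfect classifier
        alpha = 0.5 * math.log((1.0 - eps) / eps)
        ensemble.append((alpha, best))
        # Raise the weight of misclassified words, lower the rest, renormalise.
        weights = [
            w * math.exp(-alpha * y * p)
            for w, p, y in zip(weights, weak_preds[best], labels)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, weak_preds, i):
    """Composite prediction for word i: sign of the alpha-weighted vote."""
    score = sum(alpha * weak_preds[j][i] for alpha, j in ensemble)
    return 1 if score >= 0 else -1


# Toy data (hypothetical): three classifiers, each wrong on a different word.
labels = [1, 1, 1, -1, -1, -1]
weak_preds = [
    [-1, 1, 1, -1, -1, -1],  # wrong on word 0
    [1, 1, 1, 1, -1, -1],    # wrong on word 3
    [1, 1, 1, -1, -1, 1],    # wrong on word 5
]
ensemble = adaboost_combine(weak_preds, labels, rounds=3)
combined = [predict(ensemble, weak_preds, i) for i in range(len(labels))]
```

In this toy setting each weak classifier mislabels one word, but because their errors do not overlap, the weighted vote labels every word correctly. That is the kind of complementarity the abstract refers to.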

Keywords: Machine Translation, Confidence measure, Confidence Estimation, Conditional Random Fields, Boosting

Review history: Available online 11 April 2015, Version of Record 28 May 2015.

Paper URL: https://doi.org/10.1016/j.datak.2015.04.003