Simplification of English and Bengali Sentences for Improving Quality of Machine Translation

作者:Sainik Kumar Mahata, Avishek Garain, Dipankar Das, Sivaji Bandyopadhyay

摘要

Training translation systems with complex and compound sentences are generally considered computationally tough and such systems fail to process, the large syntactical information given out by these sentences. This issue subsequently, affects the overall quality of translations. On the other hand, simple sentences are shorter by nature and produce less syntactical information. Therefore, it would be safe to say that training translation systems using simple sentences only, would result in better translation output. However, training a translation system requires a large and quality parallel corpus involving two natural languages. While parallel corpus for various language pairs is abundant, such lexicons for low-resourced languages, consisting of only simple sentences are rare. In such a scenario, the development of such a parallel lexicon is the initial purpose of the present work. Building the same would require differentiating complex and/or compound sentences from the overall corpus and then converting them into simple sentences. Since, the work includes two languages, English and Bengali, different algorithms to accomplish the same, is documented in this paper. Converting complex and compound sentences to simple instances results in fragmenting sentences into two or more segments, which then needs to be aligned to make them semantically similar. Hence, a basic alignment technique has also been proposed to mitigate this problem. After developing the parallel corpus, we needed to check for its effectiveness in solving the quality issues of translation systems discussed earlier. For this, state-of-the-art translation modules like Statistical Machine Translation and Neural Machine Translation, have been trained using the developed corpus as well as with the raw parallel corpus consisting of sentences of mixed complexities. The performance of these translation models has been compared using automated as well as manual evaluation metrics. The results are promising and prove that translation systems do perform better when trained using simple sentence language pairs.

论文关键词:

论文评审过程:

论文官网地址:https://doi.org/10.1007/s11063-022-10755-3