Deep Transformer modeling via grouping skip connection for neural machine translation
Authors:
Highlights:
Abstract
Most deep neural machine translation (NMT) models follow a bottom-up feedforward fashion, in which representations in lower layers construct or modulate representations in higher layers. We conjecture that this unidirectional encoding fashion could be a potential issue in building a deep NMT model. In this paper, we propose to build a deeper Transformer encoder by organizing encoder layers into multiple groups, which are connected via a grouping skip connection mechanism. The output of each group is then fed into subsequent groups to build a deep Transformer encoder. In this way, we successfully build a deep Transformer encoder with up to 48 layers. Moreover, we can share parameters among groups to extend the (virtual) encoder depth without introducing additional parameters. Detailed experimentation on the large-scale WMT (Workshop on Machine Translation) 2014 English-to-German and English-to-French, WMT 2016 English-to-German, and WMT 2017 Chinese-to-English translation tasks demonstrates that our proposed deep Transformer model significantly outperforms the strong Transformer baseline. Furthermore, we carry out linguistic probing tasks to analyze the problems in the original Transformer model and to explain how our deep Transformer encoder improves translation quality. A particularly nice property of our approach is that it is remarkably easy to implement. We make our code available on GitHub at https://github.com/liyc7711/deep-nmt.
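The abstract does not spell out the exact grouping skip connection formulation, but the overall idea (encoder layers organized into groups, each group's output fed forward into later groups, with optional parameter sharing across groups) can be illustrated with a minimal PyTorch sketch. The class name GroupedTransformerEncoder, the sum-based combination of earlier group outputs, and the layers_per_group/num_groups parameters are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
import torch
import torch.nn as nn


class GroupedTransformerEncoder(nn.Module):
    """Sketch: encoder layers organized into groups; each group receives a
    skip-connected combination of all earlier group outputs (assumed: sum)."""

    def __init__(self, d_model=512, nhead=8, layers_per_group=6, num_groups=8,
                 share_group_params=False):
        super().__init__()

        def make_group():
            layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=layers_per_group)

        if share_group_params:
            # Sharing one group's parameters across all groups extends the
            # "virtual" depth without adding parameters, as the abstract notes.
            shared = make_group()
            self.groups = nn.ModuleList([shared] * num_groups)
        else:
            self.groups = nn.ModuleList([make_group() for _ in range(num_groups)])

    def forward(self, x):
        group_outputs = [x]  # the input embeddings act as the first group output
        for group in self.groups:
            # Hypothetical grouping skip connection: each group consumes the
            # sum of all earlier group outputs (the paper's scheme may differ).
            group_input = torch.stack(group_outputs, dim=0).sum(dim=0)
            group_outputs.append(group(group_input))
        return group_outputs[-1]


if __name__ == "__main__":
    # 8 groups of 6 layers gives the 48-layer encoder mentioned in the abstract.
    enc = GroupedTransformerEncoder(layers_per_group=6, num_groups=8)
    out = enc(torch.randn(2, 10, 512))
    print(out.shape)  # torch.Size([2, 10, 512])
```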
Keywords: Neural machine translation, Grouping skip connection, Deep NMT, Transformer
Article history: Received 23 April 2021, Revised 28 September 2021, Accepted 30 September 2021, Available online 6 October 2021, Version of Record 21 October 2021.
DOI: https://doi.org/10.1016/j.knosys.2021.107556