Mixhead: Breaking the low-rank bottleneck in multi-head attention language models

Abstract

Transformer-based models have achieved significant advances in language modeling, and the multi-head attention mechanism plays an indispensable part in their success. However, the small head size imposed by the multi-head mechanism leads to the so-called low-rank bottleneck: the rank of the attention weight matrix is too small to represent arbitrary desired attention patterns. Naively increasing the head size is insufficient to solve the problem, because it causes a severe parameter explosion and overfitting. To tackle this problem, we propose mix-head attention (Mixhead), which mixes multiple attention heads with learnable mixing weights to improve the expressive power of the model. In contrast to naively enlarging the head size, Mixhead achieves a higher rank of the attention weight matrix while introducing only a negligible number of additional parameters. Furthermore, Mixhead is quite general and can be easily adopted in most multi-head-attention-based models. We conduct extensive experiments on language modeling, machine translation, and fine-tuning BERT to demonstrate the effectiveness of our method.
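
The abstract describes the mixing step only at a high level, so the sketch below is one plausible PyTorch reading of it rather than the authors' reference implementation. The class name `MixHeadAttention`, the parameter `mix`, the softmax normalization of the mixing coefficients, and the choice to mix the per-head attention matrices after their softmax are all assumptions made for illustration.

```python
# Minimal sketch: mixing the attention matrices of multiple heads with
# learnable mixing weights. Illustrative only; details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MixHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # Learnable mixing weights: one row of coefficients per output head,
        # mixing the attention matrices of all heads (only n_heads**2 extra
        # scalars). Identity init recovers standard multi-head attention.
        self.mix = nn.Parameter(torch.eye(n_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, time, d_head).
        q = q.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

        # Per-head attention weights; each matrix has rank at most d_head,
        # which is the low-rank bottleneck when d_head is small.
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)

        # Mix the per-head attention matrices with softmax-normalized
        # coefficients: a combination of low-rank matrices can reach a
        # higher rank than any single head.
        coeff = F.softmax(self.mix, dim=-1)             # (heads, heads)
        mixed = torch.einsum('gh,bhij->bgij', coeff, attn)

        y = mixed @ v                                    # (b, heads, t, d_head)
        y = y.transpose(1, 2).contiguous().view(b, t, -1)
        return self.out(y)


# Example: drop-in replacement for a standard multi-head attention layer.
layer = MixHeadAttention(d_model=512, n_heads=8)
y = layer(torch.randn(2, 16, 512))                       # (batch, time, d_model)
```

Row-normalizing the mixing coefficients keeps every mixed matrix a valid attention distribution (its rows still sum to one), and the identity initialization means the layer behaves exactly like standard multi-head attention before the mixing weights are trained.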

Keywords: Language model, Multi-head attention, Low-rank bottleneck

Article history: Received 3 June 2021, Revised 20 December 2021, Accepted 24 December 2021, Available online 5 January 2022, Version of Record 18 January 2022.

DOI: https://doi.org/10.1016/j.knosys.2021.108075