Image captioning with adaptive incremental global context attention

Authors: Changzhi Wang, Xiaodong Gu

Abstract

The encoder-decoder framework has become prevalent in image captioning, where the decoder generates the target description word by word based on the previously generated words. However, this framework faces two main concerns. First, the decoder cannot adequately capture the global dependencies between the current target word and all previously generated words. Second, some generated words (e.g., "on", "the" and "of") carry little information, which may cause the generated caption to drift away from the sentence semantics as it is produced. To address these concerns, in this paper we propose a novel adaptive incremental global context attention (IGCA) method that captures global information between target words, thereby improving target-word prediction in image captioning. Specifically, all previous decoder hidden states are utilized as a global feature to guide the generation of the subsequent word. During generation, the proposed IGCA mechanism dynamically focuses on the text features that are most correlated with the currently generated word. To verify the effectiveness of our IGCA model, we conducted extensive experiments on three public benchmark datasets. The experimental results demonstrate that the proposed model brings significant improvements over conventional attention-based encoder-decoder methods and achieves state-of-the-art performance on the Flickr30k and Flickr8k datasets.
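The core idea — attending over all previously generated decoder hidden states to form a global context for the next word — can be illustrated with a minimal sketch. Note this is an assumption-laden simplification: the scoring function here is plain dot-product attention, whereas the paper's actual IGCA mechanism uses learned, adaptive parameters; the function name and shapes are hypothetical.

```python
import numpy as np

def incremental_global_context(h_t, prev_states):
    """Attend over all previous decoder hidden states (hypothetical sketch).

    h_t:         current decoder hidden state, shape (d,)
    prev_states: stacked previous hidden states, shape (t, d)

    Returns the attention-weighted global context vector and the weights.
    Scoring is simple dot-product attention; the real IGCA model learns
    an adaptive scoring function.
    """
    scores = prev_states @ h_t              # similarity of h_t to each past state, shape (t,)
    scores -= scores.max()                  # subtract max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over time steps
    context = weights @ prev_states         # weighted sum of past states, shape (d,)
    return context, weights

# Usage: three past hidden states, current state closest to the first one
h_t = np.array([1.0, 0.0])
prev = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [0.5, 0.5]])
context, weights = incremental_global_context(h_t, prev)
```

In a full decoder, `context` would be fused with `h_t` (e.g., concatenated or gated) before the output projection, so each prediction is conditioned on the whole generation history rather than only the most recent state.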

Keywords: Image captioning, Attention mechanism, Incremental global context attention, LSTM


Paper URL: https://doi.org/10.1007/s10489-021-02734-3