Image Captioning with Text-Based Visual Attention
Authors: Chen He, Haifeng Hu
Abstract
Attention mechanisms have attracted considerable interest in image captioning due to their powerful performance. However, many visual attention models fail to consider the correlation between the image and the textual context, which may produce attention vectors containing irrelevant annotation vectors. To overcome this limitation, we propose a new text-based visual attention (TBVA) model that automatically focuses on salient objects by eliminating information irrelevant to the previously generated text. The proposed end-to-end caption generation model adopts a multimodal recurrent neural network architecture. We leverage a transposed weight sharing scheme to achieve better performance while reducing the number of parameters. The effectiveness of our model is validated on MS COCO and Flickr30k. The results show that TBVA outperforms state-of-the-art image captioning methods.
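The core idea of conditioning visual attention on previously generated text can be illustrated with a minimal sketch. The function below is a hypothetical soft-attention layer, not the authors' exact formulation: it scores each image annotation vector jointly with a text-context embedding, so regions unrelated to the generated words receive low weight. All parameter names (`W_a`, `W_t`, `w`) and shapes are assumptions for illustration.

```python
import numpy as np

def text_based_visual_attention(annotations, text_context, W_a, W_t, w):
    """Hypothetical sketch of text-conditioned soft attention.

    annotations:  (k, d)  k image annotation vectors
    text_context: (m,)    embedding of previously generated text
    W_a: (h, d), W_t: (h, m), w: (h,)  assumed learned projections
    """
    # Score each annotation vector jointly with the text context,
    # so annotations irrelevant to the text receive low scores.
    scores = np.tanh(annotations @ W_a.T + (W_t @ text_context)) @ w  # (k,)
    # Softmax normalizes scores into attention weights.
    alphas = np.exp(scores - scores.max())
    alphas /= alphas.sum()
    # The attention vector is the weighted sum of annotation vectors.
    return alphas @ annotations  # shape (d,)

# Toy usage with random parameters
rng = np.random.default_rng(0)
k, d, m, h = 5, 8, 6, 4
ctx = text_based_visual_attention(
    rng.standard_normal((k, d)), rng.standard_normal(m),
    rng.standard_normal((h, d)), rng.standard_normal((h, m)),
    rng.standard_normal(h))
print(ctx.shape)
```

In a full captioning model, the returned attention vector would be fed, together with the word embedding, into the recurrent decoder at each time step.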
Keywords: Image captioning, Multimodal recurrent neural network, Text-based visual attention, Transposed weight sharing
Review process:
Paper URL: https://doi.org/10.1007/s11063-018-9807-7