Image captioning using DenseNet network and adaptive attention
Abstract:
In image captioning, it is difficult to correctly extract the global features of an image. Moreover, most attention methods force every generated word to correspond to an image region, ignoring the fact that words such as "the" in the caption text have no corresponding image region. To address these problems, this paper proposes an adaptive attention model with a visual sentinel. In the encoding phase, the model uses DenseNet to extract the global features of the image. At each time step, the adaptive attention mechanism sets a sentinel gate that decides whether to use the image feature information when generating the next word. In the decoding phase, a long short-term memory (LSTM) network serves as the language model, improving the quality of the generated captions. Experiments on the Flickr30k and COCO datasets show that the proposed model achieves significant improvements on the BLEU and METEOR evaluation metrics.
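To make the sentinel mechanism concrete, below is a minimal PyTorch sketch of one adaptive attention step as described in the abstract. It follows the standard visual-sentinel formulation (sentinel s_t = g_t ⊙ tanh(m_t), scored as an extra attention slot alongside the image regions); the module names, layer sizes, and the assumption that DenseNet region features are already projected to the decoder's hidden size are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class VisualSentinel(nn.Module):
    """Sentinel gate: s_t = g_t * tanh(m_t), with g_t = sigmoid(W_x x_t + W_h h_{t-1}).

    The sentinel is a fallback "memory" vector the decoder can attend to
    instead of any image region (e.g., when emitting words like "the").
    Illustrative sketch; layer sizes are assumptions.
    """

    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.w_x = nn.Linear(input_dim, hidden_dim, bias=False)
        self.w_h = nn.Linear(hidden_dim, hidden_dim, bias=False)

    def forward(self, x_t, h_prev, m_t):
        g_t = torch.sigmoid(self.w_x(x_t) + self.w_h(h_prev))  # sentinel gate
        return g_t * torch.tanh(m_t)                           # visual sentinel s_t


class AdaptiveAttention(nn.Module):
    """Attends over k region features plus the sentinel as a (k+1)-th slot."""

    def __init__(self, hidden_dim, att_dim):
        super().__init__()
        self.w_v = nn.Linear(hidden_dim, att_dim, bias=False)  # image regions
        self.w_g = nn.Linear(hidden_dim, att_dim, bias=False)  # decoder state
        self.w_s = nn.Linear(hidden_dim, att_dim, bias=False)  # sentinel
        self.w_a = nn.Linear(att_dim, 1, bias=False)           # scoring vector

    def forward(self, V, h_t, s_t):
        # V: (B, k, hidden_dim) region features (assumed pre-projected)
        # h_t, s_t: (B, hidden_dim) decoder hidden state and visual sentinel
        g = self.w_g(h_t).unsqueeze(1)                                # (B, 1, att)
        z_v = self.w_a(torch.tanh(self.w_v(V) + g)).squeeze(-1)      # (B, k)
        z_s = self.w_a(torch.tanh(self.w_s(s_t).unsqueeze(1) + g)).squeeze(-1)  # (B, 1)
        alpha = torch.softmax(torch.cat([z_v, z_s], dim=1), dim=1)   # (B, k+1)
        beta = alpha[:, -1:]                            # weight given to the sentinel
        c_t = (alpha[:, :-1].unsqueeze(-1) * V).sum(1)  # visual context vector
        c_hat = beta * s_t + (1.0 - beta) * c_t         # adaptive context
        return c_hat, beta


# Illustrative shapes only: batch of 2, 49 regions, 512-d hidden state.
B, k, hidden, att, emb = 2, 49, 512, 512, 300
sentinel = VisualSentinel(input_dim=emb, hidden_dim=hidden)
attend = AdaptiveAttention(hidden_dim=hidden, att_dim=att)

V = torch.randn(B, k, hidden)        # projected DenseNet region features
x_t = torch.randn(B, emb)            # current word embedding
h_prev, h_t, m_t = (torch.randn(B, hidden) for _ in range(3))

s_t = sentinel(x_t, h_prev, m_t)
c_hat, beta = attend(V, h_t, s_t)
print(c_hat.shape, beta.shape)       # torch.Size([2, 512]) torch.Size([2, 1])
```

A beta near 1 means the word is generated mostly from the language model's memory rather than the image, which is exactly the behavior the abstract motivates for non-visual words.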
Keywords: Image captioning, DenseNet, LSTM, Adaptive attention mechanism
Article history: Received 21 June 2019, Revised 27 November 2019, Accepted 15 March 2020, Available online 19 March 2020, Version of Record 16 April 2020.
DOI: https://doi.org/10.1016/j.image.2020.115836