Adaptive Syncretic Attention for Constrained Image Captioning

Authors: Liang Yang, Haifeng Hu

Abstract

Deep learning approaches to image captioning have recently attracted considerable attention and made rapid progress. In this paper, we propose a novel model that simultaneously learns a better representation of images and the relationship between visual and semantic information. The model consists of three parts: an Adaptive Syncretic Attention (ASA) mechanism, an LSTM + MLP mimic constraint network, and a multimodal layer. In the ASA, we integrate local semantic features captured by a region proposal network with time-varying global visual features through an attention mechanism. The LSTM + MLP mimic constraint network combines a Multilayer Perceptron (MLP) with a Long Short-Term Memory (LSTM) model; at test time, it generates a Mimic Constraint Vector for each test image. Finally, the multimodal layer combines textual and visual information. With these three components, the full model both captures meaningful local features and generates sentences that are more relevant to the image content. We evaluate our model on two popular datasets, Flickr30k and MSCOCO. The results show that each module contributes to performance, and the full model is on par with, or better than, state-of-the-art methods.
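The abstract only sketches how the ASA mechanism fuses local and global features. As a rough, non-authoritative illustration, the PyTorch snippet below shows one plausible attention fusion: region-proposal features are attended with a query built from the decoder's hidden state and a time-varying global feature. All module names, dimensions, and the concatenation-based fusion are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class AdaptiveSyncreticAttention(nn.Module):
    """Illustrative sketch of an ASA-style fusion (hypothetical, not the
    paper's published equations): attend over local region features with
    a query formed from the LSTM hidden state and a global visual feature,
    then fuse the attended context with the global feature."""

    def __init__(self, region_dim, global_dim, hidden_dim, attn_dim):
        super().__init__()
        self.w_r = nn.Linear(region_dim, attn_dim)                # project local region features
        self.w_q = nn.Linear(hidden_dim + global_dim, attn_dim)   # query from LSTM state + global feature
        self.w_a = nn.Linear(attn_dim, 1)                         # scalar attention score per region

    def forward(self, regions, global_feat, hidden):
        # regions: (B, K, region_dim); global_feat: (B, global_dim); hidden: (B, hidden_dim)
        query = self.w_q(torch.cat([hidden, global_feat], dim=-1))              # (B, attn_dim)
        scores = self.w_a(torch.tanh(self.w_r(regions) + query.unsqueeze(1)))   # (B, K, 1)
        alpha = torch.softmax(scores, dim=1)                                    # weights over K regions
        attended = (alpha * regions).sum(dim=1)                                 # (B, region_dim)
        # "syncretic" output: attended local context concatenated with the global feature
        return torch.cat([attended, global_feat], dim=-1)

if __name__ == "__main__":
    asa = AdaptiveSyncreticAttention(region_dim=2048, global_dim=2048,
                                     hidden_dim=512, attn_dim=512)
    ctx = asa(torch.randn(2, 36, 2048), torch.randn(2, 2048), torch.randn(2, 512))
    print(ctx.shape)  # torch.Size([2, 4096])
```

In a decoder of this kind, the fused context vector would typically be fed, together with the word embedding, into the LSTM at each time step; the exact interface to the mimic constraint network and multimodal layer is not specified in the abstract.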

Keywords: Image captioning, Visual attention, Convolutional neural network, Recurrent neural network


Paper link: https://doi.org/10.1007/s11063-019-10045-5