Collaborative strategy network for spatial attention image captioning
Authors: Dongming Zhou, Jing Yang, Riqiang Bao
Abstract
Automatic image captioning lies at the intersection of computer vision and natural language processing. Although image captioning based on reinforcement learning has made significant progress in the past few years, the evaluation metrics used for training and testing remain inconsistent. Reinforcement learning optimizes a single metric, so the captions generated by the model are monotonous and lack distinctive characteristics, and the model cannot reflect the diversity among images. To address these problems, we design a novel image captioning model based on lightweight spatial attention and a generative adversarial network. The lightweight spatial attention module discards the coarse-grained approach of max pooling after convolution and instead transforms the spatial information to preserve key information in the feature map. Then, the game mechanism between the generator and the discriminator is used to optimize the evaluation metric of the model. Finally, we design a discriminator network that cooperates with reinforcement learning to update the model parameters and to reduce the inconsistency between the language metrics used in training and in testing. We verified the effectiveness of the proposed model on the MS-COCO and Flickr 30K datasets. The experimental results show that the proposed model achieves state-of-the-art results.
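The abstract does not state how the discriminator and reinforcement learning are combined. As a rough illustration only, one common pattern in adversarial captioning is to blend a language-metric score with the discriminator's probability that a caption is human-written, then use the blended value as the reward in a policy-gradient update with a baseline. The sketch below assumes that pattern; the function names `mixed_reward` and `policy_gradient_loss` and the mixing weight `lam` are hypothetical and are not taken from the paper.

```python
import math


def mixed_reward(metric_score, disc_prob, lam=0.5):
    """Blend a language-metric score with a discriminator probability.

    ASSUMPTION: `lam` is a hypothetical mixing weight; the abstract does
    not specify how (or whether) the two signals are weighted.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    return lam * disc_prob + (1.0 - lam) * metric_score


def policy_gradient_loss(token_log_probs, sampled_reward, baseline_reward):
    """REINFORCE-style loss with a baseline for one sampled caption.

    ASSUMPTION: the baseline is the reward of a reference caption
    (for example a greedy decode); the paper may use a different one.
    """
    advantage = sampled_reward - baseline_reward
    # Minimizing this raises the log-probability of captions whose
    # reward exceeds the baseline and lowers it otherwise.
    return -advantage * sum(token_log_probs)


# Toy numbers, chosen only to exercise the functions.
log_probs = [math.log(0.5), math.log(0.25)]
r_sampled = mixed_reward(metric_score=0.8, disc_prob=0.6)
r_baseline = mixed_reward(metric_score=0.6, disc_prob=0.4)
loss = policy_gradient_loss(log_probs, r_sampled, r_baseline)
```

With `lam = 0` the reward reduces to the single-metric setting the abstract criticizes; with `lam = 1` only the discriminator signal remains.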
Keywords: Generative adversarial network, Attention mechanism, Image captioning, Reinforcement learning
Paper URL: https://doi.org/10.1007/s10489-021-02943-w