Deep Captioning with Attention-Based Visual Concept Transfer Mechanism for Enriching Description

Authors: Junxuan Zhang, Haifeng Hu

Abstract

In this paper, we propose a novel deep captioning framework called Attention-based multimodal recurrent neural network with Visual Concept Transfer Mechanism (A-VCTM). The proposed A-VCTM has three advantages. (1) A multimodal layer integrates the visual representation with the context representation, building a bridge that connects context information directly with visual information. (2) An attention mechanism leads the model to focus on the image regions corresponding to the next word to be generated. (3) A visual concept transfer mechanism generates novel visual concepts and enriches the description sentences. Qualitative and quantitative results on two standard benchmarks, MSCOCO and Flickr30K, show the effectiveness and practicability of the proposed A-VCTM framework.
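
As a rough illustration of points (1) and (2), the sketch below shows one common way to combine soft attention over region features with a fusion step. It is not the authors' implementation: the dot-product scoring, the additive `tanh` fusion, the function names `attend` and `fuse`, and the toy vectors are all assumptions made for this example, since the abstract does not specify these details.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(regions, context):
    # Assumed scoring: dot product between each region feature and the context vector.
    scores = [sum(r * c for r, c in zip(region, context)) for region in regions]
    weights = softmax(scores)
    dim = len(regions[0])
    # Attended visual vector: attention-weighted sum of the region features.
    attended = [sum(w * region[d] for w, region in zip(weights, regions))
                for d in range(dim)]
    return weights, attended

def fuse(attended, context):
    # Hypothetical multimodal layer: element-wise sum followed by tanh.
    return [math.tanh(a + c) for a, c in zip(attended, context)]

# Toy data: three 2-dimensional region features and one context vector.
regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
context = [2.0, 0.0]

weights, attended = attend(regions, context)
fused = fuse(attended, context)
```

In this toy setup the first region aligns best with the context vector, so it receives the largest attention weight; a real captioning model would learn the scoring and fusion parameters and feed the fused vector to the word predictor.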

Keywords: Image captioning, Visual concepts transfer mechanism, Attention mechanism, Multimodal fusion

Paper URL: https://doi.org/10.1007/s11063-019-09978-8