Visual Skeleton and Reparative Attention for Part-of-Speech image captioning system

Authors:

Highlights:

Abstract

In this paper, we propose a novel system for image caption generation that adapts the language model for word generation to the specific syntactic structure of sentences and the visual skeleton of an image. Moreover, it is capable of capturing the attentive regions of the image that correspond to the words being predicted. Our system consists of three modules: a Visual Skeleton Vector (VSV) generation module, a Visual Reparative Attention (VRA) mechanism, and a Part-of-Speech (POS) language model. We adopt Faster R-CNN to generate an objective sentence that reflects the visual skeleton (the salient objects and their relations) of an image, and we propose an encoder–decoder model for generating the VSV of each image, which is used to initialize our language model. VRA is a two-step strategy that analyzes what and where our visual perception system should attend to. In addition, we adopt the Stanford Parser to work out the syntactic structure of sentences and explicitly feed this syntactic information into our POS language model. Each of these modules strengthens the relationship between textual and visual information. The effectiveness of the entire system is verified on two classical datasets (MSCOCO and Flickr30k). Our system is on par with or better than the compared state-of-the-art published methods and achieves superior performance on the COCO captioning leaderboard.
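The two-step "what and where" attention described above can be illustrated with a minimal sketch. This is not the authors' implementation: the dimensions, weight matrix `W_q`, and the way the POS embedding is combined with the decoder state are all hypothetical stand-ins, using NumPy in place of a deep-learning framework and random vectors in place of real Faster R-CNN region features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): k image regions,
# d-dimensional region features, h-dimensional decoder state.
k, d, h = 5, 8, 8

regions = rng.normal(size=(k, d))   # stand-in for Faster R-CNN region features
hidden = rng.normal(size=(h,))      # decoder hidden state at the current step
pos_embed = rng.normal(size=(h,))   # embedding of the current POS tag

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Step 1 ("what"): fuse the hidden state with POS information into a query,
# so the attention is conditioned on the syntactic role of the next word.
W_q = rng.normal(size=(h, d))       # hypothetical projection weights
query = np.tanh((hidden + pos_embed) @ W_q)

# Step 2 ("where"): score each region against the query and attend.
scores = regions @ query
alpha = softmax(scores)             # attention weights over the k regions
context = alpha @ regions           # attended visual context vector

print(alpha)  # the weights form a distribution over regions
```

The `context` vector would then be fed into the language model when predicting the next word; in the paper's setting, nouns would be expected to attend to object regions and verbs to relational regions.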

Keywords:

Article history: Received 3 November 2018, Revised 15 April 2019, Accepted 9 September 2019, Available online 30 September 2019, Version of Record 1 November 2019.

DOI: https://doi.org/10.1016/j.cviu.2019.102819