Reasoning like Humans: On Dynamic Attention Prior in Image Captioning

Authors:

Highlights:

Abstract

Attention-based models have been widely used in image captioning. Nevertheless, most conventional deep attention models perform attention operations for each block/step independently, neglecting the prior knowledge accumulated in previous steps. In this paper, we propose a novel method, DYnamic Attention PRior (DY-APR), which combines attention distribution priors with local linguistic context for caption generation. Like human beings, DY-APR gradually shifts its attention from a multitude of objects to the one of keen interest when processing an image of a complex scene. DY-APR first captures rough information and then explicitly updates the attention weights step by step. Moreover, DY-APR fully leverages local linguistic context from previously generated tokens; that is, it capitalizes on local information when performing global attention, a scheme we refer to as "local-global attention". We show that the prior knowledge from previous steps provides meaningful semantic information, serving as guidance toward more accurate attention in later layers. Experiments on the MS-COCO dataset demonstrate the effectiveness of DY-APR, improving CIDEr-D by 2.32% with less than 0.2% additional FLOPs and parameters.
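The abstract describes two ingredients: carrying an attention-distribution prior forward across steps, and enriching attention queries with local linguistic context from preceding tokens. The sketch below is a minimal illustration of that idea, not the authors' implementation: it assumes the prior is simply the previous layer's attention map, blended into the current attention through a learned gate, and that "local context" is a short mean-pooled window over preceding token embeddings (both are hypothetical design choices).

```python
# Minimal sketch of a dynamic-attention-prior layer (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicPriorAttention(nn.Module):
    """Single-head cross-attention whose weights are refined by a prior
    attention distribution carried over from the previous layer/step."""

    def __init__(self, dim: int, local_window: int = 3):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.local_window = local_window
        # Gate controlling how strongly the prior reshapes the new attention (assumed).
        self.gate = nn.Linear(dim, 1)

    def forward(self, tokens, visual_feats, prior_attn=None):
        # tokens:       (B, T, D)  embeddings of previously generated words
        # visual_feats: (B, R, D)  image region features
        # prior_attn:   (B, T, R)  attention distribution from the previous layer, if any
        B, T, D = tokens.shape

        # Local linguistic context: mean over a short window of preceding tokens
        # (a hypothetical aggregation; the paper may combine context differently).
        ctx = torch.stack(
            [tokens[:, max(0, t - self.local_window): t + 1].mean(dim=1) for t in range(T)],
            dim=1,
        )  # (B, T, D)

        q = self.q_proj(tokens + ctx)          # local context enriches the global query
        k = self.k_proj(visual_feats)
        v = self.v_proj(visual_feats)

        logits = torch.matmul(q, k.transpose(-1, -2)) / D ** 0.5   # (B, T, R)
        attn = F.softmax(logits, dim=-1)

        if prior_attn is not None:
            # Blend the fresh attention with the prior from the previous step, so the
            # distribution sharpens gradually instead of being recomputed from scratch.
            g = torch.sigmoid(self.gate(tokens))                   # (B, T, 1)
            attn = g * attn + (1.0 - g) * prior_attn
            attn = attn / attn.sum(dim=-1, keepdim=True)

        out = torch.matmul(attn, v)             # (B, T, D) attended visual context
        return out, attn                        # attn can serve as the prior for the next layer
```

Stacking several such layers and feeding each layer's returned `attn` into the next as `prior_attn` gives the step-by-step refinement behaviour the abstract describes, at the cost of only the small gating projection (consistent with the reported sub-0.2% overhead).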

Keywords: Image captioning, Attention, Prior knowledge, Linguistic context

Article history: Received 29 March 2021, Revised 9 June 2021, Accepted 14 July 2021, Available online 16 July 2021, Version of Record 21 July 2021.

DOI: https://doi.org/10.1016/j.knosys.2021.107313