Enhancing the alignment between target words and corresponding frames for video captioning

Authors:

Highlights:

• Visual tags are introduced to bridge the gap between vision and language.

• A textual-temporal attention model is devised and incorporated into the decoder to build an exact alignment between target words and their corresponding frames (a minimal sketch follows this list).

• Extensive experiments on two well-known datasets, i.e., MSVD and MSR-VTT, demonstrate that our proposed approach achieves remarkable improvements over state-of-the-art methods.

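This page does not reproduce the authors' exact formulation of the textual-temporal attention. As a rough illustration only, below is a minimal PyTorch sketch of one plausible form: additive (Bahdanau-style) attention in which the decoder's textual state for the current target word weights the temporal frame features to produce a per-word visual context. The class name `TextualTemporalAttention`, all dimensions, and the additive scoring function are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextualTemporalAttention(nn.Module):
    """Hypothetical sketch: additive attention over temporal frame features,
    conditioned on the decoder's textual state at each word step.
    This is NOT the authors' published model; it only illustrates the idea
    of aligning each target word with its relevant frames."""

    def __init__(self, frame_dim: int, hidden_dim: int, attn_dim: int):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, attn_dim)   # project each frame feature
        self.state_proj = nn.Linear(hidden_dim, attn_dim)  # project the decoder state
        self.score = nn.Linear(attn_dim, 1)                # scalar alignment score per frame

    def forward(self, frames: torch.Tensor, state: torch.Tensor):
        # frames: (batch, num_frames, frame_dim) -- temporal CNN features of sampled frames
        # state:  (batch, hidden_dim)            -- decoder state when emitting the current word
        energy = torch.tanh(self.frame_proj(frames) + self.state_proj(state).unsqueeze(1))
        scores = self.score(energy).squeeze(-1)            # (batch, num_frames)
        weights = F.softmax(scores, dim=-1)                # soft word-to-frame alignment
        # weighted sum of frame features -> visual context for the current word
        context = torch.bmm(weights.unsqueeze(1), frames).squeeze(1)  # (batch, frame_dim)
        return context, weights


# Usage with made-up shapes: 26 sampled frames with 2048-d features, a 512-d decoder state.
attn = TextualTemporalAttention(frame_dim=2048, hidden_dim=512, attn_dim=256)
frames = torch.randn(4, 26, 2048)
state = torch.randn(4, 512)
context, weights = attn(frames, state)  # context: (4, 2048), weights: (4, 26)
```

At each decoding step the returned `weights` can be read as the alignment of the word being generated to the video's frames, and `context` would be fed back into the decoder alongside the word embedding; how the paper actually fuses these signals is not specified here.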

Keywords: Video captioning, Alignment, Visual tags, Textual-temporal attention

Article history: Received 15 April 2020; Revised 27 August 2020; Accepted 13 October 2020; Available online 14 October 2020; Version of Record 6 November 2020.

DOI: https://doi.org/10.1016/j.patcog.2020.107702