Enhancing the alignment between target words and corresponding frames for video captioning
Authors:
Highlights:
• Visual tags are introduced to bridge the gap between vision and language.
• A textual-temporal attention model is devised and incorporated into the decoder to establish precise alignment between target words and their corresponding frames (an illustrative sketch follows this list).
• Extensive experiments on two well-known datasets, i.e., MSVD and MSR-VTT, demonstrate that our proposed approach achieves remarkable improvements over state-of-the-art methods.
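The second highlight names a textual-temporal attention module in the decoder, but the paper's definition is not reproduced on this page. The following is therefore only a minimal sketch of what such a module could look like: a standard additive (Bahdanau-style) attention over frame features whose query combines the decoder hidden state with the embedding of the previously generated word, so that the attended frames depend on the textual context as well as the decoding step. All class names, dimensions, and design choices below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TextualTemporalAttention(nn.Module):
    """Illustrative sketch (not the paper's method): additive attention
    over per-frame features, conditioned on both the decoder hidden state
    (temporal cue) and the previous word embedding (textual cue)."""

    def __init__(self, frame_dim, hidden_dim, embed_dim, attn_dim):
        super().__init__()
        self.w_frame = nn.Linear(frame_dim, attn_dim, bias=False)
        self.w_hidden = nn.Linear(hidden_dim, attn_dim, bias=False)
        self.w_word = nn.Linear(embed_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, frames, hidden, prev_word_emb):
        # frames:        (batch, num_frames, frame_dim) per-frame features
        # hidden:        (batch, hidden_dim) current decoder state
        # prev_word_emb: (batch, embed_dim) embedding of the previous word
        query = self.w_hidden(hidden) + self.w_word(prev_word_emb)
        # Broadcast the query across all frames, score each frame.
        scores = self.v(torch.tanh(self.w_frame(frames) + query.unsqueeze(1)))
        weights = torch.softmax(scores, dim=1)       # soft alignment over frames
        context = (weights * frames).sum(dim=1)      # (batch, frame_dim)
        return context, weights.squeeze(-1)

# Example usage with made-up dimensions:
attn = TextualTemporalAttention(frame_dim=2048, hidden_dim=512, embed_dim=300, attn_dim=256)
ctx, w = attn(torch.randn(4, 26, 2048), torch.randn(4, 512), torch.randn(4, 300))
```

The returned weights give a per-frame alignment distribution for the word being generated; the context vector would then be fed into the decoder's next-word prediction. How the paper actually combines the textual and temporal cues may differ from this additive formulation.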
Abstract:
Keywords: Video captioning, Alignment, Visual tags, Textual-temporal attention
Article history: Received 15 April 2020; Revised 27 August 2020; Accepted 13 October 2020; Available online 14 October 2020; Version of Record 6 November 2020.
DOI: https://doi.org/10.1016/j.patcog.2020.107702