Dynamic self-attention with vision synchronization networks for video question answering

Authors:

Highlights:

• A novel token selection mechanism based on the dynamic self-attention network is proposed to automatically extract important video features.

• A vision synchronization network is proposed to align appearance and motion features at the time slice level.

• Extensive experiments and analysis confirm the superiority of the proposed model DSAVS.

Keywords: Video question answering, Dynamic self-attention, Vision synchronization

Review history: Received 21 May 2021, Revised 14 July 2022, Accepted 7 August 2022, Available online 12 August 2022, Version of Record 19 August 2022.

Paper URL: https://doi.org/10.1016/j.patcog.2022.108959