Dual self-attention with co-attention networks for visual question answering
Authors:
Highlights:
• A novel model based on the self-attention mechanism is proposed to learn more effective multi-modal representations.
• The DSACA model is proposed to capture the internal dependencies and cross-modal correlation between the image and question sentence.
• Extensive experiments and analysis confirm the superiority of the proposed DSACA.
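The highlights state that DSACA builds on the self-attention mechanism. As background, a minimal sketch of standard scaled dot-product self-attention is shown below; this is an illustration of the general mechanism, not the paper's DSACA architecture, and all shapes and variable names here are assumptions.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X of shape (n, d).

    Wq, Wk, Wv project X into queries, keys, and values; each position
    attends to every other position, capturing internal dependencies
    within one modality (e.g. image regions or question words).
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Similarity of every query to every key, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Row-wise softmax to get attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Weighted sum of values: each output mixes information from all positions
    return weights @ V

# Toy example: 5 positions with 8-dimensional features
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8)
```

A co-attention variant would instead compute queries from one modality and keys/values from the other, so that, for example, question words attend over image regions.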
Keywords: Self-attention, Visual-textual co-attention, Visual question answering
Review history: Received 6 November 2019, Revised 30 November 2020, Accepted 18 March 2021, Available online 9 April 2021, Version of Record 16 April 2021.
DOI: https://doi.org/10.1016/j.patcog.2021.107956