Dual self-attention with co-attention networks for visual question answering

Authors:

Highlights:

• A novel model based on the self-attention mechanism is proposed to learn more effective multi-modal representations.

• The DSACA model is proposed to capture the internal dependencies and cross-modal correlation between the image and question sentence.

• Extensive experiments and analysis confirm the superiority of the proposed DSACA.
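The highlights refer to the self-attention mechanism as the basis of the DSACA model. As illustration only, the sketch below implements generic scaled dot-product self-attention in NumPy; it is not the authors' DSACA architecture, and the function name, dimensions, and projection matrices are hypothetical.

```python
import numpy as np

def scaled_dot_product_self_attention(x, wq, wk, wv):
    """Generic scaled dot-product self-attention over a sequence x.

    Illustrative sketch only, not the paper's DSACA model.
    x:           (seq_len, d_model) input features (e.g. image regions or words)
    wq, wk, wv:  (d_model, d_k) learned projection matrices (here random)
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    # Pairwise dependency scores between all positions, scaled by sqrt(d_k)
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Numerically stable softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output position is a weighted mix of all value vectors
    return weights @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                      # 5 tokens/regions, 8-dim features
wq, wk, wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_self_attention(x, wq, wk, wv)
print(out.shape)  # (5, 8)
```

In this formulation every position attends to every other position in the same modality, which is how self-attention captures the "internal dependencies" the highlights mention; the paper's co-attention additionally lets one modality (e.g. the question) attend to the other (the image).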

Keywords: Self-attention, Visual-textual co-attention, Visual question answering

Article history: Received 6 November 2019; Revised 30 November 2020; Accepted 18 March 2021; Available online 9 April 2021; Version of Record 16 April 2021.

DOI: https://doi.org/10.1016/j.patcog.2021.107956