Linguistically-aware attention for reducing the semantic gap in vision-language tasks
Highlights:
• Proposal of a generic Linguistically-aware Attention (LAT) mechanism to reduce the semantic gap between modalities in vision-language tasks (a hedged illustrative sketch follows this list).
• Proposal of a novel Counting-VQA model that achieves state-of-the-art results on five counting-specific VQA datasets.
• Adaptation of LAT into several state-of-the-art VQA models, including UpDn, MUREL, and BAN; LAT improves the performance of all of them.
• Adaptation of LAT into the best-performing object-level-attention captioning model (UpDn); incorporating LAT improves the captioning performance of this baseline.
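The highlights name LAT but do not describe its internals, so the following is only a minimal, hypothetical sketch of what a linguistically-aware attention layer could look like: object-level region features are scored against a pooled question embedding, and the regions are softmax-weighted into a single attended visual feature. The class name, the dimensions (2048-d regions, 300-d word embeddings), and the additive scoring form are illustrative assumptions, not the paper's actual formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LinguisticallyAwareAttention(nn.Module):
    # Hedged, generic sketch only: NOT the paper's exact LAT.
    # All names and dimensions below are illustrative assumptions.
    def __init__(self, visual_dim: int = 2048, lang_dim: int = 300, hidden_dim: int = 512):
        super().__init__()
        self.proj_v = nn.Linear(visual_dim, hidden_dim)  # project region features
        self.proj_q = nn.Linear(lang_dim, hidden_dim)    # project language features
        self.score = nn.Linear(hidden_dim, 1)            # scalar attention logit per region

    def forward(self, regions: torch.Tensor, question: torch.Tensor) -> torch.Tensor:
        # regions:  (batch, num_regions, visual_dim) object-level features
        # question: (batch, lang_dim) pooled question/word embedding
        joint = torch.tanh(self.proj_v(regions) + self.proj_q(question).unsqueeze(1))
        alpha = F.softmax(self.score(joint), dim=1)      # (batch, num_regions, 1)
        return (alpha * regions).sum(dim=1)              # attended visual feature

# Toy usage with random features
if __name__ == "__main__":
    att = LinguisticallyAwareAttention()
    v = torch.randn(2, 36, 2048)  # e.g. 36 object-detector region features
    q = torch.randn(2, 300)       # e.g. a GloVe-style pooled question embedding
    print(att(v, q).shape)        # torch.Size([2, 2048])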
Keywords: Attention models, Visual question answering, Counting in visual question answering, Image captioning
Article history: Received 12 March 2020; Revised 14 December 2020; Accepted 26 December 2020; Available online 1 January 2021; Version of Record 8 January 2021.
DOI: https://doi.org/10.1016/j.patcog.2020.107812