On the role of question encoder sequence model in robust visual question answering

Authors:

Highlights:

• The question-encoder sequence model contributes significantly to VQA models overfitting to the language biases of the training set, which reduces their performance on Out-of-Distribution test sets.

• A comprehensive study of how existing RNN-based and Transformer-based question encoders affect Out-of-Distribution performance in VQA.

• Proposal of GAT-QE, a novel question encoder for VQA that is more resilient to language biases and improves Out-of-Distribution performance even without additional bias-mitigation approaches.

Keywords: Visual question answering, Out-of-distribution performance, Gated recurrent unit, Transformer, Graph attention network

Article history: Received 21 December 2021, Revised 21 June 2022, Accepted 29 June 2022, Available online 3 July 2022, Version of Record 9 July 2022.

Official paper page: https://doi.org/10.1016/j.patcog.2022.108883