Fact-based visual question answering via dual-process system

Authors:

Highlights:

Abstract

Fact-based visual question answering (FVQA) requires a model to answer questions using both the observed image and external knowledge. The key is to enable the agent to understand the question and the image and then reason over a knowledge base to find the correct answer. Building on the dual-process theory from cognitive science, this study proposes a framework for FVQA that coordinates a perception module (System 1) with an explicit reasoning module (System 2). Given a question and an image, System 1 first learns their joint representation, and System 2 then predicts the answer by reasoning over a fact graph and a semantic graph. Specifically, System 1 is implemented as a two-parallel BERT-style model, and System 2 as a graph neural network (GNN) with a dual-level attention mechanism. Experiments on two public datasets, FVQA and OK-VQA, show that the model outperforms the compared baselines. In addition to the answer, the proposed model provides an interpretation of its reasoning process.
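
The abstract does not specify how the dual-level attention is computed, so the following is only a minimal NumPy sketch under stated assumptions, not the authors' implementation. It assumes that System 1 has already produced a joint question-image vector (`query`), that the first attention level weights the nodes inside each graph, and that the second level weights the two graphs against each other. All function names, the dot-product scoring, and the final answer-scoring rule are hypothetical; there are no learned parameters and no GNN message passing.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def node_level_attention(nodes, query):
    """Level 1 (assumed): weight each node of one graph by its relevance to the query.

    nodes: (n, d) node features; query: (d,) joint question-image vector.
    Returns the attention weights (n,) and the weighted graph summary (d,).
    """
    scores = nodes @ query / np.sqrt(query.shape[0])
    weights = softmax(scores)
    return weights, weights @ nodes


def dual_level_attention(fact_nodes, semantic_nodes, query):
    """Level 2 (assumed): weight the fact-graph and semantic-graph summaries."""
    fact_w, fact_vec = node_level_attention(fact_nodes, query)
    sem_w, sem_vec = node_level_attention(semantic_nodes, query)
    graph_vecs = np.stack([fact_vec, sem_vec])            # (2, d)
    graph_w = softmax(graph_vecs @ query / np.sqrt(query.shape[0]))
    fused = graph_w @ graph_vecs                          # (d,)
    return fused, graph_w, fact_w, sem_w


def predict_answer(fact_nodes, semantic_nodes, query):
    """Pick the fact-graph node that best matches the fused context (hypothetical rule)."""
    fused, graph_w, fact_w, sem_w = dual_level_attention(
        fact_nodes, semantic_nodes, query
    )
    answer_scores = fact_nodes @ (fused + query)
    return int(np.argmax(answer_scores)), graph_w, fact_w, sem_w


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    query = rng.normal(size=8)                 # stand-in for the System 1 output
    fact_nodes = rng.normal(size=(5, 8))       # 5 candidate entities in the fact graph
    semantic_nodes = rng.normal(size=(4, 8))   # 4 nodes in the semantic graph
    idx, graph_w, fact_w, sem_w = predict_answer(fact_nodes, semantic_nodes, query)
    print("predicted fact node:", idx)
    print("graph-level weights (fact, semantic):", graph_w)
```

In a sketch like this, the returned node-level and graph-level weights are the natural candidates for inspecting which facts and which graph influenced the prediction, which is one plausible reading of the interpretability claim in the abstract.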

Keywords: Fact-based VQA, Dual-process theory, Multimodal transformer, Graph reasoning, Dual-level attention

Article history: Received 30 March 2021, Revised 30 July 2021, Accepted 24 October 2021, Available online 29 October 2021, Version of Record 20 December 2021.

Official paper URL: https://doi.org/10.1016/j.knosys.2021.107650