Visual question answering: a state-of-the-art review

Authors: Sruthy Manmadhan, Binsu C. Kovoor

Abstract

Visual question answering (VQA) is a task that has received considerable attention from two major research communities: computer vision and natural language processing. It has recently become widely accepted as an AI-complete task that can serve as an alternative to the visual Turing test. In its most common form, it is a challenging multimodal task in which a computer must provide the correct answer to a natural language question asked about an input image. It has attracted many deep learning researchers following their remarkable achievements in text, voice, and vision technologies. This review extensively and critically examines the current status of VQA research in terms of step-by-step solution methodologies, datasets, and evaluation metrics. Finally, the paper discusses future research directions for each of these aspects of VQA separately.

Keywords: Visual question answering, Computer vision, Natural language processing, Deep learning

Paper URL: https://doi.org/10.1007/s10462-020-09832-7