Visual question answering: a state-of-the-art review
Authors: Sruthy Manmadhan, Binsu C. Kovoor
Abstract
Visual question answering (VQA) is a task that has received considerable attention from two major research communities: computer vision and natural language processing. It has recently been widely accepted as an AI-complete task that can serve as an alternative to the visual Turing test. In its most common form, it is a challenging multi-modal task in which a computer must provide the correct answer to a natural language question asked about an input image. It has attracted many deep learning researchers following their remarkable achievements in text, voice, and vision technologies. This review extensively and critically examines the current status of VQA research in terms of step-by-step solution methodologies, datasets, and evaluation metrics. Finally, the paper discusses future research directions for each of these aspects of VQA separately.
Keywords: Visual question answering, Computer vision, Natural language processing, Deep learning
Paper URL: https://doi.org/10.1007/s10462-020-09832-7