KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for visual commonsense reasoning

Authors:

Highlights:

Abstract

Reasoning is a critical ability for complete visual understanding. To develop machines with cognition-level visual understanding and reasoning abilities, the visual commonsense reasoning (VCR) task has been introduced. In VCR, given a challenging question about an image, a machine must answer correctly and then provide a rationale justifying its answer. Methods that adopt the powerful BERT model as the backbone for learning joint representations of image content and natural language have shown promising improvements on VCR. However, none of the existing methods has utilized commonsense knowledge in visual commonsense reasoning, which we believe will be greatly helpful in this task. Therefore, we incorporate commonsense knowledge into the cross-modal BERT and propose a novel Knowledge Enhanced Visual-and-Linguistic BERT (KVL-BERT for short) model. Besides taking visual and linguistic contents as input, the model integrates external commonsense knowledge into the multi-layer Transformer. To preserve the structural information and semantic representation of the original sentence, we propose an algorithm called RMGSR (Relative-position-embedding and Mask-self-attention Guided Semantic Representations). Our KVL-BERT outperforms both other task-specific models and general task-agnostic pre-training models.

Keywords: Visual commonsense reasoning, Multimodal BERT, Commonsense knowledge integration

Review history: Received 23 November 2020, Revised 7 July 2021, Accepted 16 August 2021, Available online 19 August 2021, Version of Record 24 August 2021.

Paper URL: https://doi.org/10.1016/j.knosys.2021.107408