Video question answering via grounded cross-attention network learning

Authors:

Highlights:

• We study the problem of video question answering from the viewpoint of modeling both a rough video representation and a grounded video representation. A joint question-video representation, built from the rough and grounded representations of the video, is learned for answer prediction.

• We propose the grounded cross-attention network learning framework (GCANet), a novel hierarchical cross-attention method with a Q-O cross-attention layer and a Q-V-H cross-attention layer. GCANet adopts a novel mutual attention learning mechanism.

• We construct two large-scale datasets for video question answering. Extensive experiments validate the effectiveness of our method.
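The highlights describe hierarchical cross-attention layers in which one modality (e.g., the question) attends over another (e.g., object or video features). The paper's exact GCANet layers are not specified in this abstract, so the following is only a generic sketch of scaled dot-product cross-attention of the kind such a Q-O layer would build on; the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context):
    """Scaled dot-product cross-attention: each query vector attends
    over all context vectors and returns a weighted sum of them.

    query:   (n_q, d) e.g. question token features
    context: (n_c, d) e.g. object or frame features
    """
    d = query.shape[-1]
    scores = query @ context.T / np.sqrt(d)   # (n_q, n_c) similarity
    weights = softmax(scores, axis=-1)        # rows sum to 1
    attended = weights @ context              # (n_q, d) context summary per query
    return attended, weights

# Hypothetical example: 3 question tokens attend over 5 object features.
rng = np.random.default_rng(0)
q_feats = rng.standard_normal((3, 8))
obj_feats = rng.standard_normal((5, 8))
attended, weights = cross_attention(q_feats, obj_feats)
```

A hierarchical design such as the one the highlights describe would stack layers of this form, e.g. feeding the question-attended object summary into a second cross-attention over video-level features.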


Keywords: Visual information retrieval, Video question answering, Cross-attention

Article history: Received 31 December 2019, Revised 28 February 2020, Accepted 7 April 2020, Available online 16 April 2020, Version of Record 16 April 2020.

Paper link: https://doi.org/10.1016/j.ipm.2020.102265