Generalized pyramid co-attention with learnable aggregation net for video question answering

Authors:

Highlights:

• To handle the complexity of videos in V-VQA, we propose a generalized pyramid co-attention mechanism with diversity learning that explicitly encourages accurate and diverse attention maps. We explore two instantiations of this generalized module: Multi-path Pyramid Co-attention with diversity learning (MPC) and Cascaded Pyramid Transformer Co-attention with diversity learning (CPTC). This strategy helps capture distinct, complementary, and informative features (a hedged sketch of such a diversity term follows the highlights).

• To aggregate sequential features without destroying their distributions or temporal information, we propose a new learnable aggregation component. It imitates the Bag-of-Words (BoW) quantization mechanism to automatically aggregate adaptively weighted frame-level (or word-level) features (see the second sketch after the highlights).

• We extensively evaluate the overall model on two publicly available datasets (i.e., TGIF-QA and TVQA) for the V-VQA task. The experimental results demonstrate that our model outperforms the existing state of the art by a large margin and that our extended CPTC performs better than MPC. Code and models have been released at: https://github.com/lixiangpengcs/LAD-Net-for-VideoQA.
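The first highlight describes a diversity-learning term that pushes parallel (or cascaded) co-attention paths toward non-overlapping attention maps. Below is a minimal sketch of one common way to implement such a regularizer, penalizing pairwise cosine similarity between the attention maps of different paths. The function name, tensor shapes, and loss weight are illustrative assumptions, not the paper's exact formulation; the released repository above contains the authors' actual implementation.

```python
import torch
import torch.nn.functional as F

def attention_diversity_loss(attn_maps: torch.Tensor) -> torch.Tensor:
    """Hypothetical diversity regularizer: penalizes overlap between the
    attention maps produced by parallel co-attention paths, so each path
    attends to distinct frames (or words).

    attn_maps: (P, B, T) -- P paths, batch size B, T time steps,
    softmax-normalized along the last dimension.
    """
    num_paths = attn_maps.size(0)
    # Unit-normalize each path's attention vector per sample.
    a = F.normalize(attn_maps, p=2, dim=-1)            # (P, B, T)
    # Pairwise cosine similarity between paths for every sample.
    sim = torch.einsum('pbt,qbt->bpq', a, a)           # (B, P, P)
    # Keep only off-diagonal entries (similarity between different paths).
    eye = torch.eye(num_paths, device=attn_maps.device)
    return (sim * (1.0 - eye)).pow(2).mean()

# Illustrative usage: add the term to the task loss with a small weight.
# total_loss = answer_loss + 0.1 * attention_diversity_loss(stacked_maps)
```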

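The second highlight describes a learnable, BoW-like aggregation of frame-level (or word-level) features. The sketch below assumes a soft-assignment scheme: each time step is softly assigned to a set of learnable codewords, and features are pooled per codeword with those adaptive weights rather than by a plain mean or max over time. `SoftBoWAggregator`, `num_codewords`, and the exact pooling are hypothetical choices for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftBoWAggregator(nn.Module):
    """Soft Bag-of-Words-style pooling over a feature sequence."""

    def __init__(self, feat_dim: int, num_codewords: int = 32):
        super().__init__()
        # K learnable codewords playing the role of a BoW codebook.
        self.codewords = nn.Parameter(torch.randn(num_codewords, feat_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D) frame-level (or word-level) features.
        assign = F.softmax(x @ self.codewords.t(), dim=-1)  # (B, T, K)
        pooled = assign.transpose(1, 2) @ x                 # (B, K, D)
        # Concatenate the K codeword slots into one clip-level vector.
        return pooled.flatten(start_dim=1)                  # (B, K*D)
```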

Keywords: Video question answering, Diversity learning, Learnable aggregation, Cascaded pyramid transformer co-attention

Article history: Received 21 November 2019; Revised 28 February 2021; Accepted 27 June 2021; Available online 30 June 2021; Version of Record 25 July 2021.

DOI: https://doi.org/10.1016/j.patcog.2021.108145