Generalized pyramid co-attention with learnable aggregation net for video question answering

Authors:

Highlights:

• To handle the complexity of videos in V-VQA, we propose a generalized pyramid co-attention mechanism with diversity learning that explicitly encourages accurate and diverse attention maps. We explore two instantiations of this generalized module: Multi-path Pyramid Co-attention with diversity learning (MPC) and Cascaded Pyramid Transformer Co-attention with diversity learning (CPTC). This strategy helps capture distinct, complementary, and informative features (a hedged sketch of such a diversity term follows the highlights).

• To aggregate sequential features without destroying their distributions or temporal information, we propose a new learnable aggregation component. It imitates the Bag-of-Words (BoW) quantization mechanism to automatically aggregate adaptively weighted frame-level (or word-level) features (see the second sketch after the highlights).

• We extensively evaluate the overall model on two publicly available datasets (i.e., TGIF-QA and TVQA) for the V-VQA task. The experimental results demonstrate that our model outperforms the existing state of the art by a large margin and that our extended CPTC performs better than MPC. Code and models have been released at: https://github.com/lixiangpengcs/LAD-Net-for-VideoQA.
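The first highlight describes a diversity-learning term that pushes parallel (or cascaded) co-attention paths toward non-overlapping attention maps. Below is a minimal sketch of one common way to implement such a regularizer, penalizing pairwise cosine similarity between the attention maps of different paths. The function name, tensor shapes, and loss weight are illustrative assumptions, not the paper's exact formulation; the released repository above contains the authors' actual implementation.

```python
import torch
import torch.nn.functional as F

def attention_diversity_loss(attn_maps: torch.Tensor) -> torch.Tensor:
    """Hypothetical diversity regularizer: penalizes overlap between the
    attention maps produced by parallel co-attention paths, so each path
    attends to distinct frames (or words).

    attn_maps: (P, B, T) -- P paths, batch size B, T time steps,
    softmax-normalized along the last dimension.
    """
    num_paths = attn_maps.size(0)
    # Unit-normalize each path's attention vector per sample.
    a = F.normalize(attn_maps, p=2, dim=-1)            # (P, B, T)
    # Pairwise cosine similarity between paths for every sample.
    sim = torch.einsum('pbt,qbt->bpq', a, a)           # (B, P, P)
    # Keep only off-diagonal entries (similarity between different paths).
    eye = torch.eye(num_paths, device=attn_maps.device)
    return (sim * (1.0 - eye)).pow(2).mean()

# Illustrative usage: add the term to the task loss with a small weight.
# total_loss = answer_loss + 0.1 * attention_diversity_loss(stacked_maps)
```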

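The second highlight describes a learnable, BoW-like aggregation of frame-level (or word-level) features. The sketch below assumes a soft-assignment scheme: each time step is softly assigned to a set of learnable codewords, and features are pooled per codeword with those adaptive weights rather than by a plain mean or max over time. `SoftBoWAggregator`, `num_codewords`, and the exact pooling are hypothetical choices for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftBoWAggregator(nn.Module):
    """Soft Bag-of-Words-style pooling over a feature sequence."""

    def __init__(self, feat_dim: int, num_codewords: int = 32):
        super().__init__()
        # K learnable codewords playing the role of a BoW codebook.
        self.codewords = nn.Parameter(torch.randn(num_codewords, feat_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D) frame-level (or word-level) features.
        assign = F.softmax(x @ self.codewords.t(), dim=-1)  # (B, T, K)
        pooled = assign.transpose(1, 2) @ x                 # (B, K, D)
        # Concatenate the K codeword slots into one clip-level vector.
        return pooled.flatten(start_dim=1)                  # (B, K*D)
```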

Keywords: Video question answering, Diversity learning, Learnable aggregation, Cascaded pyramid transformer co-attention

Article history: Received 21 November 2019; Revised 28 February 2021; Accepted 27 June 2021; Available online 30 June 2021; Version of Record 25 July 2021.

DOI: https://doi.org/10.1016/j.patcog.2021.108145