Intra- and Inter-modal Multilinear Pooling with Multitask Learning for Video Grounding

Authors: Zhou Yu, Yijun Song, Jun Yu, Meng Wang, Qingming Huang

Abstract

Video grounding aims to temporally localize an action in an untrimmed video referred to by a natural-language query, and plays an important role in fine-grained video understanding. Given temporal proposals of limited granularity, the task is challenging in that it requires effectively fusing multimodal features from queries and videos and accurately localizing the referred action. For multimodal feature fusion, we present an Intra- and Inter-modal Multilinear pooling (IIM) model that effectively combines the multimodal features while considering both intra- and inter-modal feature interactions. Compared to existing multimodal fusion models, IIM captures high-order interactions and is better suited to modeling the temporal information of videos. For action localization, we propose a simple yet effective multitask learning framework that simultaneously predicts the action label, alignment score, and refined location in an end-to-end manner. Experimental results on the real-world TaCoS and Charades-STA datasets demonstrate the superiority of the proposed approach over existing state-of-the-art methods.
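To make the fusion idea concrete, below is a minimal NumPy sketch of combining intra- and inter-modal interactions via low-rank bilinear (multilinear) pooling in the style of factorized bilinear models. All dimensions, weight matrices, and the exact combination scheme here are illustrative assumptions, not the paper's actual IIM architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def low_rank_bilinear(x, y, Wx, Wy, k):
    """Low-rank bilinear pooling (hypothetical sketch):
    project both inputs, take the elementwise product, sum-pool
    groups of k factors, then apply power and l2 normalization."""
    z = (Wx @ x) * (Wy @ y)                 # (d*k,) factorized interaction
    z = z.reshape(-1, k).sum(axis=1)        # sum-pooling -> (d,)
    z = np.sign(z) * np.sqrt(np.abs(z))     # signed square-root (power norm)
    return z / (np.linalg.norm(z) + 1e-12)  # l2 normalization

dv, dq, d, k = 16, 12, 8, 4   # toy feature and factor dimensions
v = rng.normal(size=dv)       # video proposal feature (assumed given)
q = rng.normal(size=dq)       # language query feature (assumed given)

# Inter-modal pooling: video-query interactions.
Wv = rng.normal(size=(d * k, dv))
Wq = rng.normal(size=(d * k, dq))
inter = low_rank_bilinear(v, q, Wv, Wq, k)

# Intra-modal pooling: video-video and query-query interactions.
Wv2 = rng.normal(size=(d * k, dv))
Wq2 = rng.normal(size=(d * k, dq))
intra_v = low_rank_bilinear(v, v, Wv, Wv2, k)
intra_q = low_rank_bilinear(q, q, Wq, Wq2, k)

# One simple way to form the joint representation: concatenation.
fused = np.concatenate([inter, intra_v, intra_q])
print(fused.shape)
```

In a full model, `fused` would feed the multitask heads (action classification, alignment scoring, and location regression); here it simply shows how intra- and inter-modal pooled features can be assembled into one representation.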

Keywords: Video grounding, Multimodal learning, Multimedia data analysis, Deep learning


Paper URL: https://doi.org/10.1007/s11063-020-10205-y