Weakly Supervised Moment Localization with Decoupled Consistent Concept Prediction

Authors: Fan Ma, Linchao Zhu, Yi Yang

Abstract

Localizing moments in a video via natural language queries is a challenging task in which models must identify the start and end timestamps of the moment described by the query. However, obtaining temporal endpoint annotations is labor-intensive. In this paper, we focus on a weakly supervised setting, where the temporal endpoints of moments are not available during training. We develop a decoupled consistent concept prediction (DCCP) framework to learn the relations between videos and query texts. Specifically, atomic objects and actions are decoupled from the query text to facilitate recognizing these concepts in videos. We introduce a concept pairing module to temporally localize the objects and actions in the video, and propose a classification loss and a concept consistency loss to leverage the mutual benefits of object and action cues for building relations between language and video. Extensive experiments on DiDeMo, Charades-STA, and ActivityNet Captions demonstrate the effectiveness of our model.
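The core idea of decoupling can be illustrated with a toy sketch: score each video segment against an object embedding and an action embedding separately, then encourage the two temporal score distributions to agree, since both cues should point at the same moment. This is a minimal illustration, not the authors' implementation; the features, embeddings, and the use of a symmetric KL divergence as the consistency term are assumptions made for the example.

```python
import numpy as np

def concept_scores(segment_feats, concept_emb):
    """Cosine similarity between each video segment feature and one concept embedding."""
    seg = segment_feats / np.linalg.norm(segment_feats, axis=1, keepdims=True)
    c = concept_emb / np.linalg.norm(concept_emb)
    return seg @ c  # shape: (num_segments,)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def consistency_loss(obj_scores, act_scores):
    """Symmetric KL between the temporal attention induced by the object
    concept and by the action concept: both cues should agree on where
    the queried moment occurs (an illustrative stand-in for the paper's
    concept consistency loss)."""
    p, q = softmax(obj_scores), softmax(act_scores)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * (kl(p, q) + kl(q, p))

# Toy example: 5 video segments with 4-dim features; the "object" and
# "action" embeddings are random placeholders for illustration only.
rng = np.random.default_rng(0)
segments = rng.normal(size=(5, 4))
obj_emb, act_emb = rng.normal(size=4), rng.normal(size=4)
obj_s = concept_scores(segments, obj_emb)
act_s = concept_scores(segments, act_emb)
loss = consistency_loss(obj_s, act_s)
print(loss)
```

Minimizing such a loss pushes the object and action score profiles toward the same segments; in the weakly supervised setting this agreement substitutes for the missing endpoint annotations.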

Keywords: Video moment localization, Language query, Weakly supervised learning


DOI: https://doi.org/10.1007/s11263-022-01600-0