DCT-net: A deep co-interactive transformer network for video temporal grounding

Authors:

Highlights:

Abstract

Language-guided video temporal grounding aims to temporally localize the segment of an untrimmed long video that best matches a given natural language query (sentence). The main challenge of this task lies in how to fuse visual and linguistic information effectively. Recent works have shown that the attention mechanism benefits multi-modal feature fusion. In this paper, we present a concise yet effective Deep Co-Interactive Transformer Network (DCT-Net), which repurposes a Transformer-style architecture to sufficiently model cross-modality interactions. It consists of Co-Interactive Transformer (CIT) layers cascaded in depth to perform multi-step interactions between a video-sentence pair. With the help of the proposed CIT layer, the visual and language features mutually refine and benefit from each other. Extensive experiments on two public datasets, i.e., ActivityNet-Caption and TACOS, demonstrate the effectiveness of our proposed model compared to state-of-the-art methods.
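To make the high-level description above more concrete, the following is a minimal PyTorch-style sketch of one possible co-interactive layer (bidirectional cross-attention between clip features and word features, each stream followed by a feed-forward block) stacked in depth. This is an illustrative reconstruction under stated assumptions, not the authors' released implementation; the class names, feature dimension, head count, layer count, and normalization placement are all assumptions.

```python
# Minimal sketch of a co-interactive transformer layer and a cascaded stack.
# Assumptions: 512-d features, 8 attention heads, post-norm residual blocks.
import torch
import torch.nn as nn


class CoInteractiveTransformerLayer(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8, ffn_dim: int = 2048):
        super().__init__()
        # Cross-attention in both directions: video queries attend to words, and vice versa.
        self.v2s_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.s2v_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn_v = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, dim))
        self.ffn_s = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, dim))
        self.norm_v1, self.norm_v2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.norm_s1, self.norm_s2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, video: torch.Tensor, sent: torch.Tensor):
        # video: (B, T, dim) clip features; sent: (B, L, dim) word features.
        v_attn, _ = self.v2s_attn(query=video, key=sent, value=sent)
        s_attn, _ = self.s2v_attn(query=sent, key=video, value=video)
        video = self.norm_v1(video + v_attn)
        sent = self.norm_s1(sent + s_attn)
        video = self.norm_v2(video + self.ffn_v(video))
        sent = self.norm_s2(sent + self.ffn_s(sent))
        return video, sent


class CoInteractiveStack(nn.Module):
    """Cascade of co-interactive layers for multi-step video-sentence interaction."""

    def __init__(self, num_layers: int = 3, dim: int = 512):
        super().__init__()
        self.layers = nn.ModuleList(CoInteractiveTransformerLayer(dim) for _ in range(num_layers))

    def forward(self, video: torch.Tensor, sent: torch.Tensor):
        for layer in self.layers:
            video, sent = layer(video, sent)
        return video, sent


if __name__ == "__main__":
    video = torch.randn(2, 128, 512)  # 2 videos, 128 clip features each
    sent = torch.randn(2, 20, 512)    # 2 queries, 20 word features each
    fused_v, fused_s = CoInteractiveStack()(video, sent)
    print(fused_v.shape, fused_s.shape)  # (2, 128, 512) (2, 20, 512)
```

The fused video features would then feed a grounding head (e.g., boundary or moment scoring), which is outside the scope of this sketch.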

Keywords: Video temporal grounding, Co-interactive transformer, Multi-modal feature fusion

Article history: Received 12 March 2021, Accepted 10 April 2021, Available online 21 April 2021, Version of Record 28 April 2021.

DOI: https://doi.org/10.1016/j.imavis.2021.104183