Long term spatio-temporal modeling for action detection

Authors:

Highlights:

Abstract

Modeling person interactions with their surroundings has proven to be effective for recognizing and localizing human actions in videos. While most recent works focus on learning short-term interactions, in this work we consider long-term person interactions and jointly localize the actions of multiple actors over an entire video shot. We construct a graph whose nodes correspond to keyframe actor instances and connect them with two edge types. Spatial edges connect actors within a keyframe, and temporal edges connect multiple instances of the same actor over a video shot. We propose a Graph Neural Network that explicitly models spatial and temporal states for each person instance and learns to combine information from both modalities effectively, making predictions for all instances at the same time. We conduct experiments on the AVA dataset and show that our graph-based model provides consistent improvements over several video descriptors, achieving state-of-the-art performance without any fine-tuning.
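
The graph structure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `build_actor_graph`, the `(keyframe_index, actor_id)` input format, and the choice to link only consecutive instances of an actor with temporal edges are all assumptions made for illustration. The abstract says only that temporal edges connect multiple instances of the same actor over a shot.

```python
from collections import defaultdict
from itertools import combinations


def build_actor_graph(detections):
    """Build spatial and temporal edge lists over actor instances.

    detections: list of (keyframe_index, actor_id) tuples (hypothetical format);
    the position of each tuple in the list is used as its node id.
    """
    by_frame = defaultdict(list)  # keyframe index -> node ids in that keyframe
    by_actor = defaultdict(list)  # actor id -> node ids of that actor
    for node, (frame, actor) in enumerate(detections):
        by_frame[frame].append(node)
        by_actor[actor].append(node)

    # Spatial edges: every pair of actor instances within the same keyframe.
    spatial = [pair for nodes in by_frame.values() for pair in combinations(nodes, 2)]

    # Temporal edges: consecutive instances of the same actor, ordered by keyframe
    # (an assumption; linking all pairs of an actor's instances is another option).
    temporal = []
    for nodes in by_actor.values():
        ordered = sorted(nodes, key=lambda n: detections[n][0])
        temporal.extend(zip(ordered, ordered[1:]))
    return spatial, temporal


# Two actors tracked over three keyframes; actor "b" leaves after keyframe 1.
detections = [(0, "a"), (0, "b"), (1, "a"), (1, "b"), (2, "a")]
spatial_edges, temporal_edges = build_actor_graph(detections)
```

In this toy shot, the spatial edges pair the two actors in keyframes 0 and 1, and the temporal edges chain each actor's instances across keyframes. A graph neural network would then pass messages along the two edge lists separately.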

Keywords:

Review history: Received 14 September 2020, Revised 6 June 2021, Accepted 9 June 2021, Available online 18 June 2021, Version of Record 15 July 2021.

Official paper URL: https://doi.org/10.1016/j.cviu.2021.103242