Spatio-temporal attention mechanisms based model for collective activity recognition

Authors:

Highlights:

Abstract

Collective activity recognition, which involves multiple people acting and interacting in a shared scene, is a widely studied but challenging problem in computer vision. The key to this task is how to efficiently model the spatial configurations and temporal evolution of collective activities. In this paper we propose a model based on spatio-temporal attention mechanisms to exploit spatial configurations and temporal dynamics in collective scenes. We design spatio-temporal attention mechanisms built from both deep RGB features and articulated human poses to capture the spatio-temporal evolution of individual actions and the collective activity. Benefiting from these attention mechanisms, our model learns to spatially capture the unbalanced person–group interactions of each person while updating each individual's state based on these interactions, and to temporally assess the reliability of different video frames when predicting the final label of the collective activity. Furthermore, long-range temporal variability and consistency are handled by a two-stage Gated Recurrent Unit (GRU) network. Finally, to ensure effective training, we jointly optimize losses at both the person and group levels to drive the learning process. Experimental results indicate that our method outperforms the state-of-the-art on the Volleyball dataset. Additional ablation experiments and visualizations demonstrate the effectiveness and practicality of the proposed model.
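The abstract describes temporally weighting video frames by their reliability before predicting the group label. A minimal NumPy sketch of that idea is given below; the scoring vector `w` and the simple dot-product scoring form are illustrative assumptions for exposition, not the paper's exact formulation (which builds attention from RGB features and poses on top of a two-stage GRU).

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - np.max(x))
    return e / e.sum()

def temporal_attention_pool(frame_feats, w):
    """Attention-weighted pooling over frames.

    frame_feats : (T, D) per-frame group-level features
                  (e.g. GRU hidden states; hypothetical here)
    w           : (D,) learned scoring vector (illustrative assumption)
    Returns the attention weights and the pooled video-level feature.
    """
    scores = frame_feats @ w            # one reliability score per frame
    alpha = softmax(scores)             # weights sum to 1 across frames
    pooled = frame_feats.T @ alpha      # (D,) weighted video-level feature
    return alpha, pooled
```

Frames whose features score higher against `w` receive larger weights, so unreliable frames (occlusion, motion blur) contribute less to the final group-activity prediction.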

Keywords: Multi-person activity recognition, Spatio-temporal model, Attention mechanisms, Multi-modal data, Gated Recurrent Units (GRUs) network

Article history: Received 13 March 2018, Revised 6 December 2018, Accepted 26 February 2019, Available online 28 February 2019, Version of Record 7 March 2019.

DOI: https://doi.org/10.1016/j.image.2019.02.012