Exploiting spatio-temporal representation for 3D human action recognition from depth map sequences

Authors:

Highlights:

Abstract

Human action recognition based on 3D data is attracting increasing attention because it can provide richer spatial and temporal information than RGB videos. The challenge for depth-map-based methods is to capture the cues between spatial appearance and temporal motion. In this paper, we propose a straightforward and efficient framework for modeling human actions from depth map sequences, considering both short-term and long-term dependencies. A frame-level feature, termed the depth-oriented gradient vector (DOGV), is developed to capture appearance and motion over a short-term duration. For long-term dependencies, we construct a convolutional neural network (CNN) based backbone to aggregate frame-level features across space and time. The proposed method is comprehensively evaluated on four public benchmark datasets: NTU RGB+D, NTU RGB+D 120, PKU-MMD and UOW LSC. The experimental results demonstrate that the proposed approach solves the problem of 3D human action recognition efficiently and achieves state-of-the-art performance.
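Below is a minimal, illustrative sketch of the two-stage pipeline the abstract describes, not the authors' implementation: per-frame spatial depth gradients combined with a short-term temporal difference stand in for the DOGV-style frame-level feature, and a small CNN backbone aggregates the per-frame maps over the whole sequence. The layer sizes, the gradient/difference formulation, and names such as `frame_level_descriptor` and `num_classes` are assumptions made for illustration.

```python
# Illustrative sketch only; the exact DOGV computation and backbone
# architecture in the paper may differ.
import numpy as np
import torch
import torch.nn as nn

def frame_level_descriptor(depth_seq: np.ndarray) -> np.ndarray:
    """depth_seq: (T, H, W) depth maps -> (T, 3, H, W) gradient/motion maps."""
    feats = []
    for t in range(len(depth_seq)):
        d = depth_seq[t].astype(np.float32)
        gy, gx = np.gradient(d)                       # spatial depth gradients
        prev = depth_seq[max(t - 1, 0)].astype(np.float32)
        motion = d - prev                             # short-term temporal change
        feats.append(np.stack([gx, gy, motion]))
    return np.stack(feats)                            # (T, 3, H, W)

class SequenceBackbone(nn.Module):
    """Tiny CNN that pools frame-level maps into a clip-level prediction."""
    def __init__(self, num_classes: int = 60):       # 60 classes assumed (NTU RGB+D)
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (T, 3, H, W); average frame embeddings over time.
        per_frame = self.cnn(frame_feats)             # (T, 64)
        return self.classifier(per_frame.mean(dim=0, keepdim=True))

# Usage on a dummy 16-frame depth clip.
clip = np.random.rand(16, 64, 64)
feats = torch.from_numpy(frame_level_descriptor(clip))
logits = SequenceBackbone(num_classes=60)(feats)      # shape (1, 60)
```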

Keywords: 3D human action recognition, Depth map sequences, Short-term modeling, Depth-oriented gradient vector

Review history: Received 6 May 2020, Revised 25 February 2021, Accepted 9 April 2021, Available online 26 May 2021, Version of Record 5 June 2021.

Paper URL: https://doi.org/10.1016/j.knosys.2021.107040