Gesture recognition based on deep deformable 3D convolutional neural networks

Authors:

Highlights:

• We articulate the differences between gesture recognition and general action recognition in three aspects: 1) background context provides useful cues for action recognition, but is useless and even harmful for gesture recognition; 2) the motion of the hands and arms plays a more crucial role in gesture recognition than in action recognition; 3) gesture recognition is more sensitive to computational complexity than action recognition, as it is mainly used in real-time human-computer interaction systems. We therefore argue that gesture recognition is a fine-grained classification task: the model needs to focus more on the spatial appearance and temporal motion of the hands and arms.

• We design a light-weight spatiotemporal deformable convolution module that enables free-form deformation of the sampling grid of a convolution kernel in both the spatial and temporal dimensions (an illustrative sketch of this idea is given after the highlights). The proposed model achieves state-of-the-art performance on three challenging datasets: EgoGesture, Jester and Chalearn.

• We provide the insight that plugging the spatiotemporal deformable convolution module into higher-level layers yields a larger benefit than plugging it into lower-level layers.

• We propose a spatiotemporal data augmentation method that randomly generates diverse data samples within a spatiotemporal cube (see the cropping sketch below), which proves effective for model training.
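
To give a concrete picture of the second highlight, the following is a minimal, simplified PyTorch-style sketch of spatiotemporal deformable sampling. It is an assumption-based illustration, not the authors' implementation: it predicts a single (dx, dy, dt) offset per output voxel with a light-weight offset branch, resamples the feature volume by trilinear interpolation, and then applies an ordinary 3D convolution, whereas the paper's module deforms the sampling grid of the convolution kernel itself. The class name `DeformableSampling3d` and all hyperparameters are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableSampling3d(nn.Module):
    """Simplified stand-in for a spatiotemporal deformable convolution block."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # Light-weight offset branch: 3 channels = (dx, dy, dt) per output voxel,
        # ordered to match F.grid_sample's (x, y, z) convention.
        self.offset_conv = nn.Conv3d(channels, 3, kernel_size, padding=pad)
        nn.init.zeros_(self.offset_conv.weight)  # start from identity sampling
        nn.init.zeros_(self.offset_conv.bias)
        self.conv = nn.Conv3d(channels, channels, kernel_size, padding=pad)

    def forward(self, x):  # x: (N, C, T, H, W)
        n, _, t, h, w = x.shape
        offsets = self.offset_conv(x)  # (N, 3, T, H, W), offsets in voxel units
        # Base (identity) sampling grid in normalized [-1, 1] coordinates.
        zs = torch.linspace(-1, 1, t, device=x.device)
        ys = torch.linspace(-1, 1, h, device=x.device)
        xs = torch.linspace(-1, 1, w, device=x.device)
        gz, gy, gx = torch.meshgrid(zs, ys, xs, indexing="ij")
        base = torch.stack((gx, gy, gz), dim=-1)            # (T, H, W, 3)
        base = base.unsqueeze(0).expand(n, -1, -1, -1, -1)  # (N, T, H, W, 3)
        # Convert voxel offsets to normalized coordinates and deform the grid.
        scale = torch.tensor([2.0 / max(w - 1, 1),
                              2.0 / max(h - 1, 1),
                              2.0 / max(t - 1, 1)], device=x.device)
        grid = base + offsets.permute(0, 2, 3, 4, 1) * scale
        deformed = F.grid_sample(x, grid, mode="bilinear", align_corners=True)
        return self.conv(deformed)
```

In line with the third highlight, such a block would typically be inserted after a higher-level stage of a 3D CNN backbone, where the paper reports the gain to be larger.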
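Similarly, the spatiotemporal cube augmentation in the last highlight can be pictured as randomly cropping a sub-cube from a clip along time, height and width and rescaling it to the network's fixed input size. The sketch below is an assumed interpretation; the function name, crop ranges and output size are illustrative rather than the paper's exact procedure.

```python
import random
import torch
import torch.nn.functional as F

def random_spatiotemporal_crop(clip, out_t=16, out_h=112, out_w=112,
                               scale_range=(0.7, 1.0)):
    """clip: float tensor (C, T, H, W) -> augmented tensor (C, out_t, out_h, out_w)."""
    _, t, h, w = clip.shape
    # Pick a random sub-cube: a temporal segment plus a spatial window.
    s = random.uniform(*scale_range)
    ct, ch, cw = max(2, int(t * s)), max(8, int(h * s)), max(8, int(w * s))
    t0 = random.randint(0, t - ct)
    y0 = random.randint(0, h - ch)
    x0 = random.randint(0, w - cw)
    cube = clip[:, t0:t0 + ct, y0:y0 + ch, x0:x0 + cw]
    # Resize the cube back to the fixed input size (trilinear over T, H, W).
    cube = F.interpolate(cube.unsqueeze(0), size=(out_t, out_h, out_w),
                         mode="trilinear", align_corners=False)
    return cube.squeeze(0)
```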

Keywords: Gesture recognition, Spatiotemporal deformable convolution, Spatiotemporal convolutional neural network

Article history: Received 31 May 2019, Revised 24 April 2020, Accepted 29 April 2020, Available online 24 June 2020, Version of Record 6 July 2020.

DOI: https://doi.org/10.1016/j.patcog.2020.107416