Residual attention unit for action recognition
Abstract
3D CNNs are powerful tools for action recognition that can intuitively extract spatio-temporal features from raw videos. However, most existing 3D CNNs do not fully account for the adverse effects of background motion, which frequently appears in videos. Background motion is often misclassified as part of the human action, which undermines modeling of the action's dynamic pattern. In this paper, we propose the residual attention unit (RAU) to address this problem. RAU suppresses background motion by upweighting the values associated with the foreground region in the feature maps. Specifically, RAU contains two separate submodules in parallel: spatial attention and channel-wise attention. Given an intermediate feature map, the spatial attention works in a bottom-up top-down manner to generate an attention mask, while the channel-wise attention automatically recalibrates the feature responses of all channels. Since applying the attention mechanism directly to the input features may discard discriminative information, we add a shortcut connection between the input and output of the attention module to preserve the integrity of the original features. Notably, RAU can be easily embedded into 3D CNNs and trained end-to-end along with the network. Experimental results on UCF101 and HMDB51 demonstrate the effectiveness of RAU.
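To make the described mechanism concrete, below is a minimal PyTorch sketch of such a unit. This is not the authors' implementation: the module names (`RAU`, `SpatialAttention3D`, `ChannelAttention3D`), the squeeze-and-excitation form of the channel branch, the single pool/upsample stage of the bottom-up top-down branch, and summing the two branch outputs with the identity shortcut are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class ChannelAttention3D(nn.Module):
    """Channel recalibration via global pooling + gating (hypothetical sketch)."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)  # squeeze over (T, H, W)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),  # per-channel weights in (0, 1)
        )

    def forward(self, x):
        n, c = x.shape[:2]
        w = self.fc(self.pool(x).view(n, c)).view(n, c, 1, 1, 1)
        return x * w  # recalibrate channel responses


class SpatialAttention3D(nn.Module):
    """Bottom-up top-down spatial mask with one pool/upsample stage
    (hypothetical sketch; assumes even T, H, W)."""

    def __init__(self, channels):
        super().__init__()
        self.down = nn.MaxPool3d(kernel_size=2, stride=2)     # bottom-up
        self.conv = nn.Conv3d(channels, channels, 3, padding=1)
        self.up = nn.Upsample(scale_factor=2, mode="trilinear",
                              align_corners=False)            # top-down
        self.mask = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=1),
            nn.Sigmoid(),  # soft attention mask over space-time
        )

    def forward(self, x):
        m = self.up(self.conv(self.down(x)))
        return x * self.mask(m)  # upweight foreground regions


class RAU(nn.Module):
    """Residual attention unit: parallel attention branches plus an
    identity shortcut that preserves the original features."""

    def __init__(self, channels):
        super().__init__()
        self.spatial = SpatialAttention3D(channels)
        self.channel = ChannelAttention3D(channels)

    def forward(self, x):
        # shortcut keeps the input intact; attention branches reweight it
        return x + self.spatial(x) + self.channel(x)


if __name__ == "__main__":
    x = torch.randn(2, 64, 8, 56, 56)  # (N, C, T, H, W) feature map
    print(RAU(64)(x).shape)            # torch.Size([2, 64, 8, 56, 56])
```

Because the unit preserves the input and output shapes, it could in principle be dropped between any two stages of a 3D CNN backbone, which matches the paper's claim that RAU embeds easily into existing networks and trains end-to-end.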
Article history: Received 9 November 2018, Revised 4 September 2019, Accepted 9 September 2019, Available online 20 September 2019, Version of Record 1 November 2019.
DOI: https://doi.org/10.1016/j.cviu.2019.102821