Dual-stream cross-modality fusion transformer for RGB-D action recognition
Abstract:
RGB-D-based action recognition can achieve accurate and robust performance thanks to the rich complementary information across modalities, and thus has many application scenarios. However, existing works either combine the modalities by late fusion or learn multimodal representations with simple feature-level fusion methods, which fail to effectively exploit complementary semantic information or model the interactions between unimodal features. In this paper, we design a self-attention-based modal enhancement module (MEM) and a cross-attention-based modal interaction module (MIM) to enhance and fuse RGB and depth features. Moreover, a novel bottleneck excitation feed-forward block (BEF) is proposed to enhance the expressive ability of the model with only a few extra parameters and little computational overhead. Integrating these two modules with BEFs yields one basic fusion layer of the cross-modality fusion transformer. We apply the transformer on top of dual-stream convolutional neural networks (ConvNets) to build a dual-stream cross-modality fusion transformer (DSCMT) for RGB-D action recognition. Extensive experiments on the NTU RGB+D 120, PKU-MMD, and THU-READ datasets verify the effectiveness and superiority of the DSCMT. Furthermore, the DSCMT still brings considerable improvements when the convolutional backbones are changed or when it is applied to different multimodal combinations, indicating its universality and scalability. The code is available at https://github.com/liuzwin98/DSCMT.
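To make the fusion layer concrete, below is a minimal PyTorch sketch of how one such layer could be wired: per-modality self-attention for the MEM, cross-attention between the RGB and depth streams for the MIM, and a gated bottleneck MLP for the BEF. The abstract only names these modules, so every internal design choice here (token shapes, head count, gating, normalization placement, and the class/argument names) is an illustrative assumption, not the paper's actual implementation.

```python
# Illustrative sketch only: module names follow the abstract (MEM, MIM, BEF),
# but all shapes, wiring, and hyperparameters are assumptions for intuition.
import torch
import torch.nn as nn


class BEF(nn.Module):
    """Hypothetical bottleneck excitation feed-forward block: a low-rank
    (bottlenecked) MLP with a sigmoid 'excitation' gate, adding few parameters."""

    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.down = nn.Linear(dim, dim // reduction)   # bottleneck projection
        self.up = nn.Linear(dim // reduction, dim)     # restore dimension
        self.act = nn.GELU()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())  # excitation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.up(self.act(self.down(x)))
        return x + self.gate(x) * h                    # gated residual update


class FusionLayer(nn.Module):
    """One cross-modality fusion layer: self-attention MEM per modality,
    cross-attention MIM between modalities, each stream refined by a BEF."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.mem_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mem_depth = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mim_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mim_depth = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.bef_rgb = BEF(dim)
        self.bef_depth = BEF(dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor):
        # MEM: enhance each stream with self-attention (residual connection)
        rgb = rgb + self.mem_rgb(rgb, rgb, rgb, need_weights=False)[0]
        depth = depth + self.mem_depth(depth, depth, depth, need_weights=False)[0]
        # MIM: each stream queries the other modality via cross-attention
        rgb2 = rgb + self.mim_rgb(rgb, depth, depth, need_weights=False)[0]
        depth2 = depth + self.mim_depth(depth, rgb, rgb, need_weights=False)[0]
        # BEF: lightweight feed-forward refinement of each fused stream
        return self.bef_rgb(self.norm(rgb2)), self.bef_depth(self.norm(depth2))


if __name__ == "__main__":
    # Token sequences as they might come from two ConvNet backbones,
    # e.g. 49 spatial tokens of dimension 256 per modality (assumed shapes).
    rgb_tokens = torch.randn(2, 49, 256)
    depth_tokens = torch.randn(2, 49, 256)
    fused_rgb, fused_depth = FusionLayer()(rgb_tokens, depth_tokens)
    print(fused_rgb.shape, fused_depth.shape)  # both torch.Size([2, 49, 256])
```

In a DSCMT-style model, several such layers would be stacked on top of the two ConvNet streams, keeping separate RGB and depth token sequences so that each modality is both enhanced on its own and informed by the other before classification.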
Keywords: Action recognition, Multimodal fusion, Transformer, ConvNets, Dual-stream
Article history: Received 8 May 2022, Revised 16 August 2022, Accepted 17 August 2022, Available online 22 August 2022, Version of Record 6 September 2022.
DOI: https://doi.org/10.1016/j.knosys.2022.109741