GCM: Efficient video recognition with glance and combine module

Authors:

Highlights:

• Glance and combine module (GCM), a highly efficient 3D spatio-temporal convolutional block, is proposed for video action recognition.

• GCM performs an extra glance at a higher scale to gain a broader perspective on spatio-temporal features, then combines the features from the different scales.

• Ablation studies show that the proposed GCM is much more efficient than other forms of 3D spatio-temporal convolutional blocks.

• On action recognition datasets, GCM achieves SOTA performance with less than two-thirds of the computational complexity of other models.

• On a fine-grained action recognition dataset, GCM beats the previous SOTA accuracy, achieved with two-stream methods, by more than 6% using only RGB input.

Keywords: Glance and combine module, Video action recognition, Spatio-temporal convolution, Action recognition datasets

Review history: Received 28 June 2021, Revised 8 July 2022, Accepted 10 August 2022, Available online 11 August 2022, Version of Record 24 August 2022.

Official paper URL: https://doi.org/10.1016/j.patcog.2022.108970