GCM: Efficient video recognition with glance and combine module
Authors:
Highlights:
• The glance and combine module (GCM), a highly efficient 3D spatio-temporal convolutional block, is proposed for video action recognition.
• GCM takes an extra glance at a higher scale to get a broader perspective of spatio-temporal features, then combines the features from the different scales.
• Ablation studies show that the proposed GCM is much more efficient than other forms of 3D spatio-temporal convolutional blocks.
• On action recognition datasets, GCM achieves SOTA performance with less than two-thirds the computational complexity of other models.
• On a fine-grained action recognition dataset, GCM beats the previous SOTA accuracy achieved with 2-stream methods by more than 6% using only RGB input.
Keywords: Glance and combine module, Video action recognition, Spatio-temporal convolution, Action recognition datasets
Article history: Received 28 June 2021, Revised 8 July 2022, Accepted 10 August 2022, Available online 11 August 2022, Version of Record 24 August 2022.
Paper URL: https://doi.org/10.1016/j.patcog.2022.108970