Temporal modelling of first-person actions using hand-centric verb and object streams
Authors:
Highlights:
• We propose a new first-person action decomposition model with verb–noun components.
• We propose a multiscale hand-centric verb component using 3D ConvNet architectures.
• We introduce RNN-based fusion strategies to combine verb and object components.
• We show that RNN-based fusion strategies outperform count-based models.
• We show competitive results compared to SOTA models on the EGTEA Gaze+ dataset.
Keywords: First-person vision, Egocentric vision, Action recognition, Temporal models, RNN
Review history: Received 2 February 2020, Revised 6 June 2021, Accepted 11 August 2021, Available online 19 August 2021, Version of Record 26 August 2021.
Official paper URL: https://doi.org/10.1016/j.image.2021.116436