Temporal modelling of first-person actions using hand-centric verb and object streams

Authors:

Highlights:

• We propose a new first-person action decomposition model with verb–noun components.

• We propose a multiscale hand-centric verb component using 3D ConvNet architectures.

• We introduce RNN-based fusion strategies to combine verb and object components.

• We show that RNN-based fusion strategies outperform count-based models.

• We show competitive results compared to SOTA models on the EGTEA Gaze+ dataset.


Keywords: First-person vision, Egocentric vision, Action recognition, Temporal models, RNN

Review history: Received 2 February 2020, Revised 6 June 2021, Accepted 11 August 2021, Available online 19 August 2021, Version of Record 26 August 2021.

Paper URL: https://doi.org/10.1016/j.image.2021.116436