Weakly supervised action segmentation with effective use of attention and self-attention


Abstract

This paper presents a novel hybrid sequence-to-sequence model that generates the sequence of actions, in chronological order, performed over the course of the longer activity in a given video. At test time, our models assign an action to each frame using only weak supervision. We evaluate several sequence-to-sequence models for this task and demonstrate that they can solve action segment generation on three challenging action recognition datasets. We show how to use self-attention and standard attention mechanisms with known sequence-to-sequence models for weakly supervised video action segmentation. Our new architecture combines recurrent and transformer-based sequence-to-sequence models: Transformer and GRU encoders encode temporal information, and self-attention and standard attention are applied during decoding. We further introduce an effective positional weight prior that improves action segmentation performance. Using this architecture, the two types of attention, and the positional weight prior, we obtain state-of-the-art results for weakly supervised action segmentation on the Breakfast and 50Salads datasets.
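To make the decoding idea concrete, the following is a minimal, stdlib-only sketch of scaled dot-product attention over encoded frame features, with an additive positional weight prior biasing the attention scores toward expected temporal positions. The function names and the additive form of the prior are illustrative assumptions, not the paper's exact formulation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend_with_prior(query, keys, values, prior):
    """One decoder step of attention over encoder outputs.

    query:  decoder state vector
    keys:   one key vector per encoded frame
    values: one value vector per encoded frame
    prior:  positional weight prior, one additive bias per frame
            (illustrative: the paper's prior may enter differently)
    Returns the context vector for this decoding step.
    """
    d = len(query)
    # Scaled dot-product scores, shifted by the positional prior.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) + p
              for key, p in zip(keys, prior)]
    weights = softmax(scores)
    # Context = attention-weighted sum of the value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]
```

With a flat (zero) prior and identical keys, the context is the mean of the values; a large prior on one frame pulls the context toward that frame's value, which is how a positional prior can steer the decoder toward temporally plausible segments.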


Article history: Received 16 December 2020, Revised 21 September 2021, Accepted 1 October 2021, Available online 12 October 2021, Version of Record 28 October 2021.

DOI: https://doi.org/10.1016/j.cviu.2021.103298