Self-supervised video representation learning by maximizing mutual information

Authors:

Highlights:

• We propose DVIM, a novel self-supervised task for learning representations from unlabeled videos.

• We design a neural network architecture for DVIM, comprising a feature extractor and a network that maximizes mutual information.

• Experimental results demonstrate that DVIM is an effective pre-training method for action recognition in videos.

• Experiments on action similarity labeling demonstrate that the representations learned by DVIM transfer to other visual tasks.

Keywords: Self-supervised learning, Deep learning, Video representation, Mutual information, Action recognition

Review history: Received 7 November 2019, Revised 14 June 2020, Accepted 2 August 2020, Available online 12 August 2020, Version of Record 17 August 2020.

Paper URL: https://doi.org/10.1016/j.image.2020.115967