Hierarchical & multimodal video captioning: Discovering and transferring multimodal knowledge for vision to language

Authors:

Highlights:

Abstract

Recently, video captioning has achieved significant progress through advances in Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). Given a video, a deep learning approach is applied to encode the visual information and generate the corresponding caption. However, this direct visual-to-textual translation ignores rich intermediate descriptions, such as objects, scenes, and actions. In this paper, we propose to discover and integrate rich, primitive external knowledge (i.e., frame-based image captions) to benefit the video captioning task. We propose a Hierarchical & Multimodal Video Caption (HMVC) model that jointly learns the dynamics within both the visual and textual modalities, and that infers a sentence of arbitrary length from an input video with an arbitrary number of frames. Specifically, a latent semantic discovery module transfers external knowledge to generate complementary semantic cues. We comprehensively evaluate the HMVC model on the Microsoft Video Description Corpus (MSVD), the MPII Movie Description Dataset (MPII-MD), and the dataset for the 2016 MSR Video to Text challenge (MSR-VTT), and attain competitive performance. In addition, we evaluate the generalization properties of the proposed model by fine-tuning and evaluating it on different datasets. To the best of our knowledge, this is the first time such an analysis has been performed for the video captioning task.
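To make the encoder-decoder pattern the abstract describes concrete, below is a minimal sketch of a two-stream video captioner: a temporal RNN encodes per-frame CNN features (handling an arbitrary number of frames), a second RNN encodes frame-level caption embeddings standing in for the transferred external knowledge, and the fused state conditions an RNN decoder that emits a sentence of arbitrary length. This is not the authors' HMVC implementation; all module names, dimensions, and the fusion scheme are illustrative assumptions.

```python
# Hypothetical sketch of the CNN/RNN encoder-decoder video-captioning pattern,
# with an auxiliary textual stream for frame-level caption knowledge.
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    def __init__(self, vocab_size, feat_dim=512, hidden=256, embed=128):
        super().__init__()
        # Visual stream: per-frame CNN features -> temporal RNN encoder.
        self.frame_rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
        # Textual stream: encodes frame-caption embeddings (external knowledge).
        self.text_rnn = nn.LSTM(embed, hidden, batch_first=True)
        # Fuse the two modalities into a single video representation.
        self.fuse = nn.Linear(2 * hidden, hidden)
        # Decoder: generates the output sentence word by word.
        self.word_embed = nn.Embedding(vocab_size, embed)
        self.decoder = nn.LSTM(embed, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frame_feats, text_embeds, captions):
        # frame_feats: (B, T_frames, feat_dim); T_frames may vary across runs.
        _, (h_vis, _) = self.frame_rnn(frame_feats)
        _, (h_txt, _) = self.text_rnn(text_embeds)
        h0 = torch.tanh(self.fuse(torch.cat([h_vis[-1], h_txt[-1]], dim=-1)))
        c0 = torch.zeros_like(h0)
        # Teacher forcing: decode conditioned on the fused video state.
        emb = self.word_embed(captions)                      # (B, T_words, embed)
        dec, _ = self.decoder(emb, (h0.unsqueeze(0), c0.unsqueeze(0)))
        return self.out(dec)                                 # (B, T_words, vocab)

# Smoke test with random tensors standing in for CNN features and embeddings.
model = VideoCaptioner(vocab_size=1000)
logits = model(torch.randn(2, 30, 512),          # 30 frames of CNN features
               torch.randn(2, 30, 128),          # frame-caption embeddings
               torch.randint(0, 1000, (2, 12)))  # ground-truth word ids
print(logits.shape)  # torch.Size([2, 12, 1000])
```

At inference time, the decoder would instead be unrolled step by step (greedy or beam search) from a start token, which is what permits output sentences of arbitrary length.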

Keywords:

Article history: Received 14 September 2016, Revised 11 April 2017, Accepted 27 April 2017, Available online 8 May 2017, Version of Record 23 November 2017.

DOI: https://doi.org/10.1016/j.cviu.2017.04.013