Video captioning using boosted and parallel Long Short-Term Memory networks
Authors:
Highlights:
Abstract
Video captioning with deep learning is one of the most challenging problems in machine vision and artificial intelligence. In this paper, a new boosted and parallel architecture based on Long Short-Term Memory (LSTM) networks is proposed for video captioning. The proposed architecture comprises two LSTM layers and a word selection module. The first LSTM layer encodes frame features extracted by a pre-trained deep Convolutional Neural Network (CNN). The second layer, called the Boosted and Parallel LSTM (BP-LSTM) layer, uses several decoding LSTMs in a parallel, boosted arrangement; it is constructed by iteratively training LSTM networks with a variant of the AdaBoost algorithm during the training phase. During the testing phase, the outputs of the BP-LSTMs are combined concurrently by the word selection module using a maximum probability criterion. We evaluated the proposed algorithm on two well-known video captioning datasets and compared the results with state-of-the-art algorithms. The results show that the proposed architecture considerably improves the accuracy of the generated captions.
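As a rough illustration of the decoding scheme the abstract describes, the PyTorch sketch below shows one way several parallel decoder LSTMs could share an encoder state and have their per-step word distributions fused with a maximum-probability criterion. All class names, dimensions, token ids, and the greedy decoding loop are illustrative assumptions, not the authors' implementation; in particular, the AdaBoost-style iterative training of the decoders is only noted in comments and not shown.

```python
import torch
import torch.nn as nn

class BPLSTMCaptioner(nn.Module):
    """Hypothetical sketch of an encoder LSTM feeding parallel decoder
    LSTMs whose outputs are fused by a maximum-probability criterion."""

    def __init__(self, feat_dim=2048, hidden=512, vocab=10000, n_decoders=3):
        super().__init__()
        # Encoder LSTM over CNN frame features (feat_dim is assumed).
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab, hidden)
        # Parallel decoders; in the paper these are trained iteratively
        # with an AdaBoost variant (sample reweighting), not shown here.
        self.decoders = nn.ModuleList(
            nn.LSTM(hidden, hidden, batch_first=True) for _ in range(n_decoders)
        )
        self.heads = nn.ModuleList(
            nn.Linear(hidden, vocab) for _ in range(n_decoders)
        )

    @torch.no_grad()
    def caption(self, frame_feats, max_len=20, bos=1, eos=2):
        # frame_feats: (1, T, feat_dim) CNN features of the video frames.
        _, (h, c) = self.encoder(frame_feats)
        states = [(h.clone(), c.clone()) for _ in self.decoders]
        word, caption = torch.tensor([[bos]]), []
        for _ in range(max_len):
            probs = []
            for i, (dec, head) in enumerate(zip(self.decoders, self.heads)):
                out, states[i] = dec(self.embed(word), states[i])
                probs.append(torch.softmax(head(out[:, -1]), dim=-1))
            # Word selection module: take, per vocabulary entry, the
            # maximum probability across decoders, then the best word.
            fused = torch.stack(probs).max(dim=0).values
            word = fused.argmax(dim=-1, keepdim=True)
            if word.item() == eos:
                break
            caption.append(word.item())
        return caption
```

A usage call would pass a tensor of per-frame CNN features, e.g. `BPLSTMCaptioner()(...)` is not defined, but `model.caption(torch.randn(1, 30, 2048))` returns a list of word ids. Fusing with an element-wise maximum (rather than averaging) matches the abstract's "maximum probability criterion": the word favored most strongly by any one boosted decoder wins the step.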
Keywords:
Article history: Received 17 August 2018, Revised 17 February 2019, Accepted 2 October 2019, Available online 11 October 2019, Version of Record 15 November 2019.
DOI: https://doi.org/10.1016/j.cviu.2019.102840