Multimodal video-text matching using a deep bifurcation network and joint embedding of visual and textual features

Authors:

Highlights:

• An efficient deep bifurcation network is proposed for video-text matching.

• A comprehensive set of visual and textual features is used to enhance performance.

• The features are transferred to a common semantic space for video-text matching (see the sketch after this list).

• The proposed features and architecture considerably enhance performance.

• The use of information from image captioning databases enhances the performance.
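Below is a minimal sketch of the general joint-embedding idea behind video-text matching: visual and textual features are projected into a common semantic space and matched by cosine similarity with a bidirectional ranking loss. This is an illustrative assumption, not the paper's exact bifurcation network; the feature dimensions, projection layers, and loss are hypothetical choices.

```python
# Hedged sketch: joint video-text embedding for matching.
# NOT the authors' architecture; dimensions and loss are assumed for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, video_dim=2048, text_dim=768, embed_dim=512):
        super().__init__()
        # Two branches map pooled visual and textual features into one space.
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)

    def forward(self, video_feats, text_feats):
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return v, t

def bidirectional_triplet_loss(v, t, margin=0.2):
    # Cosine similarity between every video and every caption in the batch.
    sim = v @ t.t()                    # (B, B)
    pos = sim.diag().unsqueeze(1)      # matched pairs lie on the diagonal
    # Hinge losses for caption retrieval (rows) and video retrieval (columns).
    cost_t = (margin + sim - pos).clamp(min=0)
    cost_v = (margin + sim - pos.t()).clamp(min=0)
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    return cost_t.masked_fill(mask, 0).mean() + cost_v.masked_fill(mask, 0).mean()

if __name__ == "__main__":
    model = JointEmbedding()
    video = torch.randn(8, 2048)   # e.g., mean-pooled frame-level CNN features
    text = torch.randn(8, 768)     # e.g., pooled sentence embeddings
    v, t = model(video, text)
    print(bidirectional_triplet_loss(v, t))
```

At retrieval time, captions (or videos) would simply be ranked by the cosine similarities in `sim`; the ranking loss only drives matched pairs above mismatched ones by the chosen margin.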

Abstract:


Keywords: Video-text matching, Video-caption retrieval, Bifurcation network, Deep neural network

Article history: Received 5 June 2020, Revised 21 May 2021, Accepted 30 June 2021, Available online 5 July 2021, Version of Record 15 July 2021.

DOI: https://doi.org/10.1016/j.eswa.2021.115541