Multimodal video-text matching using a deep bifurcation network and joint embedding of visual and textual features
Authors:
Highlights:
• An efficient deep bifurcation network is proposed for video-text matching.
• A comprehensive set of visual and textual features is used to enhance performance.
• The features are transferred to a common semantic space for video-text matching (see the sketch after this list).
• The proposed features and architecture considerably enhance performance.
• The use of information from image captioning databases enhances the performance.
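The third highlight refers to projecting both modalities into a shared semantic space. As a rough illustration only, the sketch below shows a generic two-branch joint embedding with cosine similarity, assuming pre-extracted video and sentence feature vectors and hypothetical dimensions; it is not the paper's actual bifurcation network or feature set.

```python
# Minimal sketch of a two-branch joint embedding (illustrative assumption,
# not the paper's exact architecture): pre-extracted video and sentence
# feature vectors are projected into a shared space and compared with
# cosine similarity for video-caption retrieval.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, video_dim=2048, text_dim=768, embed_dim=512):
        super().__init__()
        # Hypothetical feature dimensions; the paper's own feature set differs.
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)

    def forward(self, video_feats, text_feats):
        # L2-normalize both projections so dot products are cosine similarities.
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return v @ t.T  # similarity matrix: rows = videos, columns = captions

model = JointEmbedding()
sims = model(torch.randn(4, 2048), torch.randn(4, 768))
print(sims.shape)  # torch.Size([4, 4])
```

Retrieval then amounts to ranking captions (or videos) by the rows (or columns) of this similarity matrix.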
Keywords: Video-text matching, Video-caption retrieval, Bifurcation network, Deep neural network
Article history: Received 5 June 2020, Revised 21 May 2021, Accepted 30 June 2021, Available online 5 July 2021, Version of Record 15 July 2021.
DOI: https://doi.org/10.1016/j.eswa.2021.115541