Learning visual and textual representations for multimodal matching and classification
Authors:
Highlights:
• A unified network for image-text matching and classification.
• Seamless integration of the matching and classification components.
• A multi-stage training algorithm that combines the matching and classification losses.
• A comprehensive study of the effectiveness of the proposed approach.
• Comparisons on four well-known multimodal benchmarks.
Keywords: Vision and language, Multimodal matching, Multimodal classification, Deep learning
Review history: Received 29 August 2017, Revised 22 May 2018, Accepted 1 July 2018, Available online 2 July 2018, Version of Record 10 July 2018.
Paper URL: https://doi.org/10.1016/j.patcog.2018.07.001