Learning visual and textual representations for multimodal matching and classification

Authors:

Highlights:

• A unified network for image-text matching and classification.

• Seamless integration of the matching and classification components.

• A multi-stage training algorithm that combines the matching and classification losses.

• A comprehensive study of the effectiveness of the proposed approach.

• Comparisons on four well-known multimodal benchmarks.

Keywords: Vision and language, Multimodal matching, Multimodal classification, Deep learning

Review history: Received 29 August 2017, Revised 22 May 2018, Accepted 1 July 2018, Available online 2 July 2018, Version of Record 10 July 2018.

Official paper URL: https://doi.org/10.1016/j.patcog.2018.07.001