Learning visual and textual representations for multimodal matching and classification

Authors:

Highlights:

• A unified network for image-text matching and classification.

• Seamlessly incorporating the matching and classification components.

• A multi-stage training algorithm that combines the matching and classification losses.

• A comprehensive study of the effectiveness of the proposed approach.

• Comparisons on four well-known multimodal benchmarks.

Keywords: Vision and language, Multimodal matching, Multimodal classification, Deep learning

Review history: Received 29 August 2017, Revised 22 May 2018, Accepted 1 July 2018, Available online 2 July 2018, Version of Record 10 July 2018.

Official paper URL: https://doi.org/10.1016/j.patcog.2018.07.001