VLDeformer: Vision–Language Decomposed Transformer for fast cross-modal retrieval

Authors:

Highlights:

• Addresses the efficiency–accuracy trade-off between one-stream and two-stream CMR paradigms.

• Combines the two-stream models' efficiency with the one-stream models' accuracy.

• Significantly outperforms two-stream methods of similar efficiency.

• Achieves over 1000× acceleration with less than a 0.6% average recall drop.
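The speedup claimed above comes from the structural difference between the two paradigms: a two-stream model encodes images and texts independently, so gallery embeddings can be precomputed and a query is answered with a single similarity lookup, whereas a one-stream model must run a joint encoder over every (query, candidate) pair at retrieval time. The toy numpy sketch below illustrates only this cost asymmetry; the encoder weights and the joint scoring function are placeholders, not the paper's VLDeformer architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 64, 1000  # toy embedding dimension and gallery size

# Two-stream style: independent encoders (random stand-ins for learned ones).
W_img = rng.standard_normal((D, D))
W_txt = rng.standard_normal((D, D))

gallery = rng.standard_normal((N, D))        # raw image features
gallery_emb = gallery @ W_img                # precomputed once, offline
gallery_emb /= np.linalg.norm(gallery_emb, axis=1, keepdims=True)

def two_stream_retrieve(text_feat):
    """Query cost: one matrix-vector product against the prebuilt index."""
    q = text_feat @ W_txt
    q /= np.linalg.norm(q)
    return int(np.argmax(gallery_emb @ q))

def one_stream_score(text_feat, image_feat):
    """Joint scoring: must run per (text, image) pair at query time.
    A real one-stream model would be a cross-attention transformer here."""
    pair = np.concatenate([text_feat, image_feat])
    return float(np.tanh(pair).sum())

query = rng.standard_normal(D)
best_fast = two_stream_retrieve(query)                                 # 1 pass
best_slow = int(np.argmax([one_stream_score(query, g) for g in gallery]))  # N passes
```

With a heavy joint encoder, the N forward passes per query dominate, which is why decomposing a one-stream model into independent encoders yields the large acceleration the highlights report.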

Abstract:


Keywords: Image retrieval, Cross-modal retrieval, Visual-semantic embedding, Similarity search, Vision and language

Article history: Received 21 April 2022; Revised 20 June 2022; Accepted 20 June 2022; Available online 4 July 2022; Version of Record 18 July 2022.

DOI: https://doi.org/10.1016/j.knosys.2022.109316