VLDeformer: Vision–Language Decomposed Transformer for fast cross-modal retrieval
Authors:
Highlights:
• Addresses the efficiency–accuracy trade-off between one-stream and two-stream CMR paradigms.
• Combines the efficiency of two-stream models with the accuracy of one-stream models.
• Significantly outperforms two-stream methods of similar efficiency.
• Achieves over 1000× acceleration with less than a 0.6% average recall drop.
Abstract:
Keywords: Image retrieval, Cross-modal retrieval, Visual-semantic embedding, Similarity search, Vision and language
Review timeline: Received 21 April 2022, Revised 20 June 2022, Accepted 20 June 2022, Available online 4 July 2022, Version of Record 18 July 2022.
DOI: https://doi.org/10.1016/j.knosys.2022.109316