Cross-modal recipe retrieval via parallel- and cross-attention networks learning
Abstract
Cross-modal recipe retrieval refers to the problem of retrieving a food image from a list of candidate images given a textual recipe as the query, or the reverse. However, existing cross-modal recipe retrieval approaches mostly learn the representations of images and recipes independently and merely stitch them together by projecting them into a common space. Such methods overlook the interplay between images and recipes, resulting in suboptimal retrieval performance. Toward this end, we study the problem of cross-modal recipe retrieval from the viewpoint of parallel- and cross-attention network learning. Specifically, we first exploit a parallel-attention network to independently learn the attention weights of components in images and recipes. Thereafter, a cross-attention network is proposed to explicitly model the interplay between images and recipes, simultaneously considering word-guided image attention and image-guided word attention. Lastly, the representations of images and recipes learnt by the parallel- and cross-attention networks are carefully combined and optimized with a pairwise ranking loss. Experiments on two datasets demonstrate the effectiveness and rationality of our proposed solution, both in overall performance comparisons and in micro-level analyses.
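To make the bidirectional cross-attention and the pairwise ranking objective described above concrete, the PyTorch sketch below illustrates one plausible realization. It is a minimal illustration under assumptions, not the authors' exact architecture: the single bilinear affinity map, the mean-pooling of attended contexts, the cosine similarity, and the margin value 0.2 are all choices made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    """Bidirectional cross-attention sketch: word-guided image attention
    and image-guided word attention through one shared affinity matrix."""
    def __init__(self, dim):
        super().__init__()
        self.affinity = nn.Linear(dim, dim, bias=False)  # learned bilinear map (assumed)

    def forward(self, img_regions, words):
        # img_regions: (B, R, d) image region features; words: (B, W, d) word features
        # Affinity score between every word and every image region.
        A = torch.bmm(self.affinity(words), img_regions.transpose(1, 2))    # (B, W, R)
        # Word-guided image attention: each word attends over image regions.
        img_ctx = torch.bmm(F.softmax(A, dim=2), img_regions)               # (B, W, d)
        # Image-guided word attention: each region attends over recipe words.
        word_ctx = torch.bmm(F.softmax(A, dim=1).transpose(1, 2), words)    # (B, R, d)
        # Pool the attended contexts into one embedding per modality.
        return img_ctx.mean(dim=1), word_ctx.mean(dim=1)

def pairwise_ranking_loss(img_emb, rec_emb, margin=0.2):
    """Hinge-based pairwise ranking loss over in-batch negatives."""
    img_emb = F.normalize(img_emb, dim=1)              # cosine similarity (assumed)
    rec_emb = F.normalize(rec_emb, dim=1)
    scores = img_emb @ rec_emb.t()                     # (B, B) similarity matrix
    pos = scores.diag().unsqueeze(1)                   # matched image-recipe pairs
    cost_rec = (margin + scores - pos).clamp(min=0)    # image -> recipe direction
    cost_img = (margin + scores - pos.t()).clamp(min=0)  # recipe -> image direction
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return cost_rec.masked_fill(mask, 0).sum() + cost_img.masked_fill(mask, 0).sum()
```

In this sketch the two directions of attention share a single affinity matrix, so word-to-region and region-to-word weights are derived from the same scores via softmax along different axes; the loss then pushes matched pairs above all in-batch mismatches by a margin in both retrieval directions.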
Keywords: Recipe retrieval, Parallel-attention network, Cross-attention network, Cross-modal retrieval
Article history: Received 3 May 2019, Revised 7 October 2019, Accepted 22 December 2019, Available online 24 December 2019, Version of Record 7 March 2020.
DOI: https://doi.org/10.1016/j.knosys.2019.105428