CMAFGAN: A Cross-Modal Attention Fusion based Generative Adversarial Network for attribute word-to-face synthesis

Authors:

Highlights:

Abstract

Face synthesis from attribute words is a novel and challenging topic in computer vision, with a wide range of potential applications in public security and multimedia. Existing attribute vector-to-face (V2F) synthesis methods mainly generate faces from attribute label vectors, which lack rich semantic feature information and therefore yield low-quality face images. To address this challenge, we advocate attribute word-to-face (W2F) synthesis, which takes attribute-word sequences containing rich semantic information as input. A novel Cross-Modal Attention Fusion based Generative Adversarial Network (CMAFGAN) is proposed to generate faces from facial attribute words. CMAFGAN is built around two blocks, cross-modal attention fusion (CMAF) and word feature transformation (WFT), which are designed to explore the correlation between image features and the corresponding attribute-word features. Experimental results on the CelebA and LFW datasets demonstrate that CMAFGAN achieves state-of-the-art performance, effectively improving the quality of the synthesized faces. In particular, the consistency between the predicted images and the input attribute words (R-precision) reaches 61.24% on CelebA and 64.46% on LFW, significantly outperforming previous methods. In addition, CMAFGAN achieves comparable or better performance than the current best text-to-image synthesis methods (R-precision of 83.41% on Caltech-UCSD Birds-200-2011, CUB).
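
The following is a minimal PyTorch sketch of the general cross-modal attention fusion idea mentioned above: image-region features attend to attribute-word features, and the attended word context is fused back into the image features. The class and parameter names (CrossModalAttentionFusion, img_dim, word_dim, common_dim) are illustrative assumptions and do not reproduce the paper's actual CMAF block.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CrossModalAttentionFusion(nn.Module):
        """Illustrative cross-modal attention: image regions attend to attribute-word features.
        This is a generic sketch, not the CMAF block defined in the paper."""
        def __init__(self, img_dim, word_dim, common_dim):
            super().__init__()
            self.img_proj = nn.Linear(img_dim, common_dim)    # project image region features
            self.word_proj = nn.Linear(word_dim, common_dim)  # project attribute-word features
            self.out_proj = nn.Linear(common_dim, img_dim)    # map fused context back to image space

        def forward(self, img_feats, word_feats):
            # img_feats:  (B, N, img_dim)  -- N spatial regions of the image feature map
            # word_feats: (B, T, word_dim) -- T attribute-word embeddings
            q = self.img_proj(img_feats)                       # (B, N, C) queries from image regions
            k = self.word_proj(word_feats)                     # (B, T, C) keys/values from words
            attn = torch.bmm(q, k.transpose(1, 2))             # (B, N, T) region-to-word similarities
            attn = F.softmax(attn / q.size(-1) ** 0.5, dim=-1) # normalize over words
            context = torch.bmm(attn, k)                       # (B, N, C) word context per region
            return img_feats + self.out_proj(context)          # residual fusion with image features

    # Toy usage with assumed dimensions
    fusion = CrossModalAttentionFusion(img_dim=256, word_dim=300, common_dim=128)
    img = torch.randn(2, 64, 256)    # e.g. an 8x8 feature map flattened to 64 regions
    words = torch.randn(2, 10, 300)  # 10 attribute-word embeddings
    fused = fusion(img, words)
    print(fused.shape)               # torch.Size([2, 64, 256])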

Keywords: Face synthesis, Conditional generative adversarial network, Attribute word-to-face synthesis, Word feature transformation, Cross-modal attention fusion

Article history: Received 23 March 2022, Revised 17 August 2022, Accepted 18 August 2022, Available online 24 August 2022, Version of Record 5 September 2022.

DOI: https://doi.org/10.1016/j.knosys.2022.109750