Diversified text-to-image generation via deep mutual information estimation

Authors:

Highlights:

Abstract

Generating photo-realistic, text-matched, and diverse images simultaneously from given text descriptions is a challenging task in computer vision. Previous works mostly focus on visual realism and semantic relevance, but neglect the diversity of the generated results, which is also an important goal in text-to-image (T2I) generation. In this paper, we design a new module that improves the training of generative adversarial nets (GANs) to generate diverse images from a text input. Our T2I method is based on conditional GANs and has three components: a contextual text embedding module (CTEM), a deep mutual information (MI) estimation module for stacked image generation (DMIEM), and a text-image semantic relevance module (TISRM). CTEM learns text embeddings by exploiting the fine-tuning capability of BERT. DMIEM stacks attention-embedded generators and integrates global/local MI estimation and maximization between the input and output of each generator, which strengthens the relevance of the T2I mapping and progressively produces diverse, photo-realistic images. TISRM enhances the semantic consistency of text-image pairs by regenerating the text descriptions from the generated images. Extensive experiments on three datasets indicate that, compared with state-of-the-art approaches, our method generates text-matched and more diverse images without quality degradation.
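The global/local MI estimation and maximization in DMIEM is in the spirit of Deep InfoMax-style estimators. Below is a minimal PyTorch sketch of a global MI term between the conditioning text embedding and features of the generated image; the module names, dimensions, and the Jensen-Shannon bound are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch (not the authors' code): a Deep InfoMax-style global MI
# estimator between a text embedding and generator output features.
# Names such as TextImageMIEstimator, txt_dim, img_dim are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextImageMIEstimator(nn.Module):
    """Scores (text embedding, image feature) pairs; higher for matched pairs."""
    def __init__(self, txt_dim=256, img_dim=512, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(txt_dim + img_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 1),
        )

    def forward(self, txt, img_feat):
        return self.net(torch.cat([txt, img_feat], dim=1))

def jsd_mi_lower_bound(estimator, txt, img_feat):
    """Jensen-Shannon MI lower bound: positives are matched pairs from the
    batch, negatives pair each text with a shuffled image feature."""
    pos = estimator(txt, img_feat)                                     # joint
    neg = estimator(txt, img_feat[torch.randperm(img_feat.size(0))])   # marginals
    return (-F.softplus(-pos)).mean() - F.softplus(neg).mean()

# During generator training, this bound would be maximized alongside the
# adversarial loss, e.g.:
#   loss_G = adv_loss - lambda_mi * jsd_mi_lower_bound(mi_net, txt_emb, fake_feat)
```

A local variant would score the text embedding against each spatial location of an intermediate feature map and average the resulting scores, following the global/local split described in the abstract.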

Keywords:

Article history: Received 5 January 2021, Revised 4 August 2021, Accepted 9 August 2021, Available online 16 August 2021, Version of Record 3 September 2021.

DOI: https://doi.org/10.1016/j.cviu.2021.103259