GuessWhich? Visual dialog with attentive memory network

Highlights:

• We use a memory network in the cooperative ‘GuessWhich’ game between Q-BOT and A-BOT. It reduces repetition in the generated dialogs and makes image retrieval more efficient.

• We propose a novel Attentive Memory Network that adds a fusion model to the memory network. The fusion model makes effective use of both the manually labeled caption and the image, so the generated dialogs and the predicted image representation can be visually grounded.

• Experiments on the VisDial 1.0 dataset demonstrate that our generated dialogs are natural and precise, and that the results exceed those of state-of-the-art ‘GuessWhich’-based visual dialog algorithms. Extensive image retrieval experiments show that our method also produces more accurate results than the benchmarks.

Article history: Received 3 August 2020, Revised 31 December 2020, Accepted 2 January 2021, Available online 14 January 2021, Version of Record 16 February 2021.