Triple attention network for sentimental visual question answering

Authors:

Highlights:

Abstract

Visual Question Answering (VQA) and Visual Sentiment Analysis (VSA) are two increasingly popular research fields in deep-learning-based multimedia analysis, yet little effort has been made to close the gap between them. Better image understanding can be achieved by analyzing the sentimental attributes of different regions of an image. This paper proposes the Triple Attention Network (TANet), which jointly attends to the features of an image, a text question, and a set of distinct localized sentimental visual attributes in a triple attention mechanism, in order to generate a fully affective answer. Separate experiments demonstrate how two customized image datasets can be used to train a VQA model that applies Long Short-Term Memory (LSTM) and convolutional neural network (CNN) feature attention to the text question and the sentimental attributes, respectively. The additional attention to the sentimental attributes drives the model to focus on more relevant regions of the image, which results in better image understanding and higher-quality answers. The Hadamard product is modified to fuse the three attended feature variables. The experimental results show that high classification accuracy can be achieved together with a multi-attribute affective answer, and that our model outperforms recent VSA and VQA baseline models. The proposed model is a step towards machines that can comprehend natural language just as humans do.
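The abstract does not give implementation details, but the fusion step it describes (a Hadamard product extended to three attended feature vectors) can be illustrated with a minimal sketch. The module name, feature dimensions, and projection layers below are assumptions for illustration only, not the authors' code.

```python
# Minimal sketch (assumed, not the authors' implementation) of fusing three
# attended feature vectors -- image, question, and sentimental attributes --
# with an element-wise (Hadamard) product, as described in the abstract.
import torch
import torch.nn as nn

class TripleHadamardFusion(nn.Module):
    def __init__(self, img_dim=2048, q_dim=1024, attr_dim=300,
                 joint_dim=1024, n_answers=1000):
        super().__init__()
        # Project each attended modality into a common joint space.
        self.img_proj = nn.Linear(img_dim, joint_dim)
        self.q_proj = nn.Linear(q_dim, joint_dim)
        self.attr_proj = nn.Linear(attr_dim, joint_dim)
        self.classifier = nn.Linear(joint_dim, n_answers)

    def forward(self, img_feat, q_feat, attr_feat):
        # Each input is an already-attended feature vector of shape (batch, dim).
        v = torch.tanh(self.img_proj(img_feat))
        q = torch.tanh(self.q_proj(q_feat))
        a = torch.tanh(self.attr_proj(attr_feat))
        # Hadamard product generalized to three terms.
        joint = v * q * a
        return self.classifier(joint)

# Usage with a dummy batch of 4.
fusion = TripleHadamardFusion()
img = torch.randn(4, 2048)       # attended CNN image features
question = torch.randn(4, 1024)  # attended LSTM question features
attrs = torch.randn(4, 300)      # attended sentimental-attribute features
logits = fusion(img, question, attrs)
print(logits.shape)  # torch.Size([4, 1000])
```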

Keywords:

Article history: Received 12 December 2018, Revised 23 May 2019, Accepted 23 September 2019, Available online 4 October 2019, Version of Record 1 November 2019.

DOI: https://doi.org/10.1016/j.cviu.2019.102829