A framework for visual question answering with the integration of scene-text using PHOCs and fisher vectors

Authors:

Highlights:

• Accuracy of our VQA model is improved by using both visual and textual features.

• Our model generates multi-word answers by employing a dynamic pointer network.

• Text tokens are represented by PHOC and FV embeddings together with other features.

• Our model outperforms the previous models on VQA 2.0, Text-VQA and ST-VQA datasets.
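To make the PHOC highlight concrete, the following is a minimal sketch of a Pyramidal Histogram of Characters embedding for a single text token. The function name `phoc`, the 36-symbol alphabet, and the pyramid levels (2, 3, 4, 5) are assumptions chosen for illustration; the configuration used in the paper may differ, and the bigram component often included in PHOC is omitted here.

```python
import string

# Assumed alphabet: lowercase letters plus digits (36 symbols).
ALPHABET = string.ascii_lowercase + string.digits


def phoc(word, levels=(2, 3, 4, 5), alphabet=ALPHABET):
    """Return a binary PHOC vector for `word` (illustrative sketch).

    At each pyramid level the word is split into `level` equal regions.
    A character is assigned to a region when at least half of the
    character's span falls inside that region.
    """
    word = word.lower()
    n = len(word)
    index = {ch: k for k, ch in enumerate(alphabet)}
    vec = []
    for level in levels:
        for region in range(level):
            bits = [0] * len(alphabet)
            # Work in integer units of 1 / (n * level) to avoid float error.
            r_lo, r_hi = region * n, (region + 1) * n
            for i, ch in enumerate(word):
                if ch not in index:
                    continue  # skip symbols outside the assumed alphabet
                c_lo, c_hi = i * level, (i + 1) * level
                overlap = min(c_hi, r_hi) - max(c_lo, r_lo)
                # Character span has length `level` in these units.
                if 2 * overlap >= level:
                    bits[index[ch]] = 1
            vec.extend(bits)
    return vec


if __name__ == "__main__":
    v = phoc("stop")
    print(len(v), sum(v))
```

With these assumed settings the vector has (2 + 3 + 4 + 5) × 36 entries, and two tokens that share characters in similar positions yield overlapping bit patterns, which is what makes the representation useful for noisy OCR tokens.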

Keywords: Computer vision, Dynamic pointer networks, PHOC, Fisher vector, Visual Question Answering (VQA)

Review history: Received 27 April 2020, Revised 1 October 2021, Accepted 26 October 2021, Available online 7 November 2021, Version of Record 12 November 2021.

Official paper URL: https://doi.org/10.1016/j.eswa.2021.116159