A framework for visual question answering with the integration of scene-text using PHOCs and Fisher vectors

Authors:

Highlights:

• The accuracy of our VQA model is improved by using both visual and textual features.

• Our model generates multi-word answers by employing a dynamic pointer network.

• Text tokens are represented by PHOC and Fisher vector (FV) embeddings together with other features.

• Our model outperforms previous models on the VQA 2.0, Text-VQA and ST-VQA datasets.

Keywords: Computer vision, Dynamic pointer networks, PHOC, Fisher vector, Visual Question Answering (VQA)

Review history: Received 27 April 2020, Revised 1 October 2021, Accepted 26 October 2021, Available online 7 November 2021, Version of Record 12 November 2021.

Official paper URL: https://doi.org/10.1016/j.eswa.2021.116159