A framework for visual question answering with the integration of scene-text using PHOCs and fisher vectors
Authors:
Highlights:
• The accuracy of our VQA model is improved by using both visual and textual features.
• Our model generates multi-word answers by employing a dynamic pointer network.
• Text tokens are represented by PHOC and FV embeddings together with other features.
• Our model outperforms previous models on the VQA 2.0, Text-VQA and ST-VQA datasets.
Keywords: Computer vision, Dynamic pointer networks, PHOC, Fisher vector, Visual Question Answering (VQA)
Review history: Received 27 April 2020, Revised 1 October 2021, Accepted 26 October 2021, Available online 7 November 2021, Version of Record 12 November 2021.
Official paper URL: https://doi.org/10.1016/j.eswa.2021.116159