Scene graph generation by multi-level semantic tasks

作者:Peng Tian, Hongwei Mo, Laihao Jiang

摘要

Understanding scene image includes detecting and recognizing objects, estimating the interaction relationships of the detected objects, and describing image regions with sentences. However, since the complexity and variety of scene image, existing methods take object detection or vision relationship estimate as the research targets in scene understanding, and the obtained results are not satisfactory. In this work, we propose a Multi-level Semantic Tasks Generation Network (MSTG) to leverage mutual connections across object detection, visual relationship detection and image captioning, to solve jointly and improve the accuracy of the three vision tasks and achieve the more comprehensive and accurate understanding of scene image. The model uses a message pass graph to mutual connections and iterative updates across the different semantic features to improve the accuracy of scene graph generation, and introduces a fused attention mechanism to improve the accuracy of image captioning while using the mutual connections and refines of different semantic features to improve the accuracy of object detection and scene graph generation. Experiments on Visual Genome and COCO datasets indicate that the proposed method can jointly learn the three vision tasks to improve the accuracy of those visual tasks generation.

论文关键词:Scene understanding, Semantic level, Visual relationship, Scene graph, Image captioning, Attention mechanism

论文评审过程:

论文官网地址:https://doi.org/10.1007/s10489-020-02115-2