Multimodal Machine Learning for Natural Language Processing: Disambiguating Prepositional Phrase Attachments with Images

Authors: Sebastien Delecraz, Leonor Becerra-Bonache, Benoit Favre, Alexis Nasr, Frederic Bechet

Abstract

Although documents are increasingly multimodal, their automatic processing is often monomodal; in particular, natural language processing tasks are typically performed on the textual modality alone. This work extends syntactic parsing to take the image modality into account in addition to text. Specifically, we address prepositional phrase attachment, a hard problem for syntactic parsers because its resolution is largely semantic. Given an image and its caption, the proposed approach resolves the attachment of prepositions in the parse tree using both visual and lexical features. Visual features are derived from the type and position of objects detected in the image and aligned to textual phrases in the caption. A reranker uses this information to reorder the syntactic trees produced by a shift-reduce parser. Trained on the Flickr-PP corpus, which contains multimodal gold-standard attachments, this approach improves over a text-only syntactic parser, particularly for the subset of prepositions that encode location, with gains of up to 17 points in attachment accuracy.
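The reranking step summarized above can be illustrated with a small sketch. It is a hypothetical illustration under stated assumptions, not the authors' implementation: the paper's reranker is learned, whereas here a hand-set linear scorer stands in for it. The feature set (box overlap, centre distance, a missing-region flag), the `Box`/`Candidate` types, the weights, the parser scores and the example boxes are all invented for illustration. Each candidate tree is reduced to the head it assigns to one preposition, and object boxes are assumed to be already aligned to caption phrases, with coordinates normalised to [0, 1].

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Box:
    """Bounding box of a detected object; (x, y) is the top-left corner, coordinates in [0, 1]."""
    x: float
    y: float
    w: float
    h: float


@dataclass
class Candidate:
    """One parse hypothesis, reduced to the head it assigns to a single preposition."""
    head: str                # hypothesised governor of the prepositional phrase
    parser_score: float      # score from the text-only parser (hypothetical values below)
    head_box: Optional[Box]  # image region aligned to the head phrase, if any
    pobj_box: Optional[Box]  # image region aligned to the object of the preposition, if any


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    ix = max(0.0, min(a.x + a.w, b.x + b.w) - max(a.x, b.x))
    iy = max(0.0, min(a.y + a.h, b.y + b.h) - max(a.y, b.y))
    inter = ix * iy
    union = a.w * a.h + b.w * b.h - inter
    return inter / union if union > 0 else 0.0


def centre(b: Box) -> Tuple[float, float]:
    """Centre point of a box."""
    return (b.x + b.w / 2, b.y + b.h / 2)


def visual_features(c: Candidate) -> List[float]:
    """Hypothetical spatial features: [overlap, centre distance, no-evidence flag]."""
    if c.head_box is None or c.pobj_box is None:
        return [0.0, 0.0, 1.0]
    (hx, hy), (px, py) = centre(c.head_box), centre(c.pobj_box)
    dist = ((hx - px) ** 2 + (hy - py) ** 2) ** 0.5
    return [iou(c.head_box, c.pobj_box), dist, 0.0]


def rerank(cands: List[Candidate], weights: List[float]) -> List[Candidate]:
    """Sort candidates by a linear score over [parser score] + visual features."""
    def score(c: Candidate) -> float:
        feats = [c.parser_score] + visual_features(c)
        return sum(w * f for w, f in zip(weights, feats))
    return sorted(cands, key=score, reverse=True)


# Caption "a dog near a man on a bench": does "on a bench" attach to "dog" or "man"?
dog = Box(0.05, 0.60, 0.20, 0.30)
man = Box(0.50, 0.20, 0.20, 0.60)
bench = Box(0.45, 0.60, 0.35, 0.25)
candidates = [
    Candidate("dog", 0.6, dog, bench),  # text-only parser's (hypothetical) preference
    Candidate("man", 0.4, man, bench),
]
weights = [1.0, 2.0, -1.5, -0.5]  # hand-set for illustration; a real reranker learns these
best = rerank(candidates, weights)[0]
print(best.head)  # man
```

With the visual weights set to zero, the ranking falls back to the parser's preference for "dog"; with them enabled, the overlap between the man's box and the bench's box flips the decision to "man". This mirrors, in miniature, why spatial evidence helps most for prepositions that encode location.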

Keywords: Multimodal machine learning, Deep neural networks, Natural language processing, Prepositional phrase attachment resolution

Paper URL: https://doi.org/10.1007/s11063-020-10314-8