Learning soft domain constraints in a factor graph model for template-based information extraction

Authors:

Abstract

The ability to accurately extract key information from textual documents is necessary in several downstream applications, e.g., automatic knowledge base population from text, semantic information retrieval, question answering, or text summarization. However, information extraction (IE) systems are far from error-free and in some cases commit errors that seem obvious to a human expert because they violate common sense or domain knowledge. To improve the performance of IE systems, we focus on the question of how domain knowledge can be incorporated into IE models to reduce the number of spurious extractions. Starting from the assumption that such domain knowledge cannot be incorporated explicitly and manually by domain experts due to the effort and technical complexity involved, we propose a machine learning approach in which domain constraints are acquired as a byproduct of training a model to extract key information in a supervised setting. We frame the task as a template-based information extraction problem in which several dependent slots need to be filled automatically, and propose a factor graph based approach to model the joint distribution of slot assignments given a text. Beyond using standard textual features in factors that score the compatibility of slot fillers with the text, we use additional features that are text-independent and capture soft domain constraints. During parameter learning, each of these constraints receives a weight indicating how strongly it should be enforced. These domain constraints are thus ‘soft’ in the sense that they can be violated, but the system learns to penalize solutions that violate them. The soft constraints we introduce come in two flavors. On the one hand, we incorporate information about the mean of numerical attributes and use features that indicate how far a certain value is from the mean. We call these features single slot soft constraints. On the other hand, we model the pairwise compatibility between slot filler assignments independently of the textual context, thus modeling the (domain) compatibility of the slot assignments. We call these pairwise slot soft constraints. As the main result of our work, we show that learning pairwise slot soft constraints improves the performance of our extraction model compared to single slot soft constraints by up to 6 points in F1, leading to an F1 score of 0.91 for individual template types. Further, the human-readable output format of our model enables the extraction and interpretation of the learned soft constraints. Based on this, we show in an evaluation by domain experts that more than 68% of the learned soft constraints are regarded as plausible.
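
The two kinds of text-independent soft constraints described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the slot names (`species`, `gender`, `weight_g`), the bucketing of distances into whole standard deviations, the feature naming scheme, and the hand-set weights are all assumptions made for illustration. In the approach described above the weights would be learned during training, and text-dependent factors (omitted here) would contribute to the score as well.

```python
from itertools import combinations


def single_slot_features(slot, value, stats):
    """Single slot soft constraint: bucket the distance of a numeric
    value from the slot mean, measured in standard deviations.
    The bucketing scheme is an assumption, not taken from the paper."""
    mean, std = stats[slot]
    z = abs(value - mean) / std
    bucket = min(int(z), 3)  # buckets: 0, 1, 2, and 3-or-more std devs
    return {f"{slot}:dist_from_mean={bucket}": 1.0}


def pairwise_slot_features(assignment, stats):
    """Pairwise slot soft constraint: one indicator feature per pair of
    categorical slot fillers, independent of the text."""
    categorical = sorted(
        (slot, value) for slot, value in assignment.items() if slot not in stats
    )
    feats = {}
    for (s1, v1), (s2, v2) in combinations(categorical, 2):
        feats[f"{s1}={v1}&{s2}={v2}"] = 1.0
    return feats


def soft_constraint_score(assignment, weights, stats):
    """Weighted sum of all active soft-constraint features.
    Features without a weight contribute nothing."""
    feats = {}
    for slot, value in assignment.items():
        if slot in stats:
            feats.update(single_slot_features(slot, value, stats))
    feats.update(pairwise_slot_features(assignment, stats))
    return sum(weights.get(name, 0.0) * v for name, v in feats.items())


# Hypothetical training statistics: (mean, standard deviation) per numeric slot.
stats = {"weight_g": (250.0, 50.0)}

# Hypothetical weights, set by hand here; positive rewards, negative penalizes.
weights = {
    "weight_g:dist_from_mean=0": 0.5,
    "weight_g:dist_from_mean=3": -2.0,
    "gender=female&species=rat": 0.8,
}

plausible = {"species": "rat", "gender": "female", "weight_g": 260.0}
implausible = {"species": "rat", "gender": "female", "weight_g": 900.0}

print(soft_constraint_score(plausible, weights, stats))
print(soft_constraint_score(implausible, weights, stats))
```

Because a violated constraint only lowers the score rather than ruling the assignment out, strong textual evidence can still outweigh it. This is what makes the constraints soft. Since every feature has a readable name, the learned weights can be inspected directly, in the spirit of the expert evaluation mentioned above.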

Keywords: Template-based information extraction, Slot-filling, Probabilistic graphical models, Learning domain constraints, Database population

Review history: Received 24 January 2019, Revised 25 October 2019, Accepted 1 November 2019, Available online 14 November 2019, Version of Record 29 February 2020.

Paper URL: https://doi.org/10.1016/j.datak.2019.101764