Open-world probabilistic databases: Semantics, algorithms, complexity

作者:

摘要

Large-scale probabilistic knowledge bases are becoming increasingly important in academia and industry. They are continuously extended with new data, powered by modern information extraction tools that associate probabilities with knowledge base facts. The state of the art to store and process such data is founded on probabilistic databases. Many systems based on probabilistic databases, however, still have certain semantic deficiencies, which limit their potential applications. We revisit the semantics of probabilistic databases, and argue that the closed-world assumption of probabilistic databases, i.e., the assumption that facts not appearing in the database have the probability zero, conflicts with the everyday use of large-scale probabilistic knowledge bases. To address this discrepancy, we propose open-world probabilistic databases, as a new probabilistic data model. In this new data model, the probabilities of unknown facts, also called open facts, can be assigned any probability value from a default probability interval. Our analysis entails that our model aligns better with many real-world tasks such as query answering, relational learning, knowledge base completion, and rule mining. We make various technical contributions. We show that the data complexity dichotomy, between polynomial time and , for evaluating unions of conjunctive queries on probabilistic databases can be lifted to our open-world model. This result is supported by an algorithm that computes the probabilities of the so-called safe queries efficiently. Based on this algorithm, we prove that evaluating safe queries is in linear time for probabilistic databases, under reasonable assumptions. This remains true in open-world probabilistic databases for a more restricted class of safe queries. We extend our data complexity analysis beyond unions of conjunctive queries, and obtain a host of complexity results for both classical and open-world probabilistic databases. We conclude our analysis with an in-depth investigation of the combined complexity in the respective models.

论文关键词:Knowledge bases,Probabilistic databases,Semantics,Closed-world assumption,Open-world assumption,Inference,Credal sets,Learning,Data complexity,Dichotomy,Lifted inference

论文评审过程:Received 28 January 2020, Revised 19 December 2020, Accepted 10 February 2021, Available online 15 February 2021, Version of Record 15 March 2021.

论文官网地址:https://doi.org/10.1016/j.artint.2021.103474