Constraint-based wrapper specification and verification for cooperative information systems

Abstract

In this paper, we propose the use of semistructured constraints in wrappers to mitigate the impact of poor extraction accuracy on Cooperative Information System (CIS) data quality. Wrappers are a critical element of a CIS whenever the constituent information systems publish semistructured text such as forms, reports, and memos rather than structured databases. The accuracy of CIS data that stem from text depends on the accuracy of the wrappers as well as that of the underlying sources. Wrapper specification is the process of defining patterns (i.e., regular expressions) to extract information from semistructured text. Wrapper verification is the process of ensuring extraction accuracy, that is, that the extracted information faithfully reflects the underlying source. We focus on the problem of extraction accuracy and use constraints on semistructured data for both wrapper specification and verification, so that extraction and verification are performed simultaneously. We apply the concept to wrappers for a CIS of Uniform Domain Name Dispute Resolution Policy (UDRP) arbitration decisions, which are currently distributed across arbitration authorities on three continents. We measure the accuracy of data extracted using constraint-based specification and verification in terms of Type I and Type II errors.
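
To make the idea of performing extraction and verification in one pass more concrete, the following is a minimal Python sketch and not the authors' implementation. Everything in it is an assumption made for illustration: the field names, the regular expressions, the constraint predicates, the sample text, and the reading of a Type I error as a correct value that the constraints reject and a Type II error as an incorrect value that they accept.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Pattern, Tuple


@dataclass
class FieldRule:
    """One wrapper rule: a pattern to extract a value and a constraint to verify it."""
    name: str
    pattern: Pattern[str]
    constraint: Callable[[str], bool]


# Hypothetical rules for a UDRP-style decision; real decisions may look different.
RULES: List[FieldRule] = [
    FieldRule("case_no", re.compile(r"Case No\.\s*(\S+)"),
              lambda v: re.fullmatch(r"D\d{4}-\d{4}", v) is not None),
    FieldRule("domain", re.compile(r"Domain Name:\s*(\S+)"),
              lambda v: "." in v and not v.startswith(".")),
    FieldRule("decision", re.compile(r"Decision:\s*(\w+)"),
              lambda v: v.lower() in {"transferred", "denied", "cancelled"}),
]


def extract_and_verify(text: str) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Extract each field and check its constraint in the same pass.

    Returns (accepted, rejected): values that satisfy their constraint and
    values that were extracted but violate it.
    """
    accepted: Dict[str, str] = {}
    rejected: Dict[str, str] = {}
    for rule in RULES:
        match = rule.pattern.search(text)
        if match is None:
            continue  # nothing extracted for this field
        value = match.group(1)
        target = accepted if rule.constraint(value) else rejected
        target[rule.name] = value
    return accepted, rejected


def error_counts(accepted: Dict[str, str], rejected: Dict[str, str],
                 truth: Dict[str, str]) -> Tuple[int, int]:
    """Count errors against hand-labelled truth (assumed definitions).

    Type I: a correct value that the constraints rejected.
    Type II: an incorrect value that the constraints accepted.
    """
    type_1 = sum(1 for k, v in rejected.items() if truth.get(k) == v)
    type_2 = sum(1 for k, v in accepted.items() if truth.get(k) != v)
    return type_1, type_2


# Made-up sample text, not taken from any real decision.
SAMPLE = """Case No. D2003-0123
Domain Name: example.com
Decision: Pending
"""

if __name__ == "__main__":
    accepted, rejected = extract_and_verify(SAMPLE)
    print("accepted:", accepted)
    print("rejected:", rejected)
```

In this sketch the constraint on `decision` rejects the value `Pending` because it lies outside the assumed set of outcomes. If a hand-labelled reference confirmed that `Pending` was in fact what the source said, the rejection would count as a Type I error under the definitions assumed above.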

Keywords: Information extraction, Data quality, Semistructured data, Data constraints, Verification

Article history: Received 30 May 2003, Revised 1 November 2003, Accepted 15 December 2003, Available online 31 January 2004.

Official paper URL: https://doi.org/10.1016/j.is.2003.12.006