Knowledge Discovery in Grammatically Analysed Corpora
Authors: Sean Wallis, Gerald Nelson
Abstract
Collections of grammatically annotated texts (corpora), and in particular, parsed corpora, present a challenge to current methods of analysis. Such corpora are large and highly structured heterogeneous data sources. In this paper we briefly describe the parsed one-million word ICE-GB corpus, and the ICECUP query system. We then consider the application of knowledge discovery in databases (KDD) to text corpora. Following Cupit and Shadbolt (Proceedings 9th European Knowledge Acquisition Workshop, EKAW '96; Berlin: Springer Verlag, pp. 245–261, 1996), we argue that effective linguistic knowledge discovery must be based on a process of redescription or, more precisely, abstraction, based on the research question to be investigated. Abstraction maps relevant elements from the corpus to an abstract model of the research topic. This mapping may be implemented using a grammatical query representation such as ICECUP's Fuzzy Tree Fragments (FTFs). Since this abstractive process must be both experimental and expert-guided, ultimately a workbench is necessary to maintain, evaluate and refine the abstract model. We conclude with a pilot study, employing our approach, into aspects of noun phrase postmodifying clause structure. The data is analysed using the UNIT machine learning algorithm to search for significant interactions between domain variables. We show that our results are commensurable with those published in the linguistics literature, and discuss how the methodology may be improved.
Keywords: linguistics, grammar, structured datasets, text corpora, redescription, cyclic knowledge discovery
Paper URL: https://doi.org/10.1023/A:1011453128373