Extracting Context-Sensitive Models in Inductive Logic Programming

Abstract

Given domain-specific background knowledge and data in the form of examples, an Inductive Logic Programming (ILP) system extracts models in the data-analytic sense. We view the model-selection step facing an ILP system as a decision problem, the solution of which requires knowledge of the context in which the model is to be deployed. In this paper, "context" is defined by the current specification of the prior class distribution and the client's preferences concerning errors of classification. Within this restricted setting, we consider the use of an ILP system in situations where: (a) contexts can change regularly, for example through changes to class distributions or misclassification costs; and (b) the data are from observational studies, that is, they may not have been collected with any particular context in mind. Some repercussions of these are: (a) any one model may not be the optimal choice for all contexts; and (b) not all the background information provided may be relevant for all contexts. Using results from the analysis of Receiver Operating Characteristic curves, we investigate a technique that can equip an ILP system to reject those models that cannot possibly be optimal in any context. We present empirical results from using the technique to analyse two datasets concerned with the toxicity of chemicals (in particular, their mutagenic and carcinogenic properties). Clients can, and typically do, approach such datasets with quite different requirements. For example, a synthetic chemist would require models with a low rate of commission errors, which could be used to direct the synthesis of new compounds efficiently. A toxicologist, on the other hand, would prefer models with a low rate of omission errors. This would enable a more complete identification of toxic chemicals at a calculated cost of misidentifying non-toxic cases as toxic.
The approach adopted here attempts to obtain a solution that contains models that are optimal for each such user according to the cost function that he or she wishes to apply. In doing so, it also provides one solution to the problem of how the relevance of background predicates is to be assessed in ILP.
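
To illustrate the idea of rejecting models that cannot be optimal in any context, the following is a minimal sketch of one standard result from ROC analysis: a model can only be optimal for some combination of class distribution and misclassification costs if its (false positive rate, true positive rate) point lies on the upper convex hull of all models in ROC space. This is an illustration under stated assumptions and not necessarily the procedure used in the paper. The function name `roc_hull_models` and the model labels in the usage note are hypothetical.

```python
from typing import Dict, List, Tuple

# A model is summarised by one point in ROC space:
# (false positive rate, true positive rate). This is an assumption of the sketch.
Point = Tuple[float, float]


def _cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def roc_hull_models(models: Dict[str, Point]) -> List[str]:
    """Return the names of models lying on the upper convex hull in ROC space.

    Models off the hull are dominated for every class distribution and
    cost setting, so they can be rejected. The trivial classifiers
    (0, 0) "always negative" and (1, 1) "always positive" anchor the hull.
    """
    points = [((0.0, 0.0), None), ((1.0, 1.0), None)]
    points += [(p, name) for name, p in models.items()]
    # Sort by false positive rate, then by true positive rate.
    points.sort(key=lambda item: (item[0][0], item[0][1]))

    hull = []
    for p, name in points:
        # Scanning left to right, the upper hull only makes clockwise turns.
        # Pop the last point while it forms a counter-clockwise or
        # collinear turn with the new point.
        while len(hull) >= 2 and _cross(hull[-2][0], hull[-1][0], p) >= 0:
            hull.pop()
        hull.append((p, name))

    # Drop the two trivial anchor classifiers from the result.
    return [name for _, name in hull if name is not None]
```

For example, with four hypothetical models `{"A": (0.1, 0.6), "B": (0.3, 0.5), "C": (0.4, 0.9), "D": (0.6, 0.4)}`, the function returns `["A", "C"]`: model B is dominated by the segment joining A and C, and D lies below the diagonal. A client who weights commission errors heavily would then choose towards the low false positive end of the hull (A), while one who weights omission errors heavily would choose towards the high true positive end (C).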