Data Mining in Large Databases Using Domain Generalization Graphs

Authors: Robert J. Hilderman, Howard J. Hamilton, Nick Cercone

Abstract

Attribute-oriented generalization summarizes the information in a relational database by repeatedly replacing specific attribute values with more general concepts according to user-defined concept hierarchies. We introduce domain generalization graphs (DGGs) for controlling the generalization of a set of attributes and show how they are constructed. We then present serial and parallel versions of the Multi-Attribute Generalization algorithm for traversing the generalization state space described by joining the DGGs for multiple attributes. Based upon a generate-and-test approach, the algorithm generates all possible summaries consistent with the DGGs. Our experimental results show that significant speedups are possible by partitioning path combinations from the DGGs across multiple processors. We also rank the interestingness of the resulting summaries using measures based upon variance and relative entropy. Our experimental results show that these measures provide an effective basis for analyzing summary data generated from relational databases. Variance appears more useful because it tends to rank the less complex summaries (i.e., those with few attributes and/or tuples) as more interesting.
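
The generate-and-test idea described in the abstract can be illustrated with a small sketch. It is not the authors' Multi-Attribute Generalization algorithm: each DGG is simplified here to a single path (a list of value-to-concept mappings), the sample rows and hierarchies are invented for illustration, and the two scoring functions are assumed forms (squared deviation from a uniform distribution, and relative entropy with respect to a uniform distribution) because the abstract does not give the exact definitions.

```python
from collections import Counter
from itertools import product
from math import log2

# Hypothetical concept hierarchies. Each DGG is simplified to one path:
# a list of mappings, each lifting a value to a more general concept.
CITY_PATH = [
    {"Regina": "Saskatchewan", "Saskatoon": "Saskatchewan", "Calgary": "Alberta"},
    {"Saskatchewan": "ANY", "Alberta": "ANY"},
]
AGE_PATH = [
    {23: "20-29", 27: "20-29", 35: "30-39"},
    {"20-29": "ANY", "30-39": "ANY"},
]
PATHS = [CITY_PATH, AGE_PATH]

# Invented sample relation with attributes (city, age).
ROWS = [("Regina", 23), ("Saskatoon", 27), ("Calgary", 35), ("Regina", 35)]


def generalize(value, path, level):
    """Apply the first `level` mappings of a path to one attribute value."""
    for mapping in path[:level]:
        value = mapping[value]
    return value


def summarize(rows, paths, levels):
    """Generalize every row to the given levels and count identical tuples."""
    return Counter(
        tuple(generalize(v, p, lvl) for v, p, lvl in zip(row, paths, levels))
        for row in rows
    )


def all_summaries(rows, paths):
    """Generate one summary per combination of generalization levels."""
    level_ranges = [range(len(p) + 1) for p in paths]
    return {lv: summarize(rows, paths, lv) for lv in product(*level_ranges)}


def variance_score(summary):
    """Assumed measure: squared deviation of tuple frequencies from uniform."""
    total, m = sum(summary.values()), len(summary)
    if m < 2:
        return 0.0
    q = 1.0 / m
    return sum((c / total - q) ** 2 for c in summary.values()) / (m - 1)


def relative_entropy_score(summary):
    """Assumed measure: relative entropy of tuple frequencies vs. uniform."""
    total, m = sum(summary.values()), len(summary)
    return sum((c / total) * log2((c / total) * m) for c in summary.values())


SUMMARIES = all_summaries(ROWS, PATHS)
RANKED = sorted(SUMMARIES, key=lambda lv: variance_score(SUMMARIES[lv]), reverse=True)
for lv in RANKED:
    print(lv, dict(SUMMARIES[lv]), round(variance_score(SUMMARIES[lv]), 4))
```

Because each level combination is summarized independently, the keys of `all_summaries` could in principle be split across worker processes, which is roughly the kind of partitioning the abstract attributes to the parallel version.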

Keywords: data mining, knowledge discovery, machine learning, knowledge representation, attribute-oriented generalization, domain generalization graphs

Paper URL: https://doi.org/10.1023/A:1008769516670