Mining non-derivable hypercliques

作者:Anna Koufakou

摘要

A hyperclique (Xiong et al. in Proceedings of the IEEE international conference on data mining, pp 387–394, 2003) is an itemset containing items that are strongly correlated with each other, based on a user-specified threshold. Hypercliques (HCs) have been successfully used in a number of applications, for example, clustering (Xiong et al. in Proceedings of the 4th SIAM international conference on data mining, pp 279–290, 2004) and noise removal (Xiong et al. in IEEE Trans Knowl Data Eng 18(3):304–319, 2006). Even though HC has been shown to respond well to datasets with skewed support distribution and low support threshold, it may still grow very large for dense datasets and lower h-confidence threshold. In this paper, we propose a new pruning method based on combining HCs and non-derivable itemsets (NDIs) (Calders and Goethals in Proceedings of the PKDD international conference on principles of data mining and knowledge discovery, pp 74–85, 2002) in order to substantially reduce the amount of generated HCs. Specifically, we propose a new collection of HCs, called non-derivable hypercliques (NDHCs). The NDHC collection is a lossless representation of HCs, that is, given the itemsets in NDHCs, we can generate the complete HC collection and their support, without additional scanning of the dataset. We present an efficient algorithm to mine all NDHC sets, NDHCMiner, and an algorithm to derive all HC sets and their support from NDHCs, NDHCDeriveAll. We experimentally compare our collection, NDHC with HC, with respect to runtime performance as well as total number of generated sets, using real and artificial data. We also show comparisons with another condensed representation of HCs, maximal hyperclique patterns (MHPs). Our experiments show that the NDHC collection offers substantial advantages over HCs, and even MHPs, especially for dense datasets and lower h-confidence values.

论文关键词:Hyperclique, Non-derivable itemset, Frequent itemset mining, Dense data, Skewed support distribution

论文评审过程:

论文官网地址:https://doi.org/10.1007/s10115-013-0660-8