Fast exhaustive subgroup discovery with numerical target concepts
Authors: Florian Lemmerich, Martin Atzmueller, Frank Puppe
Abstract
Subgroup discovery is a key data mining method that aims at identifying descriptions of subsets of the data that show an interesting distribution with respect to a pre-defined target concept. For practical applications, the integration of numerical data is crucial. Therefore, a wide variety of interestingness measures that use a numerical attribute as the target concept has been proposed in the literature. However, efficient mining in this setting is still an open issue. In this paper, we present novel techniques for fast exhaustive subgroup discovery with a numerical target concept. We first survey previously proposed measures in this setting. Then, we explore options for pruning the search space using optimistic estimate bounds. Specifically, we introduce novel bounds in closed form, as well as ordering-based bounds, a new technique for deriving estimates for several types of interestingness measures with no previously known bounds. In addition, we investigate efficient data structures, namely adapted FP-trees and bitset-based data representations, and discuss their interdependencies with interestingness measures and pruning schemes. The presented techniques are incorporated into two novel algorithms. Finally, the benefits of the proposed pruning bounds and algorithms are assessed and compared in an extensive experimental evaluation on 24 publicly available datasets. The novel algorithms consistently reduce runtimes by more than one order of magnitude.
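To make the idea of pruning with optimistic estimate bounds concrete, the following is a minimal sketch and not one of the paper's algorithms. It assumes a simple mean-based interestingness measure, q(P) = n_P * (mean_P - mean_population), which equals the sum of (t - mean_population) over the instances in P. For this measure, the sum of only the positive contributions is an upper bound on the quality of every subset of P, and therefore of every specialization. All function names and the toy data are hypothetical.

```python
from statistics import mean


def impact_quality(targets, population_mean):
    """Assumed measure: n_P * (mean_P - population_mean).

    This equals the sum of (t - population_mean) over the subgroup.
    """
    return sum(t - population_mean for t in targets)


def optimistic_estimate(targets, population_mean):
    """Upper bound on the quality of any subset of the subgroup.

    Only instances with a target value above the population mean can
    raise the quality, so summing their contributions bounds every
    refinement from above.
    """
    return sum(t - population_mean for t in targets if t > population_mean)


def can_prune(targets, population_mean, best_quality_so_far):
    """A branch can be skipped if no refinement can beat the best result."""
    return optimistic_estimate(targets, population_mean) <= best_quality_so_far


# Toy data (hypothetical): target values of the whole dataset and of one subgroup.
population = [1.0, 2.0, 3.0, 4.0, 10.0]
mu = mean(population)
subgroup = [3.0, 4.0, 10.0]

quality = impact_quality(subgroup, mu)
bound = optimistic_estimate(subgroup, mu)
```

In an exhaustive search, `can_prune` would be checked before expanding a subgroup description: if the bound does not exceed the quality of the best subgroup found so far (or the k-th best in a top-k setting), none of its specializations needs to be evaluated.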
Keywords: Subgroup discovery, Pattern mining, Numerical data, Pruning, Data structures, Data mining, Algorithms
Paper URL: https://doi.org/10.1007/s10618-015-0436-8