EAFIM: efficient apriori-based frequent itemset mining algorithm on Spark for big transactional data

Abstract

Frequent itemset mining is a popular tool for discovering knowledge from transactional datasets, and it also serves as the basis for association rule mining. Several algorithms have been proposed to find frequent patterns, of which the apriori algorithm is considered the earliest. Apriori has two significant bottlenecks: first, repeated scanning of the input dataset, and second, the need to generate all candidate itemsets before counting their support values. These bottlenecks reduce the effectiveness of apriori on large-scale datasets. Considerable efforts have been made to reduce these bottlenecks and improve efficiency. When the data size is large, however, even distributed and parallel environments such as MapReduce do not perform well, because the iterative nature of the algorithm incurs high disk overhead. Apache Spark, on the other hand, is gaining significant attention in big data processing because of its in-memory processing capabilities. Apart from utilizing the parallel and distributed computing environment of Spark, the proposed scheme, named efficient apriori-based frequent itemset mining (EAFIM), presents two novel methods to improve efficiency further. Unlike apriori, it generates candidates ‘on the fly,’ i.e., candidate generation and support counting proceed simultaneously while the input dataset is being scanned. Also, instead of using the original input dataset in each iteration, it computes an updated input dataset by removing useless items and transactions. The reduced size of the input dataset in higher iterations enables EAFIM to perform better. Extensive experiments conducted to analyze the efficiency and scalability of EAFIM show that it outperforms other existing methodologies.
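The two ideas described above, counting candidates during the scan and shrinking the input between iterations, can be illustrated with a minimal single-machine sketch. This is not the authors' Spark implementation: the function name `eafim_sketch`, the in-memory list standing in for a distributed dataset, and the exact pruning and reduction rules are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def eafim_sketch(transactions, min_support):
    """Illustrative only: returns {frozenset(itemset): support count}.

    Hypothetical helper, not the paper's code. A plain Python list
    stands in for the distributed dataset that Spark would hold.
    """
    data = [frozenset(t) for t in transactions]
    frequent = {}
    prev_level = set()  # frequent (k-1)-itemsets from the last iteration
    k = 1
    while data:
        # Candidates are generated from each transaction and counted in
        # the same scan, rather than being materialized up front.
        counts = Counter()
        for t in data:
            for cand in combinations(sorted(t), k):
                # Apriori property: every (k-1)-subset must be frequent.
                if k > 1 and any(frozenset(s) not in prev_level
                                 for s in combinations(cand, k - 1)):
                    continue
                counts[cand] += 1
        level = {frozenset(c): n for c, n in counts.items()
                 if n >= min_support}
        if not level:
            break
        frequent.update(level)
        # Assumed reduction rule: drop items that occur in no frequent
        # k-itemset, then drop transactions too short to contain any
        # (k+1)-itemset.
        useful = set().union(*level)
        data = [t & useful for t in data]
        data = [t for t in data if len(t) > k]
        prev_level = set(level)
        k += 1
    return frequent


baskets = [["a", "b", "c"], ["a", "b"], ["a", "c"],
           ["b", "c"], ["a", "b", "c", "d"]]
result = eafim_sketch(baskets, min_support=3)
```

In this toy run, item `d` is removed after the first iteration and only two transactions survive into the third, which is the kind of shrinkage the abstract credits for the speed-up at higher iterations. In a Spark setting the per-transaction loop would presumably become a map over partitions followed by a reduce on candidate counts, but that mapping is not spelled out in the abstract.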