A novel MapReduce algorithm for distributed mining of sequential patterns using co-occurrence information

Authors: Sumalatha Saleti, R. B. V. Subramanyam

Abstract

The Sequential Pattern Mining (SPM) problem has been widely studied and extended in several directions. With the tremendous growth in dataset sizes, traditional algorithms no longer scale. To address this scalability issue, a few researchers have recently developed distributed algorithms based on MapReduce. However, the existing MapReduce algorithms require multiple rounds of MapReduce, which increases communication and scheduling overhead. They also do not address the handling of long sequences, and they generate a huge number of candidate sequences that do not appear in the input database, which enlarges the search space and increases the number of candidates that require support counting. Our algorithm is a two-phase MapReduce algorithm that uses pruning strategies to generate only promising candidate sequences. It also reduces the search space, making support computation more efficient. We make use of item co-occurrence information, and the proposed Sequence Index List (SIL) data structure helps compute support quickly. Experimental results show that the proposed algorithm outperforms the existing MapReduce algorithms for the SPM problem.
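
The abstract does not specify how the co-occurrence information is built or used, so the following is only a minimal single-machine sketch of the general idea of co-occurrence-based candidate pruning, not the authors' MapReduce algorithm or their SIL structure. It assumes each sequence is a plain list of single items; the names `build_cooccurrence_map`, `is_promising`, and `min_sup` are hypothetical. The pruning rule relies on the fact that if a pattern extended by an item is frequent, then every item of the pattern must be followed by that item in at least `min_sup` sequences.

```python
from collections import defaultdict


def build_cooccurrence_map(sequences, min_sup):
    """Map each item a to the set of items b that occur after a
    in at least min_sup sequences (each sequence counted once per pair)."""
    pair_support = defaultdict(int)
    for seq in sequences:
        seen_pairs = set()
        for i, a in enumerate(seq):
            for b in seq[i + 1:]:
                seen_pairs.add((a, b))
        # Count each ordered pair at most once per sequence.
        for pair in seen_pairs:
            pair_support[pair] += 1

    cmap = defaultdict(set)
    for (a, b), support in pair_support.items():
        if support >= min_sup:
            cmap[a].add(b)
    return dict(cmap)


def is_promising(pattern, item, cmap):
    """Return False if appending `item` to `pattern` cannot be frequent,
    because some item of the pattern is not frequently followed by `item`."""
    return all(item in cmap.get(a, set()) for a in pattern)


# Hypothetical toy database, for illustration only.
sequences = [["a", "b", "c"], ["a", "c"], ["b", "a", "c"], ["a", "b"]]
cmap = build_cooccurrence_map(sequences, min_sup=2)
```

With this toy input, a candidate such as `["a", "b"]` extended by `"c"` passes the check, while `["b"]` extended by `"a"` is discarded without any support counting, since `"a"` follows `"b"` in only one sequence.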

Keywords: Big data, Co-occurrence map, Data mining, MapReduce framework, Sequential pattern mining

Official paper URL: https://doi.org/10.1007/s10489-018-1259-2