An empirical study on selective partitioning dimensions for partition-based similarity joins

Abstract

Real-world application data are usually distributed sparsely and non-uniformly in a huge high-dimensional space. Hence, selecting effective partitioning dimensions is crucial for partition-based similarity joins. In this paper, we present two data partitioning algorithms and evaluate them. PerDimSelect selects some dimension axes from the pool of original perpendicular dimension axes and maps each data point into the reduced-dimensional space. DiaDimSelect creates a one-dimensional axis by combining some of the original perpendicular dimensions and maps each data point onto the newly created dimension. In the experiments, several measures are used to compare the performance of the algorithms, including CPU cost, total response time, and the number of created buckets. In conclusion, DiaDimSelect performs better than PerDimSelect because it creates far fewer partition buckets as the number of partitioning dimensions increases, which keeps the I/O cost low while decreasing the CPU cost considerably.
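
To make the contrast between the two mappings concrete, the following is a minimal sketch and not the paper's implementation. It assumes an epsilon-distance join with bucket width `eps`, and it assumes that the "combined" axis is the diagonal of the selected dimensions (the normalized sum of their coordinates). The function names `per_dim_buckets` and `dia_dim_buckets` and all parameters are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def per_dim_buckets(points, dims, eps):
    """Illustrative PerDimSelect-style mapping (assumed, not from the paper).

    Each point is mapped into the reduced space spanned by the selected
    perpendicular axes `dims`; its bucket is the grid cell of width `eps`
    that it falls into, so the key is a tuple with one entry per axis.
    """
    buckets = defaultdict(list)
    for idx, p in enumerate(points):
        key = tuple(math.floor(p[d] / eps) for d in dims)
        buckets[key].append(idx)
    return buckets


def dia_dim_buckets(points, dims, eps):
    """Illustrative DiaDimSelect-style mapping (assumed, not from the paper).

    The selected axes are combined into a single diagonal axis. Each point
    is projected onto the unit vector along that diagonal, and its bucket
    is the interval of width `eps` containing the projection, so the key
    is a single integer however many axes are combined.
    """
    norm = math.sqrt(len(dims))  # length of the all-ones vector over dims
    buckets = defaultdict(list)
    for idx, p in enumerate(points):
        projection = sum(p[d] for d in dims) / norm
        buckets[math.floor(projection / eps)].append(idx)
    return buckets


if __name__ == "__main__":
    # Hypothetical toy data: four 3-dimensional points.
    pts = [(0.05, 0.05, 0.9), (0.15, 0.95, 0.1),
           (0.95, 0.15, 0.5), (0.95, 0.95, 0.2)]
    print(len(per_dim_buckets(pts, [0, 1], 0.1)), "grid buckets")
    print(len(dia_dim_buckets(pts, [0, 1], 0.1)), "diagonal buckets")
```

Under these assumptions, two points within Euclidean distance `eps` of each other differ by at most `eps` along any unit direction, so in both mappings join candidates only need to be sought in the same or adjacent buckets. The grid key gains one component per selected axis, so the number of possible cells grows multiplicatively with the number of axes, whereas the diagonal key stays one-dimensional. This is consistent with the bucket-count behavior the abstract reports.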

Keywords: Partition-based similarity join, Partitioning dimension selection, Diagonal dimension

Article history: Received 27 July 2006, Revised 14 October 2006, Accepted 22 February 2007, Available online 2 April 2007.

Official URL: https://doi.org/10.1016/j.datak.2007.02.006