A pruning strategy to improve pairwise comparison-based near-duplicate detection

作者:Roya Hassanian-esfahani, Mohammad-javad Kargar

摘要

Efficient and accurate near-duplicate detection is a trending topic of research. Complications arise from the great time and space complexities of existing algorithms. This study proposes a novel pruning strategy to improve pairwise comparison-based near-duplicate detection methods. After parsing the documents into punctuation-delimited blocks called chunks, it decides between the categories of “near duplicate,” “non-duplicate” or “suspicious” by applying certain filtering rules. This early decision makes it possible to disregard many of the non-necessary computations—on average 92.95% of them. Then, for the suspicious pairs, common chunks and short chunks are removed and the remaining subsets are reserved for near-duplicate detection. Size of the remaining subsets is on average 4.42% of the original corpus size. Evaluation results show that near-duplicate detection with the proposed strategy in its best configuration (CHT = 8, τ = 0.1) has F-measure = 87.22% (precision = 86.91% and recall = 87.54%). Its F-measure is comparable with the SpotSig method with less execution time. In addition, applying the proposed strategy in a near-duplicate detection process eliminates the need for preprocessing. It is also tunable to achieve the intended levels of near duplication and noise suppression.

论文关键词:Near-duplicate detection, Pruning strategy, Similarity

论文评审过程:

论文官网地址:https://doi.org/10.1007/s10115-018-1299-2