Effectiveness of optimal incremental multi-step nearest neighbor search

Abstract

The development of techniques that facilitate effective similarity search is important for many applications, such as multimedia databases, content-based image retrieval, molecular biology, medical imaging, and object recognition. Two common operations in this context are range queries and k-nearest neighbor search in high-dimensional space. However, the distance measures used to determine the dissimilarity between high-dimensional feature vectors are often expensive to compute. To reduce the number of expensive distance calculations in the search process, Korn, Sidiropoulos, Faloutsos, Siegel, and Protopapas (1996) proposed a multi-step algorithm with two stages: filtering and refinement. It employs an easily computable lower-bound distance measure to select a candidate set in the filtering stage, and confines the expensive distance computation to that small candidate set in the refinement stage. This algorithm was later improved by Seidl and Kriegel (1998) to produce a candidate set of optimal size in the filtering stage; the improved algorithm is said to be filtering optimal. However, the improved algorithm cannot produce results incrementally in the refinement stage: it can only start to output results after the whole search process stops, which is a disadvantage in real applications. In this paper, we experimentally demonstrate the applicability and effectiveness of an extended version of the algorithm that produces the nearest neighbors incrementally and optimally, in the sense that a nearest neighbor is output as soon as it can be determined from the information available; thus, nearest neighbors are produced in order of distance. Our algorithm is both filtering and refinement optimal, and is well suited to real applications.
We have already proved the optimality of the proposed extended algorithm (Zhang, Alhajj, & Rokne, 2008); here we empirically demonstrate its independence of the number of nearest neighbors and its effectiveness in retrieving results early compared with the previous algorithm.
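
The filter-and-refine idea with incremental output can be illustrated with a minimal Python sketch. This is not the authors' implementation, only one plausible reading of the description above. The function names (`exact_dist`, `lower_bound_dist`, `incremental_multistep_nn`) are hypothetical, the "expensive" distance is assumed to be plain Euclidean distance, the lower bound is assumed to be Euclidean distance on the first `m` coordinates, and the incremental ranking that an index would normally provide is simulated by sorting.

```python
import heapq
import math


def exact_dist(a, b):
    """Stand-in for the expensive distance (assumed here: full Euclidean)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def lower_bound_dist(a, b, m=1):
    """Cheap lower bound (assumed here: Euclidean on the first m coordinates)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a[:m], b[:m])))


def incremental_multistep_nn(query, data, k):
    """Yield (exact distance, index) for up to k nearest neighbors, in order."""
    # Filtering stage: visit objects in ascending lower-bound order.
    # A real system would pull this ranking lazily from an index.
    ranking = sorted((lower_bound_dist(query, obj), i) for i, obj in enumerate(data))
    refined = []  # min-heap of refined candidates: (exact distance, index)
    produced = 0
    for lb, i in ranking:
        # Every object not yet refined has exact distance >= its lower bound >= lb,
        # so a refined candidate with exact distance <= lb can be output now.
        while refined and refined[0][0] <= lb and produced < k:
            yield heapq.heappop(refined)
            produced += 1
        if produced == k:
            return  # remaining objects are never refined
        # Refinement stage: pay for the exact distance only when needed.
        heapq.heappush(refined, (exact_dist(query, data[i]), i))
    # Ranking exhausted: the remaining refined candidates are final.
    while refined and produced < k:
        yield heapq.heappop(refined)
        produced += 1
```

Because the function is a generator, a caller can consume the first neighbor before later ones have been determined, for example with `next(incremental_multistep_nn(q, data, k))`. Once `k` results have been produced, the objects remaining in the ranking never incur an exact distance computation.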