High dimensional nearest neighbor searching

作者：

Highlights：

•

摘要

As databases increasingly integrate different types of information such as time-series, multimedia and scientific data, it becomes necessary to support efficient retrieval of multi-dimensional data. Both the dimensionality and the amount of data that needs to be processed are increasing rapidly. As a result of the scale and high dimensional nature, the traditional techniques have proven inadequate. In this paper, we propose search techniques that are effective especially for large high dimensional data sets. We first propose VA+-file technique which is based on scalar quantization of the data. VA+-file is especially useful for searching exact nearest neighbors (NN) in non-uniform high dimensional data sets. We then discuss how to improve the search and make it progressive by allowing some approximations in the query result. We develop a general framework for approximate NN queries, discuss various approaches for progressive processing of similarity queries, and develop a metric for evaluation of such techniques. Finally, a new technique based on clustering is proposed, which merges the benefits of various approaches for progressive similarity searching. Extensive experimental evaluation is performed on several real-life data sets. The evaluation establishes the superiority of the proposed techniques over the existing techniques for high dimensional similarity searching. The techniques proposed in this paper are effective for real-life data sets, which are typically non-uniform, and they are scalable with respect to both dimensionality and size of the data set.

论文关键词：High dimensional data,Nearest neighbor queries,Indexing,Similarity search,Approximate and progressive search,Non-uniform data,Scalability,Performance

论文评审过程：Received 29 April 2004, Revised 11 September 2004, Accepted 11 December 2004, Available online 2 February 2005.

论文官网地址：https://doi.org/10.1016/j.is.2005.01.001