Hubness-aware shared neighbor distances for high-dimensional \(k\)-nearest neighbor classification

作者:Nenad Tomašev, Dunja Mladenić

摘要

Learning from high-dimensional data is usually quite challenging, as captured by the well-known phrase curse of dimensionality. Data analysis often involves measuring the similarity between different examples. This sometimes becomes a problem, as many widely used metrics tend to concentrate in high-dimensional feature spaces. The reduced contrast makes it more difficult to distinguish between close and distant points, which renders many traditional distance-based learning methods ineffective. Secondary distances based on shared neighbor similarities have recently been proposed as one possible solution to this problem. However, these initial metrics failed to take hubness into account. Hubness is a recently described aspect of the dimensionality curse, and it affects all sorts of \(k\)-nearest neighbor learning methods in severely negative ways. This paper is the first to discuss the impact of hubs on forming the shared neighbor similarity scores. We propose a novel, hubness-aware secondary similarity measure \(simhub_s\) and an extensive experimental evaluation shows it to be much more appropriate for high-dimensional data classification than the standard \(simcos_s\) measure. The proposed similarity changes the underlying \(k\)NN graph in such a way that it reduces the overall frequency of label mismatches in \(k\)-neighbor sets and increases the purity of occurrence profiles, which improves classifier performance. It is a hybrid measure, which takes into account both the supervised and the unsupervised hubness information. The analysis shows that both components are useful in their own ways and that the measure is therefore properly defined. This new similarity does not increase the overall computational cost, and the improvement is essentially ‘free’.

论文关键词:Hubs, Metric learning, Curse of dimensionality, \(k\)-nearest neighbor, Classification, Shared neighbor distances

论文评审过程:

论文官网地址:https://doi.org/10.1007/s10115-012-0607-5