On some transformations of high dimension, low sample size data for nearest neighbor classification

Authors: Subhajit Dutta, Anil K. Ghosh

Abstract

For data with more variables than the sample size, phenomena like the concentration of pairwise distances, violation of cluster assumptions, and the presence of hubness often have adverse effects on the performance of the classic nearest neighbor classifier. To cope with such problems, some dimension reduction techniques, like those based on random linear projections and principal component directions, have been proposed in the literature. In this article, we construct nonlinear transformations of the data based on inter-point distances, which also lead to a reduction in data dimension. More importantly, for such high dimension, low sample size data, they enhance separability among the competing classes in the transformed space. When the classic nearest neighbor classifier is used on the transformed data, it usually yields lower misclassification rates. Under appropriate regularity conditions, we derive asymptotic results on the misclassification probabilities of nearest neighbor classifiers based on the \(l_2\) norm and the \(l_p\) norms (with \(p \in (0,1]\)) in the transformed space, when the training sample size remains fixed and the dimension of the data grows to infinity. The strength of the proposed transformations in the classification context is demonstrated by analyzing several simulated and benchmark data sets.
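To make the idea concrete, here is a minimal sketch of one simple transformation based on inter-point distances: each observation is mapped to its vector of Euclidean distances to the training points, and the 1-nearest-neighbor rule is then applied in this transformed space. This is only an illustrative instance of a distance-based transformation under assumed conventions; the paper's exact constructions and theoretical setting may differ.

```python
import numpy as np

def distance_transform(X, anchors):
    # Map each row of X to its vector of Euclidean (l2) distances to the
    # anchor points (here, the training sample itself). This is one simple
    # inter-point-distance-based transformation, used only for illustration.
    diffs = X[:, None, :] - anchors[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1))

def nn_classify(X_train, y_train, X_test):
    # 1-NN carried out in the transformed space: neighbors are found by
    # comparing distance-profile vectors rather than the raw features.
    T_train = distance_transform(X_train, X_train)
    T_test = distance_transform(X_test, X_train)
    preds = []
    for t in T_test:
        d = np.linalg.norm(T_train - t, axis=1)
        preds.append(y_train[np.argmin(d)])
    return np.array(preds)
```

In an HDLSS setting (e.g., a few dozen training points in hundreds of dimensions), the transformed representation has dimension equal to the training sample size, so the transformation itself reduces the data dimension as described in the abstract.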

Keywords: Bayes risk, HDLSS data, High-dimensional geometry, Inter-point distances, Law of large numbers, Misclassification probability, Reproducing kernel Hilbert space, \(\rho\)-mixing sequences


DOI: https://doi.org/10.1007/s10994-015-5495-y