Variance reduction in feature hashing using MLE and control variate method

Authors: Bhisham Dev Verma, Rameshwar Pratap, Manoj Thakur

Abstract

The feature hashing algorithm introduced by Weinberger et al. (2009) is a popular dimensionality reduction algorithm that compresses high-dimensional data points into low-dimensional ones while closely approximating their pairwise inner products. It has been used in many fundamental machine learning applications, such as model compression (Chen et al. 2015), spam classification (Weinberger et al. 2009), compression of text classifiers (Joulin et al. 2016), and large-scale image classification (Mensink et al. 2012). However, a limitation of this approach is that the variance of its inner product estimator tends to be large when the reduced dimension is small, making the estimate less reliable. We address this challenge and suggest two simple and practical solutions in this work. Our approach relies on the control variate (CV) method and the maximum likelihood estimator (MLE), which are popular variance reduction techniques in statistics. We show that these methods lead to significant variance reduction in inner product similarity estimation. We give theoretical bounds on this reduction and complement them with extensive experiments on synthetic and real-world datasets. Given the simplicity and effectiveness of our approach, we hope that it can be adopted in practice.
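
To make the variance issue concrete, the following is a minimal sketch of baseline feature hashing only. It does not implement the CV or MLE estimators proposed in the paper. The function names, the use of NumPy random draws in place of fixed hash functions, and the trial count are all assumptions made for illustration.

```python
import numpy as np


def feature_hash(x, bucket, sign, k):
    """Compress x (length d) into a length-k sketch.

    Coordinate i is multiplied by sign[i] and added to position bucket[i].
    """
    return np.bincount(bucket, weights=sign * x, minlength=k)


def estimate_inner_product(x, y, k, rng):
    """Estimate <x, y> from two sketches built with the same random maps.

    Assumption: random draws stand in for the two hash functions
    (bucket index and sign) used by feature hashing.
    """
    d = x.shape[0]
    bucket = rng.integers(0, k, size=d)
    sign = rng.choice([-1.0, 1.0], size=d)
    return feature_hash(x, bucket, sign, k) @ feature_hash(y, bucket, sign, k)


def empirical_mean_and_variance(x, y, k, trials=2000, seed=0):
    """Repeat the estimate with fresh random maps; return (mean, variance)."""
    rng = np.random.default_rng(seed)
    estimates = np.array(
        [estimate_inner_product(x, y, k, rng) for _ in range(trials)]
    )
    return estimates.mean(), estimates.var()
```

Calling `empirical_mean_and_variance` on the same pair of vectors with a small and a large `k` should show the mean staying near the true inner product while the variance grows as `k` shrinks, which is the limitation the abstract describes.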

Keywords: Dimensionality reduction, Variance reduction, Control variate, Maximum likelihood estimator, Sketching

Paper URL: https://doi.org/10.1007/s10994-022-06166-z