Unsupervised dimensionality reduction for very large datasets: Are we going to the right direction?

作者:

Highlights:

摘要

Given a set of millions or even billions of complex objects for descriptive data mining, how to effectively reduce the data dimensionality? It must be performed in an unsupervised way. Unsupervised dimensionality reduction is essential for analytical tasks like clustering and outlier detection because it helps to overcome the drawbacks of the “curse of high dimensionality”. The state-of-the-art approach is to preserve the data variance by means of well-known techniques, such as PCA, KPCA, SVD, and other techniques based on those that have been mentioned, such as PUFS. But, is it always the best strategy to follow? This paper presents an exploratory study performed to compare two distinct approaches: (a) the standard variance preservation, and; (b) one alternative, Fractal-based solution that is rarely used, for which we propose one fast and scalable Spark-based algorithm using a novel feature partitioning approach that allows it to tackle data of high dimensionality. Both strategies were evaluated by inserting into 11 real-world datasets, with up to 123.5 million elements and 518 attributes, at most 500 additional attributes formed by correlations of many kinds, such as linear, quadratic, logarithmic and exponential, and verifying their abilities to remove this redundancy. The results indicate that, at least for large datasets of dimensionality with up to ∼1,000 attributes, our proposed Fractal-based algorithm is the best option. It accurately and efficiently removed the redundant attributes in nearly all cases, as opposed to the standard variance-preservation strategy that presented considerably worse results, even when applying the KPCA approach that is made for non-linear correlations.

论文关键词:Unsupervised dimensionality reduction,Descriptive data mining,Very large datasets,Fractal theory

论文评审过程:Received 27 September 2019, Revised 8 February 2020, Accepted 15 March 2020, Available online 24 March 2020, Version of Record 16 April 2020.

论文官网地址:https://doi.org/10.1016/j.knosys.2020.105777