TCIC_FS: Total correlation information coefficient-based feature selection method for high-dimensional data

Authors:

Abstract

Classifying high-dimensional data remains challenging. Feature selection acts as a filter that removes irrelevant or redundant features, and it has made considerable progress. However, the problem is still open because current methods consider only the correlation between two variables and leave the correlation among multiple variables largely unaddressed, even though multivariate interactions can contain joint information that cannot be obtained pairwise. Furthermore, many feature selection methods depend on hyperparameters, whose settings require prior knowledge and lack interpretability. To address these problems, this paper proposes the total correlation information coefficient-based feature selection (TCIC_FS) method, which selects the optimal feature subset without setting hyperparameters while fully considering the correlations among multiple variables. First, based on a Gaussian copula, the total correlation information coefficient (TCIC) is proposed to evaluate the correlations among multiple variables. Compared with existing multivariate correlation measures, TCIC can measure a wider range of multivariate correlations, including linear, nonlinear, functional, and nonfunctional correlations. Second, a novel evaluation mechanism based on TCIC is proposed to measure the relevance between features and classes and the redundancy between a single feature and the selected feature subset. Finally, the TCIC_FS method is constructed from the TCIC and the evaluation mechanism. Compared with the baseline methods, TCIC_FS has the lowest time complexity and obtains the smallest optimal feature subset in a single selection. Therefore, TCIC_FS is more suitable for processing high-dimensional data.
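The abstract does not give the formula for TCIC, so the sketch below is not the authors' coefficient. It only illustrates the underlying idea of measuring dependence among several variables at once through a Gaussian copula. Each column is mapped to normal scores through its ranks, and the total correlation is then estimated with the standard closed form for a Gaussian copula, TC = -1/2 log det(R), where R is the correlation matrix of the normal scores. The function names and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from statistics import NormalDist


def normal_scores(x):
    """Map one column to standard-normal scores via its empirical ranks."""
    n = len(x)
    ranks = np.argsort(np.argsort(x)) + 1      # ranks 1..n, ties broken by order
    u = ranks / (n + 1)                        # stays strictly inside (0, 1)
    inv_cdf = NormalDist().inv_cdf
    return np.array([inv_cdf(p) for p in u])


def gaussian_copula_total_correlation(data):
    """Estimate the total correlation (in nats) among the columns of `data`.

    Assumes a Gaussian copula, for which TC = -0.5 * log det(R).
    This is a generic estimator, not the TCIC defined in the paper.
    """
    z = np.column_stack([normal_scores(col) for col in data.T])
    r = np.corrcoef(z, rowvar=False)           # correlation of the normal scores
    _, logdet = np.linalg.slogdet(r)           # numerically stable log-determinant
    return -0.5 * logdet


# Hypothetical data: one set with a nonlinear (monotone) link, one independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
noise = rng.normal(size=500)
dependent = np.column_stack([a, b, np.exp(a + 0.5 * noise)])
independent = rng.normal(size=(500, 3))

tc_dep = gaussian_copula_total_correlation(dependent)
tc_ind = gaussian_copula_total_correlation(independent)
print(f"dependent columns:   {tc_dep:.3f} nats")
print(f"independent columns: {tc_ind:.3f} nats")
```

Because the rank transform is invariant to monotone transformations, the exponential link in the third column is scored the same way as a linear one. Non-monotone relationships are not captured by this plain estimator, which is one reason a dedicated coefficient such as TCIC may be needed.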

Keywords: Multivariate correlation, Feature selection, High dimensional data, Gaussian copula, Evaluation mechanism, Recommendation system

Review history: Received 29 March 2021, Revised 17 August 2021, Accepted 18 August 2021, Available online 20 August 2021, Version of Record 9 September 2021.

Official URL: https://doi.org/10.1016/j.knosys.2021.107418