Distributed multi-label feature selection using individual mutual information measures

Authors:

Abstract

Multi-label learning generalizes traditional learning by allowing an instance to belong to multiple labels simultaneously. As a result, multi-label data are characterized by a large label space and by dependencies among labels. These challenges have been addressed by feature selection techniques, which improve the accuracy of the final model. However, the large number of features, combined with the large number of labels, calls for new approaches to manage data effectively and efficiently in distributed computing environments. This paper proposes a distributed model on Apache Spark that computes a score measuring the quality of each feature with respect to multiple labels. We propose two approaches to aggregating the mutual information of multiple labels: Euclidean Norm Maximization (ENM) and Geometric Mean Maximization (GMM). The former selects the features with the largest L2-norm of their per-label mutual information values, whereas the latter selects the features with the largest geometric mean. Experiments compare 9 distributed multi-label feature selection methods on 12 datasets across 12 metrics. Results, validated through statistical analysis, indicate that ENM outperforms the reference methods by maximizing the relevance while minimizing the redundancy of the selected features in constant selection time.
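
To make the two aggregation strategies concrete, the sketch below scores and ranks features on Apache Spark. It assumes the per-label mutual information values have already been computed and sit in a hypothetical matrix `miMatrix`, where `miMatrix(f)(l)` holds the mutual information between feature `f` and label `l`; the matrix values, object names, and selection size `k` are illustrative placeholders, not the paper's actual implementation.

```scala
import org.apache.spark.sql.SparkSession

object MiAggregationSketch {
  // ENM score: L2-norm of a feature's per-label mutual information vector.
  def enm(mi: Array[Double]): Double = math.sqrt(mi.map(x => x * x).sum)

  // GMM score: geometric mean of the same vector
  // (a zero MI for any label drives the score to zero).
  def gmm(mi: Array[Double]): Double = math.pow(mi.product, 1.0 / mi.length)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("MI aggregation sketch").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical precomputed MI matrix: miMatrix(f)(l) = I(X_f; Y_l).
    val miMatrix: Array[Array[Double]] = Array(
      Array(0.20, 0.05, 0.10), // feature 0 vs labels 0..2
      Array(0.02, 0.30, 0.01), // feature 1
      Array(0.15, 0.12, 0.14)) // feature 2

    val k = 2 // number of features to select (placeholder)

    // Score every feature in parallel and keep the k best by ENM.
    val selected = sc.parallelize(miMatrix.zipWithIndex)
      .map { case (mi, f) => (f, enm(mi)) } // use gmm(mi) for the GMM criterion
      .top(k)(Ordering.by(_._2))            // largest scores first
      .map(_._1)

    println(s"Selected features: ${selected.mkString(", ")}")
    spark.stop()
  }
}
```

Swapping `enm` for `gmm` in the `map` step switches to the geometric-mean criterion; both reduce a feature's per-label MI vector to a single scalar, so the ranking logic is identical for either measure and the per-feature scoring parallelizes trivially across the cluster.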

Keywords: Multi-label learning, Feature selection, Mutual information, Distributed computing, Apache Spark

Article history: Received 11 June 2019, Revised 16 September 2019, Accepted 18 September 2019, Available online 20 September 2019, Version of Record 20 January 2020.

DOI: https://doi.org/10.1016/j.knosys.2019.105052