Handling missing values: A study of popular imputation packages in R
作者:
Highlights:
•
摘要
In real world data are often plagued by missing values which adversely affects the final outcome of the analysis based on such data. The missing values can be handled using various techniques like deletion or imputation. Of late, R has become one of the most preferred platform for carrying out data analysis, and its popularity is growing further. R provides various packages for handling missing values through imputation. The presence of multiple packages however, calls for an analysis of their comparative performance and examine their suitability for handling a given set of data. The performance of different R packages may differ for different datasets and may depend on the size of the dataset and richness of the missing values in the datasets. In this paper, the authors perform comparative study of the performance of the common R packages, namely VIM, MICE, MissForest, and HMISC, used for missing value imputation. The authors measured the performances of the said packages in terms of their imputation time, imputation efficiency and the effect on the variance. The imputation efficiency was measured in terms of the difference in predictive performance of a model built using original dataset vis-à-vis a dataset with imputed values. Similarly, the variance of the variables in the original dataset was compared that of corresponding variables in the imputed dataset. A missing value imputation package can be considered to be better if it consumes less imputation time and provides high imputation accuracy. Also in terms of variance, one would like to have the imputation package maintain the original variance of the variables. On analysing the four imputation packages on two datasets over three predictive algorithms–Logistic Regression, Support Vector Machines, and Artificial Neural Networks–it was observed that the performances varies depending on the size of the dataset, and the missing values present in them. The study highlights that certain missing value package used in conjunction with a given predictive algorithm provides better performance, which is again a function of the dataset characteristics.
论文关键词:Missing value handling,VIM,MICE,MissForest,HMISC,Imputation Time,Imputation Accuracy
论文评审过程:Received 21 December 2017, Revised 23 April 2018, Accepted 28 June 2018, Available online 6 July 2018, Version of Record 12 September 2018.
论文官网地址:https://doi.org/10.1016/j.knosys.2018.06.012