Regression imputation optimizing sample size and emulation: Demonstrations and comparisons to prominent methods

Authors:

Highlights:

• The research addresses the problem of extensive missing values among some Compustat variables used in accounting research.

• We propose an accessible method for imputing values that yields parsimonious, predictive imputation models and imputed hybrid measures with large sample sizes and high content validity.

• We demonstrate our approach, which is based on ordinary least squares regression and leverages variables with larger sample sizes to inform those with smaller sample sizes.

• In comparisons with K-Nearest Neighbor, missForest, and LASSO, the proposed technique was superior on all four evaluative criteria.

• We apply the proposed approach and the comparisons to 30 Compustat inputs missing more than 25% of values.

• We report evaluative criteria for all final models and hybrid measures produced in the study.

Abstract

Missing input values weaken the ability of information systems (IS) researchers to make calculations, thereby reducing effective sample sizes and statistical power. Such technical problems with data cascade into scientific limitations, resulting in the neglect of social and economic issues. Extensive missing values therefore force researchers to make crucial decisions, such as whether to impute and, if so, which strategy to use. This study presents a single imputation approach that integrates and extends best practices for mitigating the effects of missing values. Using an array of missing value situations, we illustrate the Regression Imputation Optimizing Sample Size and Emulation (RIOSSE) method. The approach derives an imputation model for each low-sample variable by leveraging information available in large-sample inputs within the same data source. RIOSSE derives imputation equations with two competing goals in mind: 1) statistical power and 2) emulation. Direct comparisons demonstrate that RIOSSE is superior to three prominent multiple imputation methods (K-Nearest Neighbor, missForest, and LASSO) on two criteria for statistical power (parsimoniousness and sample size) and two criteria for emulation (predictiveness and content validity). Further, 5-fold cross-validation confirmed the head-to-head comparisons on these criteria. The paper contributes 1) a description of the RIOSSE method, 2) new imputation performance metrics and visualizations, 3) comparisons of our proposed method to three prominent multiple imputation methods, and 4) specified imputation models for 30 commonly used inputs to firm performance calculations.
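The core idea described above, fitting an ordinary least squares model on the rows where a low-sample variable is observed and using it to fill the rows where it is missing, can be sketched roughly as follows. This is a minimal illustration of generic regression imputation, not the authors' RIOSSE procedure: it omits the model selection step that balances parsimoniousness, sample size, predictiveness, and content validity. The function names, the synthetic data, and the assumption that the predictors are fully observed are all hypothetical.

```python
import numpy as np

def fit_ols_imputer(X, y):
    """Fit OLS on the rows where y is observed (hypothetical helper).

    Returns the coefficient vector with the intercept first.
    """
    observed = ~np.isnan(y)
    design = np.column_stack([np.ones(observed.sum()), X[observed]])
    beta, *_ = np.linalg.lstsq(design, y[observed], rcond=None)
    return beta

def hybrid_measure(X, y, beta):
    """Keep observed values of y and fill missing ones with OLS predictions."""
    design = np.column_stack([np.ones(len(y)), X])
    predicted = design @ beta
    return np.where(np.isnan(y), predicted, y)

# Synthetic stand-in data (assumption): two fully observed "large-sample"
# predictors and one target with roughly 40% of its values missing.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
y_full = 1.5 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)
y = y_full.copy()
y[rng.random(n) < 0.4] = np.nan

beta = fit_ols_imputer(X, y)
hybrid = hybrid_measure(X, y, beta)
print("observed:", int((~np.isnan(y)).sum()), "hybrid sample size:", hybrid.size)
```

The hybrid vector retains every originally observed value and substitutes predictions only where the target was missing, which is how the usable sample size grows without overwriting real data.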

Keywords: Single imputation, Missing data, Sample size, Statistical power

Article history: Received 20 August 2020, Revised 11 June 2021, Accepted 11 June 2021, Available online 17 June 2021, Version of Record 19 October 2021.

Official paper URL: https://doi.org/10.1016/j.dss.2021.113624