Performance Comparison of Imputation Methods for Mixed Data Missing at Random with Small and Large Sample Data Set with Different Variability
Asian Journal of Probability and Statistics, Pages 16-39
DOI: 10.9734/ajpas/2022/v20i2416
Abstract
Missing data are a common concern in statistics because they can bias parameter estimates and lead to inaccurate results. Multiple imputation is one remedy for handling missing data. This study examined which multiple imputation methods perform best on mixed-variable datasets with different sample sizes, different variability, and different levels of missingness. Three imputation methods were compared: predictive mean matching, classification and regression trees, and random forest. For each dataset, the multiple regression parameter estimates obtained from the complete dataset were compared with those obtained from the imputed dataset. The results showed that the random forest imputation method performed best in most cases with a sample size of 500, irrespective of the variability. The classification and regression trees imputation method performed best in most cases with a sample size of 30, irrespective of the variability.
Keywords:
- Predictive mean matching
- Classification and regression trees
- Random forest
- Multiple imputation by chained equations
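
The comparison described in the abstract (fit a regression on the complete data, remove values under a missing-at-random mechanism, impute, refit, and compare the estimates) can be illustrated with a minimal Python sketch. This is not the authors' code, and the abstract does not state the software or settings used in the study. The simulated variables, the rank-based missingness rule, the bootstrap variant of predictive mean matching, and the function names `ols`, `pmm_impute`, and `run` are all assumptions made for illustration. Only predictive mean matching is sketched; the tree-based methods are not shown.

```python
import numpy as np


def ols(X, y):
    """Return OLS coefficients (intercept first) for y regressed on X."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta


def pmm_impute(target, predictors, rng, k=5):
    """Simplified predictive mean matching (illustrative, bootstrap variant).

    Missing entries of `target` are replaced by observed values borrowed
    from one of the k donors whose predicted mean is closest.
    """
    miss = np.isnan(target)
    obs_idx = np.flatnonzero(~miss)
    # Refit on a bootstrap sample so each imputation reflects parameter
    # uncertainty (an assumption standing in for a Bayesian draw).
    boot = rng.choice(obs_idx, size=obs_idx.size, replace=True)
    beta_hat = ols(predictors[obs_idx], target[obs_idx])
    beta_star = ols(predictors[boot], target[boot])
    design = np.column_stack([np.ones(len(target)), predictors])
    pred_obs = design[obs_idx] @ beta_hat
    pred_mis = design[miss] @ beta_star
    filled = target.copy()
    for row, p in zip(np.flatnonzero(miss), pred_mis):
        donors = obs_idx[np.argsort(np.abs(pred_obs - p))[:k]]
        filled[row] = target[rng.choice(donors)]
    return filled


def run(n=500, rate=0.2, m=5, seed=0):
    """Simulate mixed data, impose MAR missingness, impute m times, pool."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)                        # continuous, fully observed
    g = rng.integers(0, 2, size=n).astype(float)   # binary, fully observed
    x2 = 0.5 * x1 + 0.8 * g + rng.normal(size=n)   # continuous, will lose values
    y = 1.0 + 2.0 * x1 - 1.0 * x2 + 0.5 * g + rng.normal(size=n)

    # Estimates from the complete data serve as the benchmark.
    full = ols(np.column_stack([x1, x2, g]), y)

    # MAR: the chance that x2 is missing depends only on the observed x1.
    ranks = np.argsort(np.argsort(x1))
    p_miss = 2 * rate * ranks / (n - 1)            # averages to `rate`
    x2_mis = np.where(rng.random(n) < p_miss, np.nan, x2)

    # Impute m times and average the coefficient estimates across imputations.
    estimates = []
    for _ in range(m):
        x2_imp = pmm_impute(x2_mis, np.column_stack([x1, g, y]), rng)
        estimates.append(ols(np.column_stack([x1, x2_imp, g]), y))
    pooled = np.mean(estimates, axis=0)
    return full, pooled, x2_mis


if __name__ == "__main__":
    for n in (30, 500):
        full, pooled, _ = run(n=n)
        print(n, np.round(full, 3), np.round(pooled, 3))
```

Calling `run` with a different `n` or `rate` mimics varying the sample size and the level of missingness; the gap between the complete-data and pooled coefficients is the quantity being compared.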