313 – Missing Data Methods for Epidemiologic Studies
Imputing Data That Are Missing at High Rates Using a Boosting Algorithm
Katherine Cauthen
Sandia National Laboratories
Gregory Lambert
Apple Inc.
Jaideep Ray
Sandia National Laboratories
Sophia Lefantzi
Sandia National Laboratories
Traditional multiple imputation approaches may perform poorly on datasets with high rates of missingness unless a large number of imputations, m, is used. This paper implements an alternative, machine learning-based approach to imputing data that are missing at high rates. We use boosting to create a strong learner from a weak learner fitted to a dataset in which many observations are missing. The approach can be applied to many types of learners (models). We demonstrate it on a spatiotemporal dataset used to predict dengue outbreaks in India from meteorological covariates. A Bayesian spatiotemporal conditional autoregressive (CAR) model is boosted to produce imputations, and the overall root-mean-square error (RMSE) from a k-fold cross-validation is used to assess imputation accuracy.
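The general idea of boosting a weak learner to impute missing values, then scoring it by k-fold cross-validated RMSE, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the boosting scheme, so the sketch assumes simple residual (L2) boosting, swaps the Bayesian spatiotemporal CAR model for a one-split regression stump, and uses a synthetic one-covariate dataset with roughly 60% of the responses missing. All function and variable names here are hypothetical.

```python
import math
import random

def fit_stump(x, r):
    """Weak learner (assumed stand-in for the CAR model): one-split regression stump."""
    best = None
    cuts = sorted(set(x))
    for lo, hi in zip(cuts, cuts[1:]):
        thr = (lo + hi) / 2
        left = [ri for xi, ri in zip(x, r) if xi <= thr]
        right = [ri for xi, ri in zip(x, r) if xi > thr]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - ml) ** 2 for ri in left) + sum((ri - mr) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, thr, ml, mr)
    if best is None:  # no split possible: fall back to the mean
        mean = sum(r) / len(r)
        return lambda xi: mean
    _, thr, ml, mr = best
    return lambda xi: ml if xi <= thr else mr

def boost(x, y, rounds=100, rate=0.3):
    """Residual (L2) boosting: each round fits the weak learner to current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, resid)
        stumps.append(stump)
        pred = [pi + rate * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: base + rate * sum(s(xi) for s in stumps)

def impute(x, y, **kw):
    """Fit on observed responses only, then fill in the missing ones (None)."""
    obs = [i for i, v in enumerate(y) if v is not None]
    model = boost([x[i] for i in obs], [y[i] for i in obs], **kw)
    return [v if v is not None else model(xi) for xi, v in zip(x, y)]

def kfold_rmse(x, y, k=5, **kw):
    """Overall RMSE across k held-out folds of the observed data."""
    idx = list(range(len(y)))
    random.Random(0).shuffle(idx)
    sq_err = 0.0
    for f in range(k):
        test = idx[f::k]
        held = set(test)
        train = [i for i in idx if i not in held]
        model = boost([x[i] for i in train], [y[i] for i in train], **kw)
        sq_err += sum((y[i] - model(x[i])) ** 2 for i in test)
    return math.sqrt(sq_err / len(y))

# Synthetic data (assumption): one covariate, about 60% of responses missing.
rng = random.Random(1)
x = [i / 10 for i in range(100)]
y = [math.sin(xi) if rng.random() > 0.6 else None for xi in x]

completed = impute(x, y)
x_obs = [xi for xi, v in zip(x, y) if v is not None]
y_obs = [v for v in y if v is not None]
rmse_boosted = kfold_rmse(x_obs, y_obs)
rmse_weak = kfold_rmse(x_obs, y_obs, rounds=1, rate=1.0)  # a single stump
```

Comparing `rmse_boosted` with `rmse_weak` shows how much the boosted ensemble improves on the unboosted weak learner under the same cross-validation folds. In the setting the abstract describes, the stump would be replaced by the spatiotemporal model and the covariate by the meteorological predictors.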