Online Program

All Times ET

Program is Subject to Change

Tuesday, June 15
Tue, Jun 15, 9:30 AM - 11:00 AM
TBD
Much Ado About Nothing: The Problem of Missing Data (And Some Ways to Handle It)

Nearest Neighbor Multiple Imputation: Problems and Potential Solutions (307998)

*Rebecca Andridge, The Ohio State University 
Katherine Jenny Thompson, U.S. Census Bureau 

Keywords: nearest neighbor, multiple imputation, hot deck, approximate Bayesian bootstrap

Missing data is an inevitability in large-scale surveys. In establishment statistics, sampled units can often easily provide key items such as payroll and revenue totals, but either cannot or choose not to provide more detailed items, such as the breakdown of revenue by product type. For this type of item nonresponse, imputation is an appealing option. The U.S. Census Bureau has historically used nearest neighbor or hot deck imputation for many types of establishment data, since measures such as payroll and revenue tend to be highly skewed, and these methods remove the need to specify a parametric imputation model for these values. Recently, these methods have been used for multiple imputation, enabling simple variance estimation via Rubin’s combining rules. The Approximate Bayesian Bootstrap (ABB) is an attractive and simple-to-implement algorithm that can be used to make hot deck methods “proper” for multiple imputation. Essentially, the responding units are bootstrapped before donors are selected, so that the set of possible donors for a nonrespondent varies across multiply imputed datasets. In concept, the ABB should work for nearest neighbor multiple imputation; bootstrapping the respondents would mean that each nonrespondent’s one “closest” donor will not be available for every imputation. However, we show that this is not the case: nearest neighbor multiple imputation with the ABB does not produce valid variance estimates when using Rubin’s combining rules. In fact, variances are underestimated, and the underestimation becomes more severe as the relationship between the outcome being imputed and the auxiliary variable used to measure closeness becomes stronger. We illustrate the problem via simulation and through application to Economic Census data. We provide some guidance on alternative versions of nearest neighbor multiple imputation that may be used to overcome this problem.
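
For readers unfamiliar with the procedure, the following is a minimal sketch of nearest neighbor multiple imputation with the ABB, followed by Rubin's combining rules. It is not the authors' implementation. The function names (`abb_nn_impute`, `rubin_combine`, `mi_mean`) are hypothetical, and the sketch assumes a single auxiliary variable `x`, absolute distance as the closeness measure, and a simple-random-sample variance for the mean. This is the procedure the abstract reports as underestimating variance, so it is shown for illustration only.

```python
import numpy as np

def abb_nn_impute(x, y, observed, rng):
    """One ABB nearest neighbor imputation of y using auxiliary variable x."""
    donors = np.flatnonzero(observed)
    # ABB step: bootstrap the respondents to form this imputation's donor pool.
    pool = rng.choice(donors, size=donors.size, replace=True)
    y_imp = y.copy()
    for i in np.flatnonzero(~observed):
        # Nearest donor within the bootstrapped pool (absolute distance on x).
        nearest = pool[np.argmin(np.abs(x[pool] - x[i]))]
        y_imp[i] = y[nearest]
    return y_imp

def rubin_combine(estimates, variances):
    """Rubin's combining rules for M point estimates and their variances."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = estimates.size
    q_bar = estimates.mean()        # combined point estimate
    u_bar = variances.mean()        # within-imputation variance
    b = estimates.var(ddof=1)       # between-imputation variance
    total = u_bar + (1 + 1 / m) * b
    return q_bar, total

def mi_mean(x, y, observed, m=10, seed=0):
    """Multiply impute y and combine the M estimates of its mean."""
    rng = np.random.default_rng(seed)
    ests, variances = [], []
    for _ in range(m):
        y_imp = abb_nn_impute(x, y, observed, rng)
        ests.append(y_imp.mean())
        # Assumed simple-random-sample variance of the mean.
        variances.append(y_imp.var(ddof=1) / y_imp.size)
    return rubin_combine(ests, variances)
```

Calling `mi_mean(x, y, observed, m=10)` with `y` set to NaN where `observed` is False returns the combined mean and its total variance. When `x` strongly predicts `y`, the bootstrapped nearest donors differ little across imputations, so the between-imputation term `b` stays small, which is consistent with the underestimation described above.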