Online Program

Friday, May 18
Machine Learning
Statistical Machine Learning Applications in Surveys
Fri, May 18, 3:30 PM - 5:00 PM
Regency Ballroom A
 

Classification and Regression Trees and Forests for Imputing Data from Sample Surveys (304323)


*MoonJung Cho, U.S. Bureau of Labor Statistics 
John Eltinge, U.S. Census Bureau 
Yuanzhi Li, University of Wisconsin-Madison 
Wei-Yin Loh, University of Wisconsin-Madison 

Keywords: incomplete predictor variable, item nonresponse, predicted-mean model, response propensity, U.S. Consumer Expenditure Survey

Analysis of sample survey data often requires adjustments to account for missing values in the outcome variables of principal interest. Standard adjustment methods based on item imputation or on propensity weighting factors rely heavily on the availability of auxiliary variables for both responding and nonresponding units. Applying them can be challenging when the auxiliary variables are numerous and are themselves subject to substantial incomplete-data problems. This paper shows how classification and regression trees and forests can overcome these difficulties and compares them with traditional likelihood methods in terms of estimation bias and mean squared error. The development is centered on a component of income data from the U.S. Consumer Expenditure Survey, which is subject to a relatively high rate of item missingness. Classification tree and forest methods are used to model the unit-level propensity for item missingness in the income component. Regression tree and forest methods are used to model the conditional mean structure of the income component. Both sets of methods are then used to produce estimators of the mean of the income component, adjusted for item nonresponse. Thirteen methods for estimating a population mean are compared in a series of simulation experiments. The results show that if the number of auxiliary variables with missing values is not small, or if those variables have substantial missingness, likelihood methods can be rendered impracticable or even inapplicable. Tree and forest methods are always applicable, are relatively fast, and have higher efficiency than likelihood methods (as much as 35% lower root mean squared error) under real-data situations with incomplete-data patterns similar to those in the above-mentioned survey. Their efficiency loss under conditions ideal for likelihood methods is observed to be between 10% and 25%.
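
The two adjustment routes named in the abstract (response propensity and predicted mean) can be illustrated with a minimal sketch. It is not the authors' implementation and does not reproduce any of the thirteen methods compared. It assumes that a fitted tree has already assigned each unit to a leaf, that the response propensity in a leaf is estimated by the leaf's response rate, that the predicted mean in a leaf is the respondent mean, that every leaf contains at least one respondent, and that survey weights can be ignored. The function names and the toy data are hypothetical.

```python
from collections import defaultdict

def propensity_weighted_mean(y, leaves):
    """Weighted mean of observed y, with weights 1 / (leaf response rate).

    y: list of outcomes, None where the item is missing.
    leaves: leaf label of each unit (assumed to come from a classification tree).
    Assumes every leaf has at least one respondent.
    """
    n_total = defaultdict(int)
    n_resp = defaultdict(int)
    for yi, leaf in zip(y, leaves):
        n_total[leaf] += 1
        if yi is not None:
            n_resp[leaf] += 1
    num = den = 0.0
    for yi, leaf in zip(y, leaves):
        if yi is not None:
            weight = n_total[leaf] / n_resp[leaf]  # inverse estimated propensity
            num += weight * yi
            den += weight
    return num / den

def imputed_mean(y, leaves):
    """Mean of y after filling each missing value with its leaf's respondent mean.

    leaves: leaf label of each unit (assumed to come from a regression tree).
    Assumes every leaf has at least one respondent.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for yi, leaf in zip(y, leaves):
        if yi is not None:
            sums[leaf] += yi
            counts[leaf] += 1
    filled = [yi if yi is not None else sums[leaf] / counts[leaf]
              for yi, leaf in zip(y, leaves)]
    return sum(filled) / len(filled)

# Hypothetical toy data: six units, two with missing income.
income = [10.0, None, 30.0, 40.0, None, 60.0]
propensity_leaves = ["a", "a", "b", "b", "b", "b"]  # hypothetical classification-tree leaves
mean_leaves = ["lo", "lo", "hi", "hi", "hi", "hi"]  # hypothetical regression-tree leaves

ipw_estimate = propensity_weighted_mean(income, propensity_leaves)
imputation_estimate = imputed_mean(income, mean_leaves)
```

In this toy example the two partitions happen to group the units the same way, so the two estimates coincide; in general a classification tree for missingness and a regression tree for the outcome would produce different leaves and different estimates.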