Online Program

All Times ET

Program is Subject to Change

Wednesday, June 16
Wed, Jun 16, 10:30 AM - 12:00 PM
TBD
Dealing with Missing and Erroneous Data in Establishment Statistics

Imputation Strategies for Establishment Surveys Using Random Forest Models (308035)

Darcy Miller, USDA, NASS 
Benjamin Reist, NASA 
Jennifer Rhorer, USDA, NASS 
*Tyler Wilson, USDA, NASS 

Keywords: Machine Learning, Imputation, Random Forests, Item Nonresponse

Every two years, the National Agricultural Statistics Service (NASS) includes a Computer and Internet Use section in its annual June Area Survey (JAS). In 2017, there were 14 questions involving internet, computer, and smartphone use. All but one question in 2017 used a ‘Yes’, ‘No’, or ‘Don’t Know’ (DK) response format. These questions suffer from high levels of item nonresponse and DK responses. In 2019, it was determined that a new edit and imputation procedure using machine learning techniques could impute more than 1,800 missing or DK values for each question and thereby improve estimates. The edit and imputation process was aided by a classification model that determined whether a DK is most likely a Yes or a No for the question asking whether a farm operator had internet access. Three types of models were evaluated: Random Forest, Bootstrap Forest, and Boosted Forest. The model type with the lowest misclassification rate and best overall fit was used to impute DK responses in the summary process. The model was trained using responses to the internet use question in the 2017 Census of Agriculture (COA) as the outcome variable. This question, similar to several of the questions within the JAS, was assumed to be a valid proxy for determining whether a response to the internet access question on the JAS is likely Yes or No. More than 400 independent variables were examined in this research, including data from NASS, the U.S. Census Bureau, and the Federal Communications Commission. The propensities from the model were used as parameters of Bernoulli distributions, from which draws were made to impute Yes or No for each record with a missing or DK response. Records that could not be linked to the COA were imputed by hot deck, with imputation cells based on variables found within the final summary. The impact of this imputation process on the JAS will be discussed.
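
The Bernoulli imputation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `impute_yes_no`, the input layout, and the example propensities are assumptions made for illustration. The propensities are taken to be the probability of Yes, which in practice would come from the fitted classification model.

```python
import random


def impute_yes_no(responses, propensities, seed=0):
    """Impute missing or DK responses with Bernoulli draws.

    responses    : list of "Yes", "No", "DK", or None (None = missing)
    propensities : list of assumed model probabilities that the true answer is Yes
    seed         : seed for the random number generator (hypothetical default)
    """
    rng = random.Random(seed)
    imputed = []
    for response, p_yes in zip(responses, propensities):
        if response in ("Yes", "No"):
            # Reported substantive answers are kept unchanged.
            imputed.append(response)
        else:
            # Bernoulli(p_yes) draw: Yes with probability p_yes, otherwise No.
            imputed.append("Yes" if rng.random() < p_yes else "No")
    return imputed


# Hypothetical records; the propensities are invented for illustration only.
reported = ["Yes", "DK", None, "No", "DK"]
p_yes = [0.9, 1.0, 0.0, 0.2, 0.5]
print(impute_yes_no(reported, p_yes))
```

Drawing from a Bernoulli distribution, rather than assigning the most likely class, keeps the expected share of imputed Yes values equal to the mean propensity among the imputed records.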