Keywords: U.S. Census Bureau, machine learning, predictive models
The Annual Capital Expenditures Survey (ACES) provides detailed and timely information on capital investment in structures and equipment by nonfarm businesses during the year. The data are used to improve the quality of economic indicators of business investment, as well as estimates of gross domestic product. Studies conducted by the U.S. Census Bureau have assessed procedures and targeted areas for improvement in Economic Directorate survey processing. Recent research has shown that machine learning can effectively reduce the workload of analysts who review and edit write-in responses. SABLE (Scraping Assisted By LEarning) is a tool developed by the U.S. Census Bureau that classifies write-in responses in the “Other” category into the labels Structures, Equipment, and Not Applicable. The tool deploys a logistic regression model into a production system for the ACES 2018 survey year. Performance metrics such as classification accuracy are essential for assessing the utility of this classifier, and both k-fold and leave-one-out cross-validation are widely used methods for evaluating classification algorithm performance. This work compares the two validation schemes in the context of building effective machine learning models from text-based data, with the goal of drawing insights from our data to inform best practices for choosing a cross-validation method.
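The comparison described above can be illustrated with a minimal sketch. This is not the Census Bureau's production code; the scikit-learn pipeline, the tiny set of hypothetical write-in responses, and the fold count are all assumptions made for illustration. It estimates the accuracy of a logistic regression text classifier under both k-fold and leave-one-out cross-validation:

```python
# Illustrative sketch only (not SABLE itself): compare k-fold and
# leave-one-out cross-validation accuracy for a logistic regression
# classifier on hypothetical "Other" write-in responses labeled
# Structures / Equipment / Not Applicable.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical training data for illustration.
texts = [
    "new warehouse construction", "office building renovation",
    "parking lot paving", "factory floor expansion",
    "forklift purchase", "delivery truck fleet",
    "computer servers", "drill press and lathe",
    "legal fees", "employee training",
    "consulting services", "annual insurance premium",
]
labels = ["Structures"] * 4 + ["Equipment"] * 4 + ["Not Applicable"] * 4

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# k-fold: k accuracy estimates, each computed on a held-out fold of
# roughly n/k responses.
kfold_acc = cross_val_score(
    model, texts, labels,
    cv=KFold(n_splits=4, shuffle=True, random_state=0),
    scoring="accuracy",
).mean()

# leave-one-out: n accuracy estimates, each computed on a single
# held-out response (k-fold with k = n).
loo_acc = cross_val_score(
    model, texts, labels, cv=LeaveOneOut(), scoring="accuracy"
).mean()

print(f"4-fold accuracy: {kfold_acc:.2f}")
print(f"LOO accuracy:    {loo_acc:.2f}")
```

Leave-one-out fits one model per observation, so on a survey-scale corpus it is far more expensive than k-fold; the trade-off between that cost and the lower bias of its accuracy estimate is the kind of consideration this comparison addresses.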