Keywords: Health, Machine learning, Online survey, Probability panel, Prediction
Sample surveys are a major approach for collecting information from target populations, with the aim of producing reliable estimates, informing policy decisions, and providing data for scientific research. In the past decade or so, online (web) surveys have become an important data collection tool because they are timelier and less costly than traditional methods (e.g., in-person interviews). These advantages have become even more salient during the COVID-19 pandemic. The Research and Development Survey (RANDS) is a series of online health surveys based on probability-sampled panels, conducted by the National Center for Health Statistics, part of the U.S. Centers for Disease Control and Prevention.
Machine learning (ML) methods play a prominent role in the era of data science. Although online surveys are increasingly popular for data collection and traditional analyses, ML is still rarely applied to them. In this research, we investigate the utility of established ML methods when applied to online surveys, using RANDS as an illustrative example. Like many established national surveys, RANDS uses a complex survey design (e.g., sampling strata and clusters as well as unequal sampling weights) and is subject to survey nonresponse (i.e., missing data). RANDS includes a wide variety of health and health care-related variables (e.g., health insurance coverage, diagnosed diabetes, telemedicine usage during the COVID-19 pandemic). We evaluate the performance of a variety of ML methods (e.g., regularized regressions, tree-based methods, deep learning) for predicting important health outcomes (e.g., body mass index). We aim to develop practical ML strategies and guidelines for practitioners that appropriately account for the important yet complex data features of online surveys.
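As a rough illustration of what accounting for unequal sampling weights can look like in a regularized regression, the sketch below fits a one-predictor ridge model in which each respondent's weight scales its contribution to the loss, and then scores the fit with a weighted mean squared error. This is a minimal sketch under stated assumptions, not the method used in this research: the function names, the penalty value, and the toy numbers are all hypothetical, and strata, clusters, and missing data are ignored.

```python
# Hypothetical sketch: survey-weighted ridge regression with one predictor.
# All names and data here are illustrative assumptions, not taken from RANDS.

def weighted_mean(values, weights):
    """Weighted average of values, using survey weights."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total


def fit_weighted_ridge(x, y, weights, penalty=0.0):
    """Minimize sum_i w_i * (y_i - a - b * x_i)^2 + penalty * b^2.

    The intercept a is left unpenalized. Returns (intercept, slope).
    """
    x_bar = weighted_mean(x, weights)
    y_bar = weighted_mean(y, weights)
    sxy = sum(w * (xi - x_bar) * (yi - y_bar)
              for xi, yi, w in zip(x, y, weights))
    sxx = sum(w * (xi - x_bar) ** 2 for xi, w in zip(x, weights))
    slope = sxy / (sxx + penalty)
    intercept = y_bar - slope * x_bar
    return intercept, slope


def weighted_mse(y, predictions, weights):
    """Weighted mean squared error, so the score reflects the weighted sample."""
    squared_errors = [(yi - pi) ** 2 for yi, pi in zip(y, predictions)]
    return weighted_mean(squared_errors, weights)


# Toy data (made up): a predictor, an outcome, and unequal sampling weights.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
weights = [0.5, 1.0, 2.0, 1.5]

intercept, slope = fit_weighted_ridge(x, y, weights, penalty=0.0)
predictions = [intercept + slope * xi for xi in x]
error = weighted_mse(y, predictions, weights)

# A positive penalty shrinks the slope toward zero.
_, shrunk_slope = fit_weighted_ridge(x, y, weights, penalty=1.0)
```

Using the weights in both the fit and the evaluation metric keeps the two consistent; whether that is the right choice for a given prediction task is one of the questions such guidelines would need to address.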