All Times EDT

Abstract Details

Activity Number: 204 - Experimental Design
Type: Contributed
Date/Time: Tuesday, August 4, 2020 : 10:00 AM to 2:00 PM
Sponsor: Section on Statistical Learning and Data Science
Abstract #313681
Title: The Future Is Linked: Making Predictions with Data Sets Linked to Synthetic Populations
Author(s): Emily Hadley* and Caroline Kery and Georgiy Bobashev and Lauren Grattan
Companies: RTI International and RTI International and RTI International and RTI International
Keywords: Data Science; Synthetic Populations; Data Linkage; Data Privacy; Variable Selection

Synthetic populations are a useful tool to make predictions when original data sources are restricted or only accessible in an aggregated format. Researchers can map variables from data sources onto a synthetic population, resulting in a dataset that contains information sufficient to produce reliable statistical inference with quantifiable uncertainty while still adhering to data privacy restrictions. However, the choice of method to map the variables can considerably impact the accuracy of the predictions. We describe three methods for linking datasets with synthetic data: resampling, modeling predictors independently, and modeling predictors sequentially. We apply these methods to the prediction of the prevalence of Florida youth vaping by county and census tract using the 2018 Florida Youth Substance Abuse Survey (FYSAS) and synthetic records generated from the 5-Year American Community Survey (ACS). We find that resampling and sequential modeling most closely approximate the 2018 survey results, and that the sequential model captures more variability. We discuss opportunities to apply this work in other fields, including restricted settings like health records.
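The abstract names the three linking methods but does not define them, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes the survey and the synthetic population share demographic cells (here age group and sex), and that two survey-only predictors need to be attached to each synthetic person. The toy records, the predictor names (alcohol and tobacco use), and all function names are hypothetical illustrations and are not taken from FYSAS or the ACS.

```python
import random
from collections import defaultdict

# Hypothetical toy survey rows: (age_group, sex, uses_alcohol, uses_tobacco).
# In this invented data the two predictors are perfectly correlated.
SURVEY = [
    ("11-14", "F", 0, 0), ("11-14", "F", 1, 1),
    ("11-14", "F", 0, 0), ("11-14", "F", 1, 1),
    ("15-17", "M", 1, 1), ("15-17", "M", 0, 0),
    ("15-17", "M", 1, 1), ("15-17", "M", 1, 1),
]


def group_by_cell(survey):
    """Group survey-only predictors by the demographic cell shared with the synthetic data."""
    cells = defaultdict(list)
    for age_group, sex, alcohol, tobacco in survey:
        cells[(age_group, sex)].append((alcohol, tobacco))
    return cells


def link_by_resampling(cell, cells, rng):
    """Copy both predictors jointly from one random respondent in the same cell."""
    return rng.choice(cells[cell])


def link_independently(cell, cells, rng):
    """Draw each predictor from its own within-cell rate, ignoring the other predictor."""
    rows = cells[cell]
    p_alcohol = sum(a for a, _ in rows) / len(rows)
    p_tobacco = sum(t for _, t in rows) / len(rows)
    return (int(rng.random() < p_alcohol), int(rng.random() < p_tobacco))


def link_sequentially(cell, cells, rng):
    """Draw the first predictor, then draw the second conditional on the first."""
    rows = cells[cell]
    p_alcohol = sum(a for a, _ in rows) / len(rows)
    alcohol = int(rng.random() < p_alcohol)
    # Restrict to respondents whose first predictor matches the drawn value.
    matching = [t for a, t in rows if a == alcohol]
    p_tobacco = sum(matching) / len(matching)
    return (alcohol, int(rng.random() < p_tobacco))


def link_population(synthetic_cells, method, survey=SURVEY, seed=0):
    """Attach predictors to every synthetic person using the chosen method."""
    rng = random.Random(seed)
    cells = group_by_cell(survey)
    return [method(cell, cells, rng) for cell in synthetic_cells]


if __name__ == "__main__":
    # Hypothetical synthetic population: only demographic cells are known.
    synthetic = [("11-14", "F")] * 5 + [("15-17", "M")] * 5
    for method in (link_by_resampling, link_independently, link_sequentially):
        print(method.__name__, link_population(synthetic, method))
```

In this reading, resampling and sequential modeling both preserve the within-cell relationship between the two predictors, while independent modeling discards it. That difference is one possible reason the choice of mapping method could affect downstream predictions, though the abstract itself does not give the mechanism.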

Authors who are presenting talks have a * after their name.

Back to the full JSM 2020 program