Abstract:
|
Synthetic populations are a useful tool for making predictions when original data sources are restricted or accessible only in aggregated form. Researchers can map variables from data sources onto a synthetic population, resulting in a dataset that contains enough information to support reliable statistical inference with quantifiable uncertainty while still adhering to data privacy restrictions. However, the choice of mapping method can considerably affect the accuracy of the predictions. We describe three methods for linking datasets with synthetic data: resampling, modeling predictors independently, and modeling predictors sequentially. We apply these methods to predict the prevalence of youth vaping in Florida by county and census tract, using the 2018 Florida Youth Substance Abuse Survey (FYSAS) and synthetic records generated from the 5-Year American Community Survey (ACS). We find that resampling and sequential modeling most closely approximate the 2018 survey results, and that the sequential model captures more variability. We discuss opportunities to apply this work in other fields, including restricted-data settings such as health records.
|