Online Program

Thursday, February 19
PS1 Poster Session 1 & Opening Mixer Thu, Feb 19, 5:30 PM - 7:00 PM
Napoleon AB

Simulating Confidential Epidemiological Data Sets (303043)

*Ragheed Fadhil Al-Dulaimi, Hunter College 
Levi Waldron, City University of New York 

Keywords: Simulation, R programming, secure data, epidemiological data

Epidemiological data sets containing personally identifiable information often must be stored in secure, tightly controlled environments to protect subject confidentiality. These data sets may be complex in structure and may not be fully available until final collection and cleaning, delaying code development and data analysis. Furthermore, collaboration across multiple research centers may make development of a detailed data analysis plan difficult, especially when data access is limited to one site. We present an R package, “episim,” that generates simulations of such complex data sets while mimicking their summary statistics and idiosyncrasies. The package generates categorical variables with matching prevalences, continuous variables with matching quantiles, missing data, transformed variables such as discretized versions of continuous variables, and categorical variables with re-aggregated bins. Using a simple Excel spreadsheet as input, it facilitates simulation of a wide range of study designs and variable types by users with minimal programming skills.
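
The kinds of variables listed above can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the episim package or its interface: every function name, the prevalences, the quantiles, the missingness rate, and the cut points are hypothetical. It assumes that only summary statistics (category prevalences and a few quantiles) leave the secure environment, and it uses linear interpolation between quantiles as one simple way to match them.

```python
import bisect
import random


def simulate_categorical(prevalences, n, rng):
    """Draw n category labels with probabilities given by `prevalences`."""
    levels = list(prevalences)
    weights = [prevalences[level] for level in levels]
    return rng.choices(levels, weights=weights, k=n)


def simulate_continuous(probs, quantiles, n, rng):
    """Inverse-CDF sampling, interpolating linearly between known quantiles.

    `probs` and `quantiles` are parallel, increasing lists, e.g. the
    minimum, quartiles, and maximum of the confidential variable.
    """
    values = []
    for _ in range(n):
        u = rng.uniform(probs[0], probs[-1])
        # Locate the quantile segment that contains u, then interpolate.
        for i in range(len(probs) - 1):
            if u <= probs[i + 1]:
                frac = (u - probs[i]) / (probs[i + 1] - probs[i])
                step = quantiles[i + 1] - quantiles[i]
                values.append(quantiles[i] + frac * step)
                break
    return values


def add_missing(values, rate, rng):
    """Replace each value with None independently with probability `rate`."""
    return [None if rng.random() < rate else v for v in values]


def discretize(values, cut_points, labels):
    """Bin continuous values; len(labels) must equal len(cut_points) + 1."""
    return [
        None if v is None else labels[bisect.bisect_right(cut_points, v)]
        for v in values
    ]


# Hypothetical summary statistics, standing in for values exported from
# the secure environment.
rng = random.Random(2015)
n = 1000
smoking = simulate_categorical(
    {"never": 0.60, "former": 0.25, "current": 0.15}, n, rng
)
age = simulate_continuous(
    [0.0, 0.25, 0.5, 0.75, 1.0], [18, 30, 42, 55, 90], n, rng
)
age = add_missing(age, 0.05, rng)
age_group = discretize(age, [30, 55], ["<30", "30-54", "55+"])
```

Because the simulated records are drawn only from summary statistics, analysis code can be developed and shared across sites against the simulated table and later run unchanged on the confidential data.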