Abstract:
|
Statistical disclosure control (SDC) refers to data-masking techniques used to anonymize survey or record data, which often removes barriers to public release of the data. Rubin (1993) proposed using multiple imputation to create multiple synthetic samples of the population in which subjects are de-identified yet valid statistical inference remains possible. Within this multiple imputation framework, we propose using statistical modeling and machine learning tools to create or impute the synthetic datasets. We will evaluate spline models, kernel regression, and regression trees by comparing the synthetic data with the actual data in terms of correlation, concordance, and inference validity. We will use both simulated and real data for this evaluation.
|