Abstract:
|
When generating synthetic data for public release, the synthesis model must be chosen carefully, since only those features that are incorporated in the model will be reflected in the generated data. If the dataset has a longitudinal structure, it is not obvious which synthesis model should be used to account for the design. In the context of multiple imputation for missing data, it has previously been shown that employing fixed effects at the imputation stage may adversely affect inferences obtained by an analyst who wishes to use random effects to account for the hierarchy, and vice versa. Since it is generally unknown which model users of the data will prefer, the synthesis model should suit both analysis models. We evaluate several strategies for generating longitudinal synthetic datasets using extensive simulation studies. In our evaluations, we consider both the analytical validity and the risk of disclosure resulting from the different synthesis strategies. We find that synthesis models that cannot be classified as pure random or fixed effects models should be preferred. We illustrate our findings using data from the German IAB Establishment Panel.
|