Abstract:
|
The utility of a data set that has been altered to preserve confidentiality can be assessed by general or specific measures. The former summarize differences between the distributions of the real and altered data, while the latter compare results from particular analyses run on the two data sets. We extend previous work on utility for the specific case of synthetic data and illustrate our measures with two examples in which real data are synthesized. The methods are tailored to improve usability for researchers seeking to generate analytically useful synthetic data, and all methods in this paper are implemented in the synthpop package in R. Our extension includes a new statistic, the adjusted propensity mean squared error, which involves (i) derivation of the statistic's null expected value and standardization by it, (ii) the use of non-parametric CART models to estimate propensity score values, and (iii) the use of the entire data set, rather than only the changed variables, in computing the utility measures. For specific utility, we use the confidence interval overlap percentage and introduce standardized measures for improved utility estimation under certain analyses.
|