Abstract:
|
Sharing synthetic data instead of the original data is now a relatively common way of protecting the confidentiality of the individuals appearing in a dataset. Research on how best to generate these synthetic datasets is growing fast, and proposals now include various joint modeling approaches, fully conditional modeling strategies, and complex deep learning methods. But measuring the confidentiality guarantees of such synthetic datasets remains tricky. Some measures, such as differential privacy, can be quantified a priori and relate to the process by which the synthetic dataset will be created. Other measures are computed post hoc on the datasets to be released. Some of those, such as the Bayesian risk measure proposed by Reiter and collaborators, take into account the process by which the synthetic data were generated; others, such as the CAP statistic, do not. In this presentation, I will give an overview of the different risk measures proposed for synthetic data and discuss how they relate to each other and what they really measure. This will lead us to ponder the fine line between inference and inferential disclosure.
|