Abstract:
|
Real-world data sets are often withheld for privacy and confidentiality reasons in applications such as health care, official statistics, and human rights conflicts. Instead, synthetic versions of these data sets are released that satisfy a privacy guarantee, such as differential privacy, while preserving data utility. Such synthetic data sets are often analyzed with record linkage algorithms, for example to estimate the number of people in a sample or population. This motivates an open question: given a synthetic data set produced by an unknown privacy algorithm, does the synthetic data set (under some privacy setting) still retain data utility? In this talk, we critically assess performance bounds based on the Kullback-Leibler (KL) divergence under a general record linkage framework to provide guidance in privacy settings. Specifically, we provide an upper bound using the KL divergence and a lower bound on the minimum probability of misclassifying a latent entity. Using simulated data, we give insight into when our bounds hold and discuss potential privacy implications for synthetic data release.
|