Abstract:
|
Data privacy is a difficult and poorly understood problem that is further exacerbated by the vast amounts of data being collected, stored, analyzed, and shared across multiple domains in the age of "big data". Differential privacy (DP) has emerged as a key rigorous definition of privacy, or more specifically, as a framework that can inform the design of privacy mechanisms with a specified disclosure risk in the presence of arbitrary external information, but it is often criticized for a lack of utility. We present two algorithms for releasing sparse high-dimensional histograms in a differentially private manner. These algorithms are designed as one step of synthetic data generation, a methodology for reducing the risk of disclosing sensitive information when data are released to the public. To ensure privacy, every entry in a histogram must be perturbed with noise, which is inefficient when the dimension is high. Our algorithms improve efficiency for sparse high-dimensional histograms by working only with the non-zero entries and a small proportion of the zero entries. We also explore the tradeoff between the utility and privacy of the resulting synthetic data.
|