Abstract:
|
Statistical agencies aim to release informative data to the public while avoiding the disclosure of respondents' information, which requires more than removing direct identifiers. Usually, a perturbed data set is generated from the original data using statistical disclosure avoidance techniques and then released. However, measuring disclosure risk is difficult, as disclosure can occur in many different forms and depends on the behavior of intruders. As a result, the tasks of designing the perturbation mechanism and assessing the utility of the perturbed data are quite challenging. In this paper we propose a novel and rigorous measure of identification disclosure risk and use it to articulate clear and realistic disclosure control goals. We then present unbiased post-randomization methods for achieving those goals; specifically, the probability of correct identification of any sample unit will not exceed a pre-chosen value. We also assess the utility of the perturbed data and show that the added variance due to our perturbation procedure is negligible compared to the sampling variance. Finally, as an illustrative example, we apply our procedure to a public use micro sample released by the US Census Bureau.
|