Abstract:
|
Big data and analytics foster new knowledge, but they also challenge producers of sensitive data, who must protect confidentiality in publicly accessible files. Making sensitive data publicly available involves tradeoffs, such as the risk of re-identification. One strategy to reduce that risk is to release synthetic or partially synthetic data. However, such releases could distort the true underlying data. This presentation discusses analyses assessing the extent of that distortion. The Data Linkage Program at the National Center for Health Statistics recently released partially synthetic public-use linked mortality files. To create the public-use version of the restricted-use file, a re-identification risk scenario was evaluated to identify records at risk of disclosure, and values for select records were then perturbed. To demonstrate the comparability of the public-use and restricted-use versions of the linked mortality files, relative hazards for all-cause and cause-specific mortality were estimated. The results reveal key analytical considerations and underscore the importance of such work in the context of data quality.
|