Abstract:
|
Currently, there are no reliable and effective mechanisms for sharing clinical data that is free of clearly identifiable protected health information (PHI) without altering the data structure. We developed a novel statistical protocol, DataSifter, for de-identification of structured clinical data. It promotes open science by enabling health system administrators to share clinical data requested by researchers. The method performs iterative data manipulation that stochastically selects, nullifies, imputes, and exchanges feature values among subjects. The process relies heavily on non-parametric imputation for mixed-type data to preserve the joint distribution. At each step, DataSifter generates a complete dataset that closely resembles the original cohort at the population level, while the feature values of each individual are substantially altered. Because the meta-data for all subjects is repeatedly transformed, the procedure drastically reduces the risk of subject re-identification by stratification, while still preserving the overall population characteristics and data structure. Validation of DataSifter on simulated data and electronic health record (EHR) case studies yielded promising results in terms of both privacy protection and inference reliability.
|
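The select, nullify, impute, and exchange cycle described in the abstract can be illustrated with a minimal sketch. This is not the DataSifter implementation: the function name `sift` and the parameters `frac_null` and `frac_swap` are hypothetical, the data is assumed to be purely numeric, and a simple column mean stands in for the non-parametric mixed-type imputation the method actually relies on.

```python
import random
from statistics import mean


def sift(rows, frac_null=0.2, frac_swap=0.2, seed=0):
    """Toy obfuscation pass over numeric records (a list of equal-length lists).

    Hypothetical illustration only; not the authors' algorithm.
    """
    rng = random.Random(seed)
    out = [list(r) for r in rows]  # work on a copy, leave the input untouched
    n, p = len(out), len(out[0])

    # 1) Stochastically select cells and nullify them.
    for i in range(n):
        for j in range(p):
            if rng.random() < frac_null:
                out[i][j] = None

    # 2) Impute the nullified cells. ASSUMPTION: the column mean is a
    #    stand-in for non-parametric imputation of mixed-type data.
    for j in range(p):
        observed = [r[j] for r in out if r[j] is not None]
        # If every cell in the column was nullified, fall back to the
        # mean of the original column.
        fill = mean(observed) if observed else mean(r[j] for r in rows)
        for r in out:
            if r[j] is None:
                r[j] = fill

    # 3) Exchange feature values between randomly chosen subjects.
    for j in range(p):
        for i in range(n):
            if rng.random() < frac_swap:
                k = rng.randrange(n)
                out[i][j], out[k][j] = out[k][j], out[i][j]

    return out


if __name__ == "__main__":
    cohort = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
    print(sift(cohort, frac_null=0.3, frac_swap=0.3, seed=1))
```

The output has the same shape as the input and contains no missing values, which mirrors the claim that each step yields a complete dataset with the original structure. Repeating the pass with larger `frac_null` and `frac_swap` would alter individual records more strongly, at the cost of fidelity to the original cohort.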