Online Program

A Risk-based Methodology to De-identify Protected Health Information for the Heritage Health Prize

*Luk Arbuckle, CHEO Research Institute
Khaled El Emam, CHEO Research Institute
Ben Eze, Privacy Analytics
Jonathan Gluck, Heritage Provider Network
Jeremy Howard, Kaggle
Gunes Koru, University of Maryland
Lisa Gaudette, Privacy Analytics
Emilio Neri, CHEO Research Institute
Sean Rose, Privacy Analytics

Keywords: re-identification, risk assessment, longitudinal, medical data, data disclosure, privacy

Under the US Health Insurance Portability and Accountability Act (HIPAA), Protected Health Information (PHI) may be publicly disclosed without patient consent if it is de-identified using accepted statistical methods that manage the risk of individual re-identification. The Heritage Provider Network (HPN), a provider of health care services in California, initiated the Heritage Health Prize (HHP) competition “to develop a predictive algorithm that can identify patients who will be admitted to the hospital within the next year, using historical claims data”. However, the complex longitudinal data that HPN supplied for the HHP competition required the development of new methods to assess the risk of re-identification. Five plausible re-identification attacks on this data were identified, and the probability of re-identification was evaluated for each. A de-identification algorithm was applied when the risk of re-identification was found to be above a pre-defined threshold. The final HHP competition dataset had a very small risk of re-identification, and was robust to violations of initial assumptions.
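The threshold-driven loop described in the abstract can be illustrated with a minimal sketch. It is not the method used for the HHP dataset: the abstract does not state the risk metric, the attacks, or the de-identification algorithm. The sketch assumes that risk is measured as one over the size of the smallest group of records sharing the same quasi-identifier values, and that de-identification is done by coarsening age into wider bins. The function names, the field names (age, sex), the bin widths, and the threshold of 0.5 are all hypothetical.

```python
from collections import Counter


def max_reid_risk(records, quasi_identifiers):
    """Assumed risk metric: 1 / (size of the smallest equivalence class)."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return 1.0 / min(classes.values())


def generalize_age(record, width):
    """Hypothetical generalization: replace age with the lower bound of its bin."""
    out = dict(record)
    out["age"] = (record["age"] // width) * width
    return out


def deidentify(records, quasi_identifiers, threshold, widths=(5, 10, 20, 40)):
    """Coarsen age bins step by step until the risk is at or below the threshold."""
    current = list(records)
    for width in widths:
        if max_reid_risk(current, quasi_identifiers) <= threshold:
            return current
        # Always generalize from the original records, not the previous step.
        current = [generalize_age(r, width) for r in records]
    if max_reid_risk(current, quasi_identifiers) <= threshold:
        return current
    raise ValueError("threshold not reached with the available generalizations")


# Toy records, invented for illustration only.
records = [
    {"age": 31, "sex": "F"},
    {"age": 33, "sex": "F"},
    {"age": 36, "sex": "F"},
    {"age": 38, "sex": "F"},
    {"age": 52, "sex": "M"},
    {"age": 57, "sex": "M"},
]
released = deidentify(records, ["age", "sex"], threshold=0.5)
```

In this toy example every raw record is unique, so the assumed risk starts at 1.0; 5-year bins still leave unique records, and 10-year bins bring the smallest group to two records, which meets the threshold of 0.5. A realistic assessment of longitudinal claims data would need a separate risk estimate for each attack scenario, which this sketch does not attempt.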