Abstract:
|
This work describes an institutional effort to de-identify (DI) data held in an international common data model for use by researchers within the institution. To adhere to Safe Harbor guidelines, statistical DI was evaluated. The data consist of quasi-identifiers (QI, e.g., patient name), sensitive attributes (e.g., diagnosis code (DC)), and free-text fields. The literature describes many algorithms, but they often aggregate or suppress data to the point that the result is statistically uninformative. An evaluation of how the data are used directed the focus of the DI effort toward that use case. Patients were grouped by QI with a minimum group size of 10, and each group was assessed for whether any single DC occurred in more than 40% of its patients. MIST software was used to detect identifiers in free-text fields; it was applied to a sample of 200 cases and adjusted until it reached 95% accuracy. Encounter dates were shifted at the patient level. These steps resulted in a loss of 0.4% of patients from the 2.5 million-patient cohort. Applying the many theoretical approaches in the healthcare space brings new challenges. Tradeoffs between the scope of potential analyses and the preservation of patient privacy need to be examined on a use case-by-use case basis.
|
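The grouping check and the patient-level date shift described in the abstract can be illustrated with a minimal Python sketch. It is not the implementation used in this work. The thresholds (group size 10, DC share above 40%) come from the abstract; everything else is an assumption made for illustration: the function names `flag_groups` and `shift_dates`, the record layout with one DC per patient, the example QI fields, the 30-day shift window, and the fixed random seed.

```python
from collections import Counter, defaultdict
from datetime import date, timedelta
import random

MIN_GROUP_SIZE = 10   # minimum patients per QI group (from the abstract)
MAX_DC_SHARE = 0.40   # flag a group if one DC exceeds this share (from the abstract)


def flag_groups(records, qi_keys):
    """Return {qi_tuple: reason} for QI groups that fail either check.

    Assumption: `records` is a list of dicts, one per patient, holding the
    QI fields named in `qi_keys` plus a single diagnosis code under "dc".
    """
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[k] for k in qi_keys)].append(rec["dc"])

    flagged = {}
    for key, codes in groups.items():
        if len(codes) < MIN_GROUP_SIZE:
            flagged[key] = "too_small"
            continue
        # Count of the most frequent diagnosis code in this group.
        top_count = Counter(codes).most_common(1)[0][1]
        if top_count / len(codes) > MAX_DC_SHARE:
            flagged[key] = "dominant_dc"
    return flagged


def shift_dates(encounters, max_days=30, seed=0):
    """Shift encounter dates by one random offset per patient.

    Assumptions: `encounters` is a list of (patient_id, date) tuples, and the
    hypothetical +/- `max_days` window and fixed `seed` are for illustration.
    Intervals between a patient's own encounters are preserved.
    """
    rng = random.Random(seed)
    offsets = {}
    shifted = []
    for pid, d in encounters:
        if pid not in offsets:
            offsets[pid] = timedelta(days=rng.randint(-max_days, max_days))
        shifted.append((pid, d + offsets[pid]))
    return shifted
```

What happens to a flagged group (suppression, further generalization of the QI, or something else) is not specified in the abstract, so the sketch only reports which groups fail each check.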