Abstract:
|
Raw electronic health record (EHR) data is disorganized and full of uncodified variables, and working directly with it for statistical analysis is a challenge in itself. Many data points are duplicated and rely on a very small set of validation criteria shown to data entry personnel. Intimate knowledge of the EHR's data structure is necessary for even the simplest of queries. At Cleveland Clinic, less than 5% of the EHR data are codified variables; the rest are identifiers, dates, and free-text entries. To provide the cleanest and most robust datasets for statistical analysis, numerous statistical techniques are used to clean, parse, map, and validate the raw EHR data. The raw data is taken from both the EHR and other disparate data sources, mapped to discrete ontologies, cleaned and standardized, and finally deposited into a clinical research data repository. Approximately 185 tables from different data sources are condensed into 18 research-ready tables. Through this process, Cleveland Clinic is able to perform live population exploration and produce clean datasets rapidly.
|