Keywords: data provenance, missing data, phenotype, outcome misclassification
Using data generated as a by-product of electronic interactions to improve health and healthcare is a priority for the US healthcare system in the 21st century. Informaticians have played a leading role in extracting “real world data” from electronic systems, with statisticians playing a more peripheral part. However, statistical insights on sampling and inference are key to drawing valid conclusions from these messy and incomplete data sources. In this talk, I will use my research on electronic health record (EHR)-based phenotyping to motivate a discussion of the roles of informatics, statistics, and data science in the process of learning from healthcare data. EHR-based phenotyping is hampered by complex missing data patterns and by heterogeneity across patients and healthcare systems, features that have been largely ignored by existing phenotyping methods. As a result, EHR-derived phenotypes are not only imperfect but often exhibit exposure-dependent differential misclassification, which can bias analyses either towards or away from the null. I will discuss novel and existing approaches to EHR-based phenotyping, as well as statistical methods that correct for phenotyping error in downstream analyses. The overall goal of this presentation is to use the example of phenotyping to illustrate the unique contribution of statistics to the process of generating evidence from EHRs.
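The claim that differential misclassification can bias an analysis away from the null (while nondifferential misclassification typically biases toward it) can be illustrated with a toy calculation. The sketch below is not from the talk: the prevalences, sensitivities, and specificities are assumed values chosen only to make the direction of each bias visible in an exposure-outcome odds ratio.

```python
# Toy illustration (assumed numbers, not from the talk): how outcome
# misclassification distorts an exposure-outcome odds ratio.

def observed_prevalence(p, sens, spec):
    """Prevalence of the error-prone phenotype given true prevalence p."""
    return sens * p + (1 - spec) * (1 - p)

def odds_ratio(p1, p0):
    """Odds ratio comparing outcome prevalence in exposed vs. unexposed."""
    return (p1 / (1 - p1)) / (p0 / (1 - p0))

# True outcome prevalence by exposure group (assumed values)
p_exposed, p_unexposed = 0.30, 0.20
true_or = odds_ratio(p_exposed, p_unexposed)  # ~1.71

# Differential misclassification: phenotype accuracy depends on exposure
q1 = observed_prevalence(p_exposed, sens=0.95, spec=0.85)
q0 = observed_prevalence(p_unexposed, sens=0.70, spec=0.95)
diff_or = odds_ratio(q1, q0)  # ~2.91 -- biased away from the null

# Nondifferential misclassification: same accuracy in both groups
q1 = observed_prevalence(p_exposed, sens=0.80, spec=0.90)
q0 = observed_prevalence(p_unexposed, sens=0.80, spec=0.90)
nondiff_or = odds_ratio(q1, q0)  # ~1.42 -- biased toward the null

print(round(true_or, 2), round(diff_or, 2), round(nondiff_or, 2))
```

Because the phenotype's sensitivity and specificity differ by exposure group in the differential case, the observed association can overstate the true one; with equal error rates in both groups, the estimate is instead attenuated toward 1.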