Abstract:
|
Electronic health records (EHRs) offer great promise for advancing precision health. However, EHRs in their raw form present significant analytical challenges: they contain multi-scale data from heterogeneous domains, can be structured or unstructured, and are collected at irregular time intervals and frequencies. Despite these challenges, analyzing raw EHRs directly would save significant time otherwise spent on pre-processing, which could encourage real-world adoption in clinical settings. EHRs also reflect inequity: the amount of data recorded differs across patients because of health-seeking behaviors, access to care, and other factors. This can contribute to biased data collection, with the consequence that data for marginalized groups may be less informative due to fragmented care. This can be viewed as a missing data problem. There is growing recognition that the ubiquitous missing data in EHRs, even when analyzed using powerful statistical and machine learning algorithms, can yield biased findings and exacerbate health disparities. In this work, we develop novel methods to simulate missing data in raw EHRs and assess the impact using disease prediction models that incorporate word embedding algorithms.
|