Abstract:
|
Electronic health records (EHRs) are collected as a routine part of healthcare delivery and have great potential to improve patient health outcomes. In their complex form, EHRs do not come as a neat data matrix with well-defined features. Instead, the data can take the form of sequences of medical codes ordered by time stamp. We introduce a means of simulating missing data in EHRs of this form: specifically, mechanisms that simulate data that are Missing Completely at Random, Missing at Random, and Missing Not at Random. We account for potentially causal relationships between medical events by using a medical knowledge graph to cluster related events. We also assess the impact of missing data on disease prediction models for various marginalized groups. We find that using the knowledge graph to simulate missing data has a significant impact on disease prediction models, illustrating the need to account for potentially causal relationships when simulating missing data in complex EHRs.
|