Abstract:
|
We set up a tripartite graph to infer underlying diseases in electronic health records (EHR) data. A latent central layer in the graph represents unobserved diseases that explain the symptoms recorded for patients, with patients and symptoms defining the other two layers of the model. The graphical model is mapped to a family of probability models that can be characterized as feature allocation, with features representing latent diseases and one feature-specific parameter that links a set of symptoms to each disease. The model can alternatively be described as a sparse factor model or as categorical matrix factorization. The representation as a graphical model is mainly useful for a graphical summary of the inferred structure. Under a Bayesian approach, available prior information on known diseases greatly improves identifiability of latent diseases. This information includes known diagnoses for patients and known associations of diseases with symptoms. We validate the proposed approach by simulation studies, including misspecified models and a comparison with sparse latent factor models. In an application to Chinese EHR data, we find results that agree with related clinical knowledge.
|