Abstract:
|
Recent federal initiatives are incentivizing the collection and linkage of electronic health records across clinics, hospitals, and healthcare systems. A key challenge in using electronically assembled cohorts is the inconsistent “languages” used across healthcare systems and over time. For example, because of financial incentives and heterogeneity among healthcare systems, different healthcare providers may use alternative medical codes to record the same diagnosis or procedure, which limits the transportability of phenotyping algorithms and statistical models across healthcare systems. In this talk, I formulate medical code translation as a statistical problem: inferring a mapping between two sets of multivariate, unit-length vectors, each set learned from one of two healthcare systems. The problem is particularly interesting because a fraction of the response-predictor pairs are mismatched, whereas classical regression analysis tacitly assumes that each response is correctly linked to its predictor. I propose a novel method for mapping recovery and establish theoretical guarantees for estimation and model selection consistency.
|