Keywords: integrated data, fused data, linked patients, machine learning, NHWS, EHR, PROs, type 2 diabetes
This research aimed to integrate two disparate data sets, patient-reported survey data and electronic health record (EHR) data, to provide a more complete view of disease characteristics and health outcomes among patients with type 2 diabetes (T2D). Data sources were a large, nationally representative US ambulatory EHR database and the 2016 US National Health and Wellness Survey (NHWS). A propensity score matching algorithm, with demographics and comorbidities as predictors, was used to form a nationally representative sample of EHR patients. Common variables for linked patients (i.e., those identified by a third party as the same person in both data sets) were analyzed to identify systematic differences between patient-reported and EHR data. Machine learning algorithms trained on the linked patients were explored to develop models that impute NHWS data for EHR patients. In total, 1,733,003 active T2D patients were identified in the EHR data and 4,113 in NHWS. Using linked patients as training data improved the imputation of NHWS data for non-linked EHR patients. This initial analysis suggests that machine learning algorithms hold promise for integrating disparate data sets.
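To illustrate the propensity score matching step, the following is a minimal sketch and not the study's actual implementation. It assumes a logistic propensity model for survey membership, greedy 1:1 nearest-neighbor matching without replacement, and a caliper of 0.05; none of these choices are stated in the abstract. The covariates (scaled age and a single comorbidity flag), the sample sizes, and all function names are hypothetical, and the data are synthetic.

```python
import math
import random


def predict_proba(w, x):
    """Logistic model: w[0] is the intercept, w[1:] are the covariate weights."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=300):
    """Fit a logistic regression by batch gradient descent (hypothetical settings)."""
    w = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            grad[0] += err
            for j, v in enumerate(xi):
                grad[j + 1] += err * v
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def greedy_match(target_scores, pool_scores, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching on the propensity score, without replacement.

    Returns (target_index, pool_index) pairs; targets with no pool member
    within the caliper are left unmatched.
    """
    available = set(range(len(pool_scores)))
    pairs = []
    for t_idx, t_score in enumerate(target_scores):
        if not available:
            break
        # Break ties on index so the result is deterministic.
        best = min(available, key=lambda i: (abs(pool_scores[i] - t_score), i))
        if abs(pool_scores[best] - t_score) <= caliper:
            pairs.append((t_idx, best))
            available.remove(best)
    return pairs


# Synthetic stand-ins for survey respondents and EHR patients (assumed data).
rng = random.Random(0)


def synthetic_patient(mean_age):
    age = min(max(rng.gauss(mean_age, 12), 18), 90)
    has_comorbidity = float(rng.random() < 0.6)
    return [age / 100.0, has_comorbidity]


survey = [synthetic_patient(58) for _ in range(20)]
ehr = [synthetic_patient(64) for _ in range(200)]

# Propensity score: modeled probability of being a survey respondent.
X = survey + ehr
y = [1] * len(survey) + [0] * len(ehr)
weights = fit_logistic(X, y)

survey_scores = [predict_proba(weights, x) for x in survey]
ehr_scores = [predict_proba(weights, x) for x in ehr]

pairs = greedy_match(survey_scores, ehr_scores)
print(f"matched {len(pairs)} of {len(survey)} survey respondents")
```

In this sketch, the matched EHR patients form a subsample whose covariate distribution resembles that of the survey respondents. A real analysis would use many more predictors and would check covariate balance after matching.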