Abstract:
|
Electronic health record (EHR) data are widely used in modern healthcare research and contain useful information characterizing patients' clinical visits. Because of privacy concerns surrounding the sharing of patient-level data, most clinical data analyses are performed at individual sites. The resulting studies are often underpowered and specific to a single population, creating a need for methods that perform analyses across sites without sharing patient-level data. To address this need, distributed algorithms have been developed that conduct analyses across sites by sharing only aggregated information, thereby preserving patient privacy. We propose a communication-efficient distributed algorithm for performing hurdle regression on data stored at multiple sites. By modeling zero and positive counts separately, we account for zero-inflation in the outcome, which is common when characterizing patient hospitalization frequency. Our simulations show that the algorithm achieves high accuracy, comparable to that of the oracle estimator, which uses all patient-level data pooled together. We apply the algorithm to data from the Children's Hospital of Philadelphia to estimate how often a patient is likely to be hospitalized, given data collected during clinical visits.
|