Abstract:
|
We study semiparametric inference for merged data from multiple overlapping data sources. In public health data integration, the studies to be combined often have different but overlapping target populations. In addition, subjects in a disease registry may appear as patients in other clinical studies. The setting we consider is characterized by (1) duplication of the same units across multiple samples, (2) duplication that is not identified across samples, and (3) dependence due to finite population sampling. Applications include data synthesis of clinical trials, epidemiological studies, disease registries, and health surveys. Our main results extend empirical process theory to biased and dependent samples with duplication. Specifically, we develop a uniform law of large numbers and a uniform central limit theorem, and apply them to obtain general theorems on consistency, rates of convergence, and asymptotic normality for infinite-dimensional M-estimators. Our method accounts for heterogeneity and bias in multiple data sets and guarantees generalizability of scientific findings from combined data. Our results are illustrated with simulation studies and a real data example using the Cox proportional hazards model.
|