Researchers are routinely confronted with the task of effectively extracting and integrating information from multiple covariate-rich datasets. Such datasets give rise to different studies, each targeting a common set of scientific questions. For example, we consider multiple observational studies, each with a potentially large number of covariates to adjust for in a regression model. Each study deploys a regularization-based method such as the Lasso to report findings based on a sparse model of covariates. Regularized model selection often results in an over-fitted model that leads to an asymptotically biased estimate of the treatment effect. We aim to offer guidance on what summary statistics must be recorded from these already available studies to inform data collection and conduct efficient model aggregation in a follow-up investigation.
We introduce a carve-and-aggregate strategy that uses summary-level information to obtain estimates free of selection bias. Our estimate is, by construction, statistically more efficient than those based only on the follow-up samples. We report the efficiency gains of our estimate over those produced by splitting.