Abstract:
Modern data from large observational databases are subject to complex and likely informative coarsening mechanisms that limit the validity of statistical analyses. While nonparametric bounds and sensitivity analyses offer a conservative way forward in the absence of additional data, these approaches provide little hope of identifying full data functionals of interest. A promising design strategy known as double sampling allocates resources to collecting additional data on a subsample of subjects for whom missing or coarsened data were initially encountered. We present a general framework for describing double sampling designs, delineate conditions under which the full data law is identified, and provide a general procedure for constructing semiparametric efficient estimators of full data functionals. Moreover, focusing on a causal inference example with missing outcomes, treatments, or confounders, we discuss the potential for additional efficiency gains in targeting the average treatment effect by collecting auxiliary variables. We also discuss theoretically optimal double sampling probabilities, and demonstrate the approach and relevant tradeoffs in a simulation study.