Abstract:
|
In the two-phase stratified case-control sampling design, some covariates are available only for a subset of cases and controls, who are selected based on the outcome and the fully collected covariates. The analysis often focuses on fitting a logistic regression model to describe the relationship between the outcome and all covariates. We are also interested in characterizing the distribution of the incomplete covariates conditional on the fully observed ones in the underlying population, which is required for quantifying the predictive accuracy of the fitted model. It is desirable to include all subjects in the analysis to achieve consistency and efficiency of parameter estimation. We propose a novel semiparametric maximum likelihood approach under the rare disease assumption, in which estimates are obtained through a reparametrized profile likelihood technique. We develop the large-sample theory for the proposed estimator and show through simulation that it has improved efficiency compared with existing approaches. We apply our method to data from the Breast Cancer Detection and Demonstration Project, where one risk predictor, breast density, was measured only for a subset of the women in the study.
|