Abstract:
|
We consider the situation where there is a well-established regression model [Y|X] that uses a set of commonly available risk predictors X to predict an important outcome Y. A modest-sized dataset of n observations containing Y, X, and B is available, where B is a new variable that is thought to be important and is expected to improve the prediction of Y. The challenge is to build a good model for [Y|X,B] that uses both the available dataset and the known model for [Y|X]. One popular proposal in the literature is the constrained maximum likelihood (CML) approach, which maximizes the likelihood for [Y|X,B] subject to the constraints on the parameters that arise from the known [Y|X] model. We propose a synthetic data approach, which consists of creating m additional synthetic observations and then analyzing the combined dataset of size n+m to estimate the parameters of the model [Y|X,B]. In two special cases we show that, for large m, the synthetic data approach gives the same asymptotic variance for the parameters of the [Y|X,B] model as the CML approach. This provides some theoretical justification for the synthetic data approach and, given its broad applicability, makes the approach very appealing.
|