Abstract:

We consider the situation where a known regression model can be used to predict an outcome, Y, from a set of predictor variables X. A new variable B is expected to enhance the prediction of Y. A modest-sized dataset of n observations containing Y, X and B is available, and the challenge is to build an improved model for Y|X, B that uses both the available dataset and the known model for Y|X. We propose a synthetic data approach, which consists of creating m additional synthetic observations and then analyzing the combined dataset of size n+m to estimate the parameters of the Y|X, B model. This combined dataset has missing values of B for the m synthetic observations, and is analyzed using methods that can handle missing data. We illustrate the method using multiple imputation in an example and in simulations. To provide analytical justification, we consider two special cases, in which we show that our approach with very large m gives the same asymptotic variance for the parameters of the Y|X, B model as an alternative, published constrained maximum likelihood estimation approach. This justification and the method's broad applicability make it appealing in more general cases.
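
The workflow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the linear models, the coefficient values, the choice to resample X from the observed data, the assumption that synthetic Y values are drawn from the known Y|X model, and the normal-linear imputation model for B given X and Y are all hypothetical choices made for the example. The helper `ols` and all variable names are likewise illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)


def ols(design, y):
    """Ordinary least squares: coefficients, residual variance, (X'X)^-1, df."""
    xtx_inv = np.linalg.inv(design.T @ design)
    beta = xtx_inv @ design.T @ y
    resid = y - design @ beta
    df = len(y) - design.shape[1]
    return beta, resid @ resid / df, xtx_inv, df


n, m, n_imp = 100, 1000, 20  # observed size, synthetic size, number of imputations

# Observed dataset with Y, X and B (simulated here; all values are illustrative).
x = rng.normal(size=n)
b = 0.5 * x + rng.normal(size=n)
y = 1.0 + 1.5 * x + 1.0 * b + rng.normal(size=n)

# Hypothetical "known" model for Y|X implied by the simulation above:
# Y = 1 + 2 X + error, with error variance 2.
known_beta = np.array([1.0, 2.0])
known_sd = np.sqrt(2.0)

# Step 1: create m synthetic observations (X, Y) from the known model; B is missing.
x_syn = rng.choice(x, size=m, replace=True)
y_syn = known_beta[0] + known_beta[1] * x_syn + rng.normal(scale=known_sd, size=m)

# Step 2: stack observed and synthetic records into one dataset of size n + m.
x_all = np.concatenate([x, x_syn])
y_all = np.concatenate([y, y_syn])
combined_rows = len(y_all)

# Step 3: multiple imputation of the missing B values, assuming B | X, Y is
# normal and linear, with parameters estimated from the n complete cases.
imp_obs = np.column_stack([np.ones(n), x, y])
imp_syn = np.column_stack([np.ones(m), x_syn, y_syn])
g_hat, s2_hat, g_xtx_inv, g_df = ols(imp_obs, b)

estimates, variances = [], []
for _ in range(n_imp):
    # Draw imputation-model parameters from their approximate posterior.
    s2_draw = s2_hat * g_df / rng.chisquare(g_df)
    g_draw = rng.multivariate_normal(g_hat, s2_draw * g_xtx_inv)
    b_syn = imp_syn @ g_draw + rng.normal(scale=np.sqrt(s2_draw), size=m)

    # Fit the Y|X, B model to the completed dataset.
    design = np.column_stack([np.ones(n + m), x_all, np.concatenate([b, b_syn])])
    beta, s2, xtx_inv, _ = ols(design, y_all)
    estimates.append(beta)
    variances.append(s2 * np.diag(xtx_inv))

# Step 4: pool the completed-data analyses with Rubin's rules.
estimates = np.array(estimates)
pooled_beta = estimates.mean(axis=0)
within = np.mean(variances, axis=0)
between = estimates.var(axis=0, ddof=1)
total_var = within + (1 + 1 / n_imp) * between

print("pooled coefficients (intercept, X, B):", pooled_beta)
print("pooled standard errors:", np.sqrt(total_var))
```

Comparing the pooled standard errors with those from an ordinary regression on the n observed records alone gives a rough sense of how much the synthetic observations contribute in this toy setting.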
