Abstract:
|
Identifying the “best” set of independent variables is a common challenge when building statistical models. To choose the variables to include in their models, many analysts use automated variable selection methods, e.g., forward selection, backward elimination, stepwise selection, and single-variable screening. In simple random samples, these selection methods have been criticized for producing an upward bias in the regression coefficients and a downward bias in the standard errors, which may result in the selection of a suboptimal set of independent variables. Despite these criticisms, the same selection methods are being applied to complex survey data. To avoid these problems, our approach to finding the “best” model for complex survey data is to use k-fold cross-validation to directly estimate the test error. We estimate the test error for all possible subsets of independent variables and treat the subsets with the smallest test errors as our candidate models. In this paper, we demonstrate an application of this approach for regression and logistic regression models using complex survey data and discuss the additional challenges posed by the complex survey design.
|