Classification models can demonstrate high apparent prediction accuracy even when there is no underlying relationship between the predictor variables and the response. The consequences of variable selection bias are often underestimated, leading to a high likelihood of false positive variable selections and overestimation of true model performance.
A simulation study was conducted using logistic regression with forward stepwise, best subsets, and LASSO variable selection techniques, varying both the sample size and the number of random noise predictor variables. The area under the ROC curve (AUC), the number of variables selected, and the apparent statistical significance of each final model were recorded, and more appropriate AUC cutoffs that control the false positive rate were derived from the simulation results.
All three variable selection techniques consistently selected noise predictors for inclusion in the models. The critical values for the AUC that we propose provide better thresholds for determining whether there is more than a chance association between the predictors and the outcome, preventing needless follow-up on biomarkers with no true underlying association to the outcome.
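The chance-capitalization phenomenon the abstract describes can be reproduced in a few lines. The sketch below is not the authors' simulation code: it stands in for the selection procedures with a crude best-subsets-style screen (keeping the k noise predictors most correlated with the outcome) and then reports the apparent (training-data) AUC of a logistic regression fit on those predictors. The sample size, number of predictors, and k = 5 are illustrative choices; scikit-learn is assumed.

```python
# Sketch only: noise predictors, an outcome generated independently of them,
# and a simple best-subsets-style screen as a stand-in for the selection
# methods in the study. All settings here are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n, p = 50, 20                      # small sample, many candidate predictors
X = rng.standard_normal((n, p))    # pure noise predictors
y = rng.integers(0, 2, size=n)     # binary outcome, independent of X

# Screen: keep the 5 predictors with the largest |correlation| with y.
# Because X and y are independent, every "selected" variable is noise.
corrs = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(p)])
selected = np.argsort(corrs)[-5:]

# Apparent AUC is evaluated on the same data used for selection and fitting,
# so it will typically sit well above the chance value of 0.5.
model = LogisticRegression().fit(X[:, selected], y)
apparent_auc = roc_auc_score(y, model.predict_proba(X[:, selected])[:, 1])
print(f"noise variables selected: {len(selected)}, apparent AUC: {apparent_auc:.2f}")
```

Repeating this over many simulated datasets and taking an upper quantile of the resulting apparent AUCs is one way to obtain the kind of false-positive-controlling AUC cutoff the abstract proposes.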