Abstract:
|
Medical researchers are often interested in selecting a panel of predictor variables for diagnostic or prognostic models. A standard statistical approach is to use logistic regression to identify markers of patient status, such as cancer versus control, with performance assessed by the area under the ROC curve (AUC). This scenario is especially common in biomarker validation studies, which can include large numbers of predictor variables relative to the sample size. Researchers typically try to select the "best" model using automated variable selection techniques such as forward stepwise selection, best subsets, or LASSO. We propose that ridge regression has a higher out-of-sample AUC than these more standard methods in most circumstances and should be used more frequently. Our study assesses these methods across 20 real biomarker datasets, with sample sizes ranging from 12 to 160 and the number of markers ranging from 5 to 800.
|