Abstract:
|
A common aim in clinical research is to identify and quantify risk factors. With the rise of genomic data and electronic data capture, the number of candidate explanatory variables may be large, so variable selection is needed to obtain a simple model. Lack of prior knowledge often makes a priori selection undesirable. Known drawbacks of automatic variable selection have motivated alternative methods, including penalized regression, which address some of these drawbacks but have disadvantages of their own. To guide statisticians working in clinical applications, we used simulations to compare the performance of several methods, focusing on measures that inform common modelling goals, across sample sizes and numbers of candidate variables typical of modern observational studies in clinical research. Penalized methods optimize prediction and do not produce unbiased coefficient estimates; as expected, we observed low prediction error but also underestimation of coefficients, leading to predicted probabilities with a narrow spread. In small datasets, every method only rarely selected the correct model. We discuss practical issues that arise in penalized regression and are often not mentioned in tutorials on the topic.
|