Abstract:
|
The lack of replicability and reproducibility in research studies threatens the credibility of the scientific community. We focus on the problem of selective inference in practice and how it contributes to the unreliability of research results. Researchers commonly select, or "cherry-pick", models based on the observed data and then construct confidence intervals or hypothesis tests for the chosen model. Without correcting for the selection procedure, this practice invalidates asymptotic statistical guarantees and leads to false discovery rates higher than researchers assume. Through simulation studies modeled on selection practices found in a wide variety of publications, we demonstrate how standard selection procedures, including variable selection (stepwise regression, LASSO, PCA) and variable transformation (Box-Cox transformations, dichotomization of continuous variables), can invalidate statistical inference and inflate false discovery rates. Acknowledging that model selection is often a practical necessity, we outline remedies, such as sample splitting and corrections for multiple testing, that guarantee valid post-selection inference.
|
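The core phenomenon the abstract describes can be reproduced in a few lines. The following is a minimal sketch, not the paper's actual simulation design: under a global null (no predictor is truly associated with the response), selecting the most correlated predictor and then testing it on the same data inflates the false positive rate well above the nominal level, while sample splitting restores validity. All parameter values (`n`, `p`, `n_sims`) are illustrative choices, and only `numpy` and `scipy` are assumed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p, n_sims, alpha = 200, 20, 1000, 0.05
naive_rejections = split_rejections = 0

for _ in range(n_sims):
    # Pure-noise data: no predictor is truly associated with y.
    X = rng.standard_normal((n, p))
    y = rng.standard_normal(n)

    # Naive practice: "cherry-pick" the predictor most correlated with y,
    # then test it on the same data as if it had been chosen in advance.
    j = np.argmax(np.abs(X.T @ y))
    _, pval = stats.pearsonr(X[:, j], y)
    naive_rejections += pval < alpha

    # Remedy: select on one half of the sample, test on the other half.
    half = n // 2
    j = np.argmax(np.abs(X[:half].T @ y[:half]))
    _, pval = stats.pearsonr(X[half:, j], y[half:])
    split_rejections += pval < alpha

print(f"naive false positive rate:        {naive_rejections / n_sims:.3f}")
print(f"sample-split false positive rate: {split_rejections / n_sims:.3f}")
print(f"nominal level:                    {alpha}")
```

With 20 null predictors, the naive test behaves roughly like the minimum of 20 p-values, so its rejection rate lands near 1 - 0.95^20 ≈ 0.64 rather than 0.05, whereas the sample-split test rejects at approximately the nominal rate.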