Abstract:
|
Regression variable selection procedures are widely used in every quantitative field to estimate sparser, more interpretable models. When analyzing large, high-dimensional datasets, greedy selection algorithms such as Forward Selection (FS) are valued for their low computational cost and their ability to handle the case of more variables than observations (p > n). We derive sufficient conditions for FS to achieve exact recovery of the true model support in the deterministic case, as well as for model selection consistency in the random case. Our conditions allow p to grow with n. For situations where the true model size is unknown, we develop a consistent stopping rule based on a sequential variant of Monte Carlo Cross Validation. Finally, for linear models, model-selection-consistent cross validation requires the data-splitting (training/testing) ratio to tend to 0 asymptotically, which is impractical for any finite sample. We therefore provide pragmatic suggestions for the data-splitting ratio in large but finite samples, as well as heuristic advice for balancing underfitting and overfitting in small samples.
|