Abstract:
|
Feature selection from a large number p of covariates in a regression analysis is a central challenge in data science, especially when scaling to ever-growing data and identifying scientifically important features. The modern approach to feature selection in large-p data uses penalized likelihood or shrinkage estimation, such as the LASSO, SCAD, Elastic Net, or MCP procedures. The randomForest procedure is another alternative. We present a different approach using a new subsampling method, called the Subsampling Winner algorithm (SWA), which subsamples from the p features rather than from the n observations. Owing to its subsampling nature, SWA can in principle scale to data of any dimension. In a linear regression setting, SWA has the best-controlled false discovery rate among the aforementioned procedures while maintaining a competitive true-feature discovery rate. We investigate the reasons behind its strong performance, provide practical strategies to doubly assure an SWA selection, and discuss its extension to more general settings. We also discuss computational improvements and SWA's relation to some machine-learning algorithms.
|
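The abstract does not spell out the algorithm, but the core idea of subsampling features rather than observations can be illustrated with a minimal sketch. The Python code below is a hypothetical illustration, not the authors' implementation: the subsample size s, the number of rounds B, and the |t|-statistic "winner" rule are all illustrative assumptions. Each round draws a random s-feature subset, fits ordinary least squares on it, and credits the feature with the largest |t|-statistic as that round's winner; features are then ranked by win counts.

```python
# Hypothetical sketch of a feature-subsampling selection scheme in the
# spirit of SWA; the subset size, round count, and winner criterion are
# assumptions, not the authors' published algorithm.
import numpy as np

def subsampling_winners(X, y, s=10, B=500, seed=0):
    """Tally how often each feature 'wins' an OLS fit on a random
    s-feature subsample; X is n x p, y has length n."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    wins = np.zeros(p)
    for _ in range(B):
        idx = rng.choice(p, size=s, replace=False)      # subsample features, not rows
        Xs = np.column_stack([np.ones(n), X[:, idx]])   # design matrix with intercept
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)   # OLS fit on the subsample
        resid = y - Xs @ beta
        sigma2 = resid @ resid / max(n - s - 1, 1)      # residual variance estimate
        cov = sigma2 * np.linalg.pinv(Xs.T @ Xs)        # coefficient covariance
        tstats = np.abs(beta[1:]) / np.sqrt(np.diag(cov)[1:])
        wins[idx[np.argmax(tstats)]] += 1               # credit this round's winner
    return wins  # rank features by their win counts

# Example: only the first 3 of 100 features matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 100))
y = X[:, 0] + 0.8 * X[:, 1] - 0.6 * X[:, 2] + rng.normal(size=200)
print(np.argsort(subsampling_winners(X, y))[::-1][:5])  # top-5 features by wins
```

Because each round touches only s of the p features, the per-round cost is independent of p, which is consistent with the abstract's claim that a feature-subsampling scheme can scale to data of any dimension in principle.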