Abstract:
|
Feature selection from a large number of features in regression analysis remains a challenge in data science. We present a new subsampling method, called the Subsampling Winner Algorithm (SWA), for feature selection in large regression data. The central idea of our approach is analogous to the selection of national merit scholars. It applies a `base procedure' to each of a number of subsamples, ranks all features with a scoring algorithm according to their performance in the subsample analyses, obtains the `semifinalists' from the resulting scores, and finally determines the `finalists', that is, the important features, from among the `semifinalists'. Because it works on subsamples, our procedure is applicable in principle to data of any dimension, including data too large for a statistical procedure to be run on the full data with an existing software package. We compare our procedure with other procedures, including elastic net and SCAD, and illustrate the SWA's application to genomic data on ovarian cancer.
|