Abstract:
|
Feature selection has many applications, and is challenging when the number of features is truly large, especially when that number keeps expanding. We present a new approach, called the Subsampling Winner Algorithm (SWA), for feature selection in regression analysis of large data. The central idea of our approach is analogous to that used for the selection of national merit scholars. SWA applies a 'base procedure' to each of the subsamples, scores every feature according to its performance across all subsample analyses, selects the 'semifinalists' based on the resulting scores, and finally determines the 'finalists', i.e. the most important features, from among the 'semifinalists'. We compare SWA with current benchmark procedures based on penalized criteria and on random forests, both when features are independent and when they are correlated. We illustrate its application to genomic data on ovarian cancer.
|
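
The subsample, score, semifinalist and finalist steps described in the abstract can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: it assumes that subsamples are random subsets of features, that the base procedure is ordinary least squares, that a feature's score is its average absolute t-statistic over the subsamples containing it, and that finalists are the semifinalists whose t-statistic in a joint fit exceeds a fixed threshold. The function names (`ols_t_stats`, `swa_sketch`) and all parameter defaults are hypothetical.

```python
import numpy as np


def ols_t_stats(X, y):
    """Absolute t-statistics of the slopes from an OLS fit with intercept."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (n - k - 1)      # residual variance estimate
    cov = sigma2 * np.linalg.pinv(A.T @ A)    # covariance of the estimates
    se = np.sqrt(np.diag(cov))
    return np.abs(beta[1:] / se[1:])          # drop the intercept


def swa_sketch(X, y, subsample_size=10, n_subsamples=500,
               n_semifinalists=10, t_threshold=3.0, seed=0):
    """Illustrative subsample -> score -> semifinalists -> finalists pass.

    All defaults are assumptions made for this sketch.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    score_sum = np.zeros(p)
    counts = np.zeros(p)

    # Base procedure (assumed: OLS) on each random subset of features.
    for _ in range(n_subsamples):
        idx = rng.choice(p, size=subsample_size, replace=False)
        score_sum[idx] += ols_t_stats(X[:, idx], y)
        counts[idx] += 1

    # Score of a feature: mean |t| over the subsamples that contained it.
    scores = score_sum / np.maximum(counts, 1)

    # Semifinalists: the top-scoring features.
    semifinalists = np.argsort(scores)[::-1][:n_semifinalists]

    # Finalists: semifinalists that remain significant in a joint fit.
    t_final = ols_t_stats(X[:, semifinalists], y)
    finalists = np.sort(semifinalists[t_final > t_threshold])
    return finalists, semifinalists
```

Calling `swa_sketch(X, y)` on an `n x p` design matrix `X` and response vector `y` returns the indices of the finalists and of the semifinalists. The subsample size must stay below `n - 1` so that each base fit has residual degrees of freedom.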