With the growth of big data, variable selection has become one of the critical challenges in statistics. Although many methods have been proposed in the literature, their performance in terms of recall and precision is limited when the number of variables far exceeds the number of observations or when the variables are highly correlated.
We propose a general algorithm which improves the precision of any existing variable selection method. This algorithm is based on highly intensive simulations and takes into account the correlation structure of the data.
Users of selectBoost can apply this algorithm to produce a confidence index or to choose an appropriate precision-selection trade-off, so as to select variables with high confidence and avoid selecting non-predictive features. The main idea behind our algorithm is to take into account the correlation structure of the data and to use intensive computation to select reliable variables.
We succeeded in improving the precision of the lasso selection method while keeping recall and F-score relatively stable, and we demonstrate the performance of our algorithm on simulated and real data for linear models.
The algorithm is available as an R package on CRAN.
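To make the precision, recall, and F-score mentioned above concrete, the following minimal sketch computes them for a selected variable set against a known set of truly predictive variables, as one would have in a simulation. It is an illustration only, not part of the selectBoost package: the function name `selection_scores`, the variable names, and the two example selections are hypothetical, and returning 0.0 for an empty set is a convention assumed here.

```python
def selection_scores(selected, truly_predictive):
    """Return (precision, recall, f_score) for a selected variable set.

    Hypothetical helper for illustration; not an API of any package.
    """
    selected = set(selected)
    truly_predictive = set(truly_predictive)
    true_positives = len(selected & truly_predictive)

    # Precision: share of selected variables that are truly predictive.
    # Assumed convention: 0.0 when nothing is selected.
    precision = true_positives / len(selected) if selected else 0.0
    # Recall: share of truly predictive variables that were selected.
    recall = true_positives / len(truly_predictive) if truly_predictive else 0.0

    if precision + recall == 0:
        return precision, recall, 0.0
    # F-score: harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Made-up example: four truly predictive variables out of many candidates.
truth = ["x1", "x2", "x3", "x4"]
liberal = ["x1", "x2", "x3", "x7", "x9", "x12"]  # includes three false positives
strict = ["x1", "x2", "x3"]                      # no false positives

print(selection_scores(liberal, truth))
print(selection_scores(strict, truth))
```

In this made-up example the stricter selection drops the three false positives, so its precision rises while its recall is unchanged, which is the kind of trade-off the abstract describes.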