Abstract:
|
Random forests have remained among the most popular off-the-shelf machine learning methods since their inception in 2001. Recent work provides strong evidence that the randomization in random forests serves as a form of implicit regularization, making them ideal models in low signal-to-noise ratio settings. Our work here provides another means of regularization: the inclusion of additional noise covariates in the model. The improvement from this sort of “augmented” bagging procedure can sometimes exceed that of traditional random forests. More importantly, this has crucial implications for metrics designed to measure variable importance, many of which compare model performance with versus without a set of features included. Our work implies that such procedures can show model improvements even when the features of interest are completely independent of the remaining data. Thus, we advocate comparing model performance with the original features against models in which feature subsets are replaced by random substitutes.
|