Abstract:
|
Despite their well-established record, a full and satisfying explanation for the success of random forests has yet to be put forth. Here, we take a step in this direction. Comparing against bagging with non-randomized base learners, we demonstrate that random forests are implicitly regularized by the additional randomness injected into individual trees, making them highly advantageous in low signal-to-noise ratio (SNR) settings. Furthermore, we show that this regularization property is not unique to tree-based ensembles and can be generalized to other supervised learning procedures. Motivated by this, we find that another surprising and counterintuitive means of regularizing ensembles is the inclusion of additional random noise features in the model. Importantly, this raises substantial concerns about common notions of variable importance based on improvements in model accuracy, as even purely random noise features can routinely register as statistically significant.
|
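The sketch below illustrates the comparison described in the abstract: bagged trees versus a random forest on a simulated low-SNR regression. It is a minimal illustration only, assuming scikit-learn and an arbitrary simulation design and SNR value not taken from the paper; it does not reproduce the authors' experiments.

```python
# Illustrative sketch (assumed setup, not the paper's experiments): compare
# bagging (every feature considered at every split) against a random forest
# (random feature subsetting per split) in a low signal-to-noise simulation.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n, p, snr = 500, 10, 0.1                      # assumed low-SNR regime
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
signal = X @ beta
noise_sd = np.sqrt(np.var(signal) / snr)       # scale noise to hit target SNR
y = signal + rng.normal(scale=noise_sd, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bagging: each bootstrapped tree may split on any feature (max_features=None).
bagging = RandomForestRegressor(n_estimators=200, max_features=None,
                                random_state=0).fit(X_tr, y_tr)
# Random forest: restricting candidate split features acts as implicit regularization.
forest = RandomForestRegressor(n_estimators=200, max_features="sqrt",
                               random_state=0).fit(X_tr, y_tr)

print("bagging test MSE:", mean_squared_error(y_te, bagging.predict(X_te)))
print("forest  test MSE:", mean_squared_error(y_te, forest.predict(X_te)))
```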