Abstract:
|
When aggregating predictions with a voting rule, it is natural to ask "How many votes are needed to obtain a reliable prediction?" In the context of ensemble classifiers such as Random Forests, this question specifies a trade-off between computational cost and statistical performance. Namely, paying a larger computational price for more classifiers tends to reduce the prediction error of the ensemble and make it more stable. Conversely, by sacrificing some statistical efficiency, it is possible to speed up the tasks of training the ensemble and making new predictions. In this paper, we quantify this trade-off for the methods of Bagging and Random Forests, using a bootstrap-based approach. To be specific, let the random variable Err_t denote the prediction error of a randomly generated ensemble of t classifiers, trained on a fixed dataset. Then, as t tends to infinity, we show that the variance var(Err_t) can be consistently estimated via our proposed resampling method. As a consequence, this result offers practitioners a guideline for choosing the smallest number of base classifiers (e.g. decision trees) needed to ensure that var(Err_t) is less than a given value.
|
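The following is an illustrative sketch, not the paper's exact algorithm: given a pool of trained trees, it resamples subsets of t trees with replacement, recomputes the majority-vote error on a test set, and takes the empirical variance of those errors as an estimate of var(Err_t). All function names and the synthetic data are hypothetical, for demonstration only.

```python
# Hypothetical sketch of a bootstrap estimate of var(Err_t): resample t trees
# from a trained pool, majority-vote, and measure how the ensemble error varies.
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_var_err(tree_preds, y, t, n_boot=200, rng=rng):
    """tree_preds: (T, n) array of 0/1 predictions from T trained trees.
    y: (n,) array of true 0/1 labels.
    Returns the sample variance of the error rate of a majority-vote
    ensemble of t trees, over n_boot bootstrap draws of trees."""
    T, n = tree_preds.shape
    errs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, T, size=t)       # draw t trees with replacement
        votes = tree_preds[idx].mean(axis=0)   # fraction of trees voting class 1
        ens_pred = (votes > 0.5).astype(int)   # majority vote
        errs[b] = np.mean(ens_pred != y)       # ensemble error rate
    return errs.var(ddof=1)

# Toy demo: 100 synthetic "trees" that each predict correctly w.p. 0.7.
y = rng.integers(0, 2, size=500)
correct = rng.random((100, 500)) < 0.7
tree_preds = np.where(correct, y, 1 - y)

# The estimated variance of Err_t should shrink as the ensemble grows.
v_small = bootstrap_var_err(tree_preds, y, t=5)
v_large = bootstrap_var_err(tree_preds, y, t=50)
```

In this toy setup, v_small exceeds v_large, matching the abstract's premise that larger ensembles yield more stable error; a practitioner would increase t until the estimated var(Err_t) falls below a chosen threshold.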
Copyright © American Statistical Association.