Abstract:

Random forests are a powerful statistical tool that captures complex relationships between features and outcomes of interest. They are widely used in the analysis of healthcare and many other kinds of data, via several popular R packages. Trees built in a random forest depend on several hyperparameters, one of the more critical being the node size. The size of the terminal nodes has implications for model variance and therefore for test/generalization error; in consequence, most statistical software offers parameters that control node size in various ways. In particular, the two most popular R packages, randomForest and ranger, control node size by limiting the size of the parent node, so that a node cannot be split if it contains fewer than a specified number of observations. We argue that this hyperparameter should instead be specified as the minimum number of observations in each terminal node; we hypothesize that this alternative would yield predictions with smaller variance but larger bias than those of the R packages. The implications of these two approaches are examined through their effect on the test/generalization error, bias, and variance of the resulting predictions in a number of simulated datasets.
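The distinction between the two parameterizations can be illustrated with a minimal sketch in Python, using scikit-learn's RandomForestRegressor as an analogue of the R packages discussed: here, min_samples_split stands in for the parent-node control (a node is only split if it is large enough) and min_samples_leaf for the terminal-node control argued for above. This mapping is an assumption for illustration only, not the paper's implementation.

```python
# Sketch (assumption): scikit-learn analogue of the two node-size controls;
# the paper itself concerns the R packages randomForest and ranger.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=0)

# Parent-node control: a node with fewer than 10 observations is never split,
# but splitting a 10-observation node can still leave a much smaller leaf.
rf_parent = RandomForestRegressor(
    n_estimators=25, min_samples_split=10, min_samples_leaf=1, random_state=0
).fit(X, y)

# Terminal-node control: every leaf is guaranteed at least 5 observations.
rf_leaf = RandomForestRegressor(
    n_estimators=25, min_samples_leaf=5, random_state=0
).fit(X, y)

# Inspect the first tree of each forest to confirm the two guarantees.
t_parent = rf_parent.estimators_[0].tree_
t_leaf = rf_leaf.estimators_[0].tree_
internal = t_parent.children_left != -1   # nodes that were actually split
leaves = t_leaf.children_left == -1       # terminal nodes
smallest_split_node = t_parent.n_node_samples[internal].min()
smallest_leaf = t_leaf.n_node_samples[leaves].min()
print(smallest_split_node)  # only nodes with at least 10 observations split
print(smallest_leaf)        # every terminal node meets the 5-observation floor
```

Under parent-node control, only the size of split nodes is bounded below, so terminal nodes may fall beneath the threshold; under terminal-node control, the bound applies to the leaves themselves, which is the source of the hypothesized variance/bias trade-off.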
