All Times EDT

Abstract Details

Activity Number: 500 - Statistical Learning
Type: Contributed
Date/Time: Thursday, August 6, 2020 : 10:00 AM to 2:00 PM
Sponsor: Section on Statistical Learning and Data Science
Abstract #313083
Title: Effects of Stopping Criterion in the Growth of Trees in Regression Random Forests
Author(s): Aryana Arsham* and Philip Rosenberg and Mark Peter Little
Companies: National Cancer Institute and National Cancer Institute and National Cancer Institute
Keywords: random forest; generalization error; test error; bias; variance; regression

Random forests are powerful statistical tools that capture complex relationships between features and outcomes of interest. They are commonly used to analyze healthcare data and many other kinds of data, via various widely used packages in R. Trees built in a random forest depend on several hyperparameters, one of the more critical of which is the node size. The size of the terminal nodes has implications for model variance and therefore for test/generalization error. Consequently, most statistical software provides parameters that control node size in various ways. In particular, the most popular R packages, randomForest and ranger, control node size by limiting the size of the parent node, so that a node cannot be split if it has fewer than a specified number of observations. We argue that this hyperparameter should instead be specified as the minimum number of observations in each terminal node; we hypothesize that this specification would yield smaller variance but larger bias than the approach used in these R packages. The implications of the two approaches will be examined through their effect on the test/generalization error, bias, and variance of the resulting predictions in a number of simulated datasets.
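
The difference between the two stopping rules described above can be illustrated with a minimal sketch in Python. The function `candidate_splits` and its arguments `min_parent` and `min_leaf` are hypothetical names introduced only for this illustration; they are not the interface of randomForest or ranger, and the sketch only counts which child sizes each rule permits rather than growing an actual tree.

```python
def candidate_splits(n, min_parent=None, min_leaf=None):
    """Return the allowed (left_size, right_size) pairs for a node of n observations.

    min_parent: the node may be split only if it holds at least this many
                observations (parent-size rule; hypothetical parameter name).
    min_leaf:   every child produced by a split must hold at least this many
                observations (terminal-size rule; hypothetical parameter name).
    """
    if min_parent is not None and n < min_parent:
        return []  # node is too small to be split at all
    lo = 1 if min_leaf is None else min_leaf
    # k is the number of observations sent to the left child
    return [(k, n - k) for k in range(lo, n - lo + 1)]


n = 10
parent_rule = candidate_splits(n, min_parent=5)
leaf_rule = candidate_splits(n, min_leaf=5)

# Smallest child that each rule can produce from a node of 10 observations
print(min(min(pair) for pair in parent_rule))
print(min(min(pair) for pair in leaf_rule))
```

Under the parent-size rule, a node of 10 observations with a threshold of 5 may still be split into children of sizes 1 and 9, whereas the terminal-size rule with the same threshold permits only the 5/5 split. This is the sense in which limiting the parent node does not by itself bound the size of the terminal nodes.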

Authors who are presenting talks have a * after their name.

Back to the full JSM 2020 program