Abstract:
|
Random forest (RF) classification algorithms obscure the relative importance of variables because they randomly select observations and predictors while building and aggregating classification trees (CTs). Many analysts use the Gini variable importance measure (GVIM) to assess variables’ relative contribution to the final prediction. However, GVIM is biased in favor of qualitative variables with more categories. This research develops a regression-based bias-corrected GVIM (RBG) that regresses GVIM under the null (no association) on the number of categories and obtains the corrected GVIM by subtracting the regression-predicted GVIM from the raw GVIM. To investigate its performance, I conducted a Monte Carlo study that varies (1) the number of categories within qualitative predictors and (2) the level of association between the predictors and the outcome. The simulation results indicate that RBG provides a more accurate correction when the predictors are strongly associated with the outcome, implying a reduction in bias. Therefore, RBG holds promise for improving the assessment of variable importance in some settings.
|
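The correction step described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the study's implementation: the abstract does not give the functional form of the regression, so a simple linear fit on the number of categories is assumed here. The function names `fit_line` and `rbg_correct` are hypothetical, and every numeric value is a made-up placeholder rather than a simulation result.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def rbg_correct(raw_gvim, n_categories, null_fit):
    """Subtract the regression-predicted null GVIM from each raw GVIM."""
    a, b = null_fit
    return [g - (a + b * k) for g, k in zip(raw_gvim, n_categories)]


# Placeholder inputs (NOT results from the paper): GVIM values obtained
# under the null for predictors with 2, 4, 8 and 16 categories.
null_categories = [2, 4, 8, 16]
null_gvim = [1.0, 2.0, 4.0, 8.0]

# Assumption: a linear relationship between null GVIM and category count.
null_fit = fit_line(null_categories, null_gvim)

# Placeholder raw GVIM values for three predictors in a fitted forest.
raw_gvim = [3.0, 2.0, 4.5]
raw_categories = [2, 4, 8]

corrected = rbg_correct(raw_gvim, raw_categories, null_fit)
```

In this toy input the second predictor's raw GVIM equals what the null regression predicts for a four-category variable, so its corrected importance is zero, while the first predictor keeps most of its importance despite having the fewest categories.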