Abstract:
|
Random forest (RF) is a powerful tool for statistical learning. With the aid of variable importance measures (VIMs), RF can rank predictors by importance, which can be used for feature selection. However, recent research has demonstrated that VIMs are biased when predictors are correlated and when data contain missing values. Imputation is a well-established method for handling missing values. Nonetheless, it remains unclear how VIMs perform after imputation under different missingness mechanisms. A simulation study was conducted to explore this issue: a response and correlated predictors were simulated to contain missing values, which were imputed by multivariate imputation by chained equations (MICE) and then analyzed by RF and conditional inference forest (CIF). Three VIMs were compared: selection frequency (SF), unconditional permutation importance (UPI), and conditional permutation importance (CPI; CIF only). Results suggest that SF and CPI are more robust than the conventional UPI for data with missing values and/or correlated predictors. We recommend imputing missing values before applying CIF and using SF or CPI to reduce the bias due to correlated predictors.
|