Abstract:
|
In many supervised learning problems, sets of input variables have a group structure that reflects underlying associations. In such cases, modeling strategies that account for these groupings (e.g., the group lasso) are more appropriate than treating the variables individually. However, to our knowledge, few tree-based algorithms incorporate the group structure into the splitting criteria in a computationally efficient manner, especially for high-dimensional data. Here, we propose to summarize the variables within each group through group-wise principal component analysis and to use the resulting principal components to fit tree-based algorithms. We then propose new group variable importance measures and group variable selection methods for decision trees as well as random forests. Simulation studies are presented to show the comparative benefits of our method. The proposed algorithm will be applied to gene expression data sets for tumor classification, where the genes are grouped through independent component analysis following a previous analysis.
|