Friday, February 24
CS15 Machine Learning Fri, Feb 24, 3:45 PM - 5:15 PM
City Terrace 7

Tree-Based Techniques for High-Dimensional Data (303323)

*Wei-Yin Loh, University of Wisconsin 

Keywords: classification and regression trees and forests, missing data, importance scores, data visualization

This presentation shows how tree-based methods can solve some common but difficult big-data problems, such as variable selection, visualization, model fitting, and missing values. Because tree-based models are nonparametric, their complexity scales automatically with the size and complexity of the data. They are fast to compute and free of problems with convergence, collinearity, singular covariance matrices, or quasi-complete separation, which makes them practical for big-data applications. Large numbers of observations or variables can make it challenging to visualize all of the data in one piece, and models constructed from big data are typically even harder to visualize. These problems do not plague tree models. The models are transparent to visualize because they are decision trees. And by segmenting the data into smaller, more homogeneous pieces, tree models allow the data to be visualized in more meaningful, lower-dimensional pieces. Further, tree methods can screen large numbers of variables and score them by their importance for prediction, which supports variable selection. Finally, tree methods can fit models to data with missing values without requiring prior imputation or variable deletion.
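As a rough illustration of the last two points (variable screening and missing values), the sketch below scores each variable by the best Gini impurity reduction achievable with a single split, and sends missing values as a block to whichever side of the split helps more, so no imputation is needed. This is a minimal sketch and not the method presented in the talk: the scoring rule, the treatment of missing values, the function names (gini, best_split_gain), and the simulated data are all assumptions made for illustration.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_split_gain(values, labels):
    """Largest Gini reduction from one split on a single variable.

    Missing values (None) are not imputed. For each candidate cut point
    they are tried on the left side and on the right side, and the better
    of the two assignments is kept (an illustrative assumption).
    """
    n = len(labels)
    parent = gini(labels)
    missing = [y for x, y in zip(values, labels) if x is None]
    observed = sorted((x, y) for x, y in zip(values, labels) if x is not None)
    best = 0.0
    for i in range(1, len(observed)):
        if observed[i][0] == observed[i - 1][0]:
            continue  # no cut point between tied values
        left = [y for _, y in observed[:i]]
        right = [y for _, y in observed[i:]]
        for l, r in ((left + missing, right), (left, right + missing)):
            child = (len(l) * gini(l) + len(r) * gini(r)) / n
            best = max(best, parent - child)
    return best


# Simulated data (hypothetical): only variable 0 drives the class label,
# and about 20% of all values are missing at random.
random.seed(0)
n_rows, n_vars = 200, 5
X, y = [], []
for _ in range(n_rows):
    row = [random.gauss(0, 1) for _ in range(n_vars)]
    y.append(1 if row[0] > 0 else 0)
    X.append([None if random.random() < 0.2 else v for v in row])

# Screen the variables: one importance score per column, then rank.
scores = [best_split_gain([row[j] for row in X], y) for j in range(n_vars)]
ranking = sorted(range(n_vars), key=lambda j: scores[j], reverse=True)

for j in ranking:
    print(f"variable {j}: importance {scores[j]:.3f}")
```

A single split per variable keeps the sketch short. A full tree or forest would accumulate such reductions over many nodes, and practical tree software may handle missing values by other rules.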