Thursday, February 23
SC1 Art and Practice of Classification and Regression Trees
Thu, Feb 23, 8:00 AM - 5:30 PM
River Terrace 2
Instructor(s): Wei-Yin Loh, University of Wisconsin

It is more than 50 years since the first regression tree algorithm (AID, Morgan and Sonquist 1963) appeared. Rapidly increasing use of tree models among practitioners has stimulated many algorithmic advances over the last two decades. Modern tree models have higher prediction accuracy, increased computational speed, and negligible variable selection bias. They can fit linear models in the nodes using GLM, quantile, and other loss functions; response variables may be multivariate, longitudinal, or censored; and classification trees can employ linear splits and fit kernel and nearest-neighbor node models. The aims of the course are: (i) to briefly review the capabilities of the state-of-the-art methods and (ii) to show how to exploit free software to analyze data from initial data exploration to a final interpretable prediction model. Example applications include subgroup identification for precision medicine, missing value imputation, and propensity score estimation in sample surveys.

Outline & Objectives


1. Review of classification trees. Comparisons of algorithms on prediction accuracy, computation, and selection bias.

2. Review of regression trees for least squares, quantile, Poisson, and relative risk regression. Effect of collinearity, nonlinearity, variance heterogeneity, and missing data.

3. Importance scoring of variables.

4. Inference for tree models. Bootstrap trees and confidence intervals.

5. Tree ensembles.

6. Step-by-step analysis of real data using free software, from data exploration to final model. Examples include:

(a) Subgroup identification for precision medicine in a breast cancer trial with censored response.

(b) Subgroup identification in a diabetes trial with longitudinal responses.

(c) Missing value imputation and propensity score estimation for the U.S. Consumer Expenditure Survey.

(d) Analysis of data on circuit board soldering from a factorial design.

(e) Joint modeling of mother's stress and child's morbidity in a longitudinal study.


1. Reveal the power and versatility of tree models.

2. Show how to exploit advanced features of existing software.
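As a concrete illustration of the least-squares regression trees in outline item 2, here is a minimal sketch of how a single split can be chosen by exhaustive search: try each midpoint between consecutive distinct values of one predictor and keep the one that minimizes the total within-node sum of squared errors. This is the generic CART-style search, not the instructor's own algorithms and not the code of any package. The function names `sse` and `best_split` and the toy data are assumptions made up for this example.

```python
from typing import List, Optional, Tuple


def sse(values: List[float]) -> float:
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x: List[float], y: List[float]) -> Optional[Tuple[float, float]]:
    """Return (threshold, total SSE) for the best split 'x <= threshold'.

    Returns None when all x values are tied, so no split is possible.
    Hypothetical helper for illustration only.
    """
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between tied x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [p[1] for p in pairs[:i]]
        right = [p[1] for p in pairs[i:]]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best


# Toy data (made up): the response jumps between x = 3 and x = 4.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
threshold, total_sse = best_split(x, y)
print(threshold, round(total_sse, 4))
```

A full tree applies this search recursively to each child node and over every predictor. Because the search favors predictors that offer more candidate thresholds, it is one source of the variable selection bias mentioned in the course description.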

About the Instructor

Wei-Yin Loh has been actively researching classification and regression trees for almost thirty years. He is the developer or co-developer of the FACT, QUEST, CRUISE, LOTUS and GUIDE algorithms and has supervised more than twenty PhD theses in this area. He has given one- and two-day courses on classification and regression trees to professional societies (KDD 1999, 2001; JSM 2007, 2011, 2013, 2015; U.S. Army Applied Statistics Conference 1995, 1999; Interface Conference 2013; ASA Northeastern Illinois Chapter 2014; ICSA Applied Statistics Symposium 2015; Midwest Biopharmaceutical Statistics Workshop 2015; Washington Statistical Society 2015), major biopharmaceutical companies, and overseas universities (National University of Singapore 2010 and 2014; East China Normal University 2012; National Tsinghua University, Taiwan, 2012; City University of Hong Kong 2014). He is a consultant on regression tree methods to government and industry. He regularly teaches a semester-long graduate course on the subject at the University of Wisconsin, Madison.

Relevance to Conference Goals

The course will introduce beginners to a set of statistical tools that is time-tested and increasingly used in academia and industry. It will teach non-beginners how to exploit the features of existing software, such as their use in simulation experiments. In addition, the course will teach attendees how to respond to questions about tree models, such as their interpretation and statistical significance, and how they compare with models from traditional methods in terms of prediction accuracy, underlying assumptions, and sensitivity to outliers, collinearity, and missing values. The course will have an immediate positive impact on statisticians' work by expanding their array of tools.

Software Packages


R packages: rpart, randomForest, party