2013 Joint Statistical Meetings - Celebrating the International Year of Statistics

Activity Number:	245
Type:	Contributed
Date/Time:	Monday, August 5, 2013 : 2:00 PM to 3:50 PM
Sponsor:	Section on Statistical Computing
Abstract - #308805
Title:	Generating CHAID Trees on Large and Distributed Data
Author(s):	Damir Spisic*+ and Jing Xu and Xue Ying Zhang
Companies:	IBM and IBM and IBM
Keywords:	CHAID ; distributed data ; MapReduce
Abstract:	CHAID is one of the most established and popular algorithms for building tree models. The target variable can be either continuous or categorical, while the predictors must all be categorical. Any continuous predictors are binned before they are used in the model. When the model is generated, each node can be split into multiple child nodes. The original algorithm is effective for small and medium data sets. However, generating tree models on distributed data containing a large number of records and predictors requires a new approach. We present a distributed algorithm for CHAID trees, implemented in the MapReduce model of distributed computation. It executes the same number of data passes as the original approach, but the computation in each data pass is fully parallelized, taking advantage of all available computational resources. We demonstrate the scalability of the distributed algorithm through experiments on a Hadoop cluster.
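
The abstract does not spell out how each data pass is parallelized, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a categorical target and relies on the fact that contingency counts are additive: each mapper tabulates its own partition, a reducer sums the partial tables, and the chi-square statistic for every candidate predictor is then computed from the merged table. The function names (`map_partition`, `reduce_counts`, `chi_square`) and the toy records are hypothetical, and plain Python stands in for Hadoop. Category merging and the Bonferroni-adjusted p-values that CHAID also uses are omitted.

```python
from collections import Counter
from functools import reduce


def map_partition(rows, predictors, target):
    """Map step: count (predictor, category, target class) triples in one partition."""
    counts = Counter()
    for row in rows:
        for p in predictors:
            counts[(p, row[p], row[target])] += 1
    return counts


def reduce_counts(a, b):
    """Reduce step: partial contingency tables are additive, so they simply sum."""
    return a + b


def chi_square(counts, predictor):
    """Pearson chi-square statistic from the merged table of one predictor."""
    cells = {(c, t): n for (p, c, t), n in counts.items() if p == predictor}
    row_tot, col_tot = Counter(), Counter()
    for (c, t), n in cells.items():
        row_tot[c] += n
        col_tot[t] += n
    total = sum(row_tot.values())
    stat = 0.0
    for c in row_tot:
        for t in col_tot:
            expected = row_tot[c] * col_tot[t] / total
            stat += (cells.get((c, t), 0) - expected) ** 2 / expected
    return stat


# Hypothetical toy data, split into two partitions as if stored on two nodes.
rows = [
    {"color": "red", "size": "S", "y": "yes"},
    {"color": "red", "size": "L", "y": "yes"},
    {"color": "blue", "size": "S", "y": "no"},
    {"color": "blue", "size": "L", "y": "no"},
]
partitions = [rows[:2], rows[2:]]
predictors = ["color", "size"]

# One pass over the data: every partition is tabulated independently, then merged.
merged = reduce(reduce_counts, (map_partition(p, predictors, "y") for p in partitions))

# The node would be split on the predictor with the strongest association.
best = max(predictors, key=lambda p: chi_square(merged, p))
```

Because the merged table is identical to the one a single machine would build from the full data, a scheme like this needs no more data passes than a sequential implementation; only the counting inside each pass is spread across workers.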

Authors who are presenting talks have a * after their name.

The views expressed here are those of the individual authors and not necessarily those of the JSM sponsors, their officers, or their staff.