Friday, February 16  
CS11 Data Mining Algorithms 
Fri, Feb 16, 2:00 PM – 3:30 PM
Salon D 
Stochastic Gradient Boosting on Distributed Data (303536)
*Roxy Cramer, Rogue Wave Software
Keywords: stochastic gradient boosting, decision trees, distributed data, scaling up, predictive models

Stochastic gradient boosting (SGB) is a popular and effective predictive model, but scaling it to distributed data is complicated by the fact that the algorithm is sequential. Computing the exact estimate requires, in effect, implementing the algorithm across the network so that every iteration involves all of the data, a level of engineering that may be out of reach for many practitioners in the short term. We propose an approximate method that fits SGB on each processor individually but communicates, from one node to the next, partitions of the data that are used to initialize the SGB on the next processor. We compare this method to the exact method and to one other approximate method on several data sets. The advantage of the approach is that it can serve as a proof of concept, establishing whether the problem is predictable from the data before one invests substantially more effort for potentially higher accuracy.
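The sequential-initialization idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: where their method passes data partitions between nodes, the toy version below simply warm-starts each node's boosting from the ensemble accumulated on earlier nodes, using regression stumps as the base learners. All function names and parameters here are illustrative assumptions.

```python
import numpy as np

def fit_stump(X, r):
    # Fit a one-split regression stump to residuals r (illustrative base learner).
    best = (np.inf, 0, 0.0, float(r.mean()), float(r.mean()))
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j])[:-1]:  # exclude max so both sides are nonempty
            left = X[:, j] <= thr
            lmean, rmean = r[left].mean(), r[~left].mean()
            sse = ((r[left] - lmean) ** 2).sum() + ((r[~left] - rmean) ** 2).sum()
            if sse < best[0]:
                best = (sse, j, thr, lmean, rmean)
    return best[1:]

def stump_predict(stump, X):
    j, thr, lmean, rmean = stump
    return np.where(X[:, j] <= thr, lmean, rmean)

def chained_sgb(partitions, n_trees=20, lr=0.1, subsample=0.5, seed=0):
    """Approximate distributed SGB: each 'node' runs stochastic gradient
    boosting on its own partition, initialized from the ensemble built on
    the previous nodes (a stand-in for the paper's partition-passing step)."""
    rng = np.random.default_rng(seed)
    base = partitions[0][1].mean()  # constant initial prediction
    stumps = []
    for X, y in partitions:
        # Start this node from the chained ensemble's predictions.
        pred = base + sum(lr * stump_predict(s, X) for s in stumps)
        for _ in range(n_trees):
            # Stochastic step: fit each stump on a random subsample.
            idx = rng.choice(len(y), size=max(2, int(subsample * len(y))),
                             replace=False)
            s = fit_stump(X[idx], (y - pred)[idx])
            stumps.append(s)
            pred = pred + lr * stump_predict(s, X)
    return base, stumps

def predict(base, stumps, X, lr=0.1):
    return base + sum(lr * stump_predict(s, X) for s in stumps)
```

An exact distributed fit would instead broadcast every residual update across all nodes; the sketch trades that communication for a single sequential pass, which is the spirit of the approximation described above.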
