Friday, February 16, 2:00 PM - 3:30 PM
CS11 Data Mining Algorithms
Stochastic Gradient Boosting on Distributed Data (303536)
Keywords: Stochastic gradient boosting, decision trees, distributed data, scaling up, predictive models
Stochastic gradient boosting (SGB) is a popular and effective predictive modeling method, but scaling it to distributed data is complicated by the fact that the algorithm is inherently sequential. Computing the exact estimate requires, in effect, implementing the algorithm across the network so that every iteration involves all of the data, a level of engineering that may be out of reach for many practitioners in the short term. We propose an approximate method that fits SGB on each processor individually but passes partitions of the data from one node to the next, where they are used to initialize the SGB fit on that processor. We compare the method to the exact method and to one other approximate method on several data sets. The advantage of the approach is that it can serve as a proof of concept, establishing whether the problem is predictable given the data, before investing considerably more effort for potentially greater accuracy.
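The abstract does not specify implementation details, so the following is only a toy sketch of the idea under stated assumptions: a minimal SGB for squared loss using depth-1 regression stumps on a single feature, plus a `chained_sgb` routine that trains on one node's partition at a time, initializing each node's fit from the previous node's model and carrying forward a random sample of the previous partition. All function names (`fit_stump`, `sgb_fit`, `chained_sgb`) and the `carry_frac` parameter are hypothetical, and the chaining scheme is one plausible reading of the proposed approximation, not the authors' actual method.

```python
import random

def fit_stump(X, y):
    """Depth-1 regression tree on a 1-D feature: best split by SSE."""
    best = None
    for t in sorted(set(X)):
        left = [yi for xi, yi in zip(X, y) if xi <= t]
        right = [yi for xi, yi in zip(X, y) if xi > t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((v - ml) ** 2 for v in left)
               + sum((v - mr) ** 2 for v in right))
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:                      # constant feature: predict the mean
        m = sum(y) / len(y)
        return (float("inf"), m, m)
    return best[1:]                       # (threshold, left mean, right mean)

def stump_predict(stump, x):
    t, ml, mr = stump
    return ml if x <= t else mr

def sgb_fit(X, y, n_rounds=40, lr=0.1, subsample=0.7, init=None, seed=0):
    """SGB for squared loss: each round fits a stump to the residuals on a
    random subsample.  `init` lets a previously fitted model supply the
    starting predictions, which is what the chaining below relies on."""
    rng = random.Random(seed)
    base = init if init is not None else (lambda x: 0.0)
    F = [base(x) for x in X]              # current ensemble predictions
    stumps = []
    for _ in range(n_rounds):
        k = max(2, int(subsample * len(X)))
        idx = rng.sample(range(len(X)), k)
        s = fit_stump([X[i] for i in idx], [y[i] - F[i] for i in idx])
        stumps.append(s)
        F = [Fi + lr * stump_predict(s, xi) for Fi, xi in zip(F, X)]
    return lambda x: base(x) + sum(lr * stump_predict(s, x) for s in stumps)

def chained_sgb(partitions, carry_frac=0.2, seed=0):
    """Approximate distributed fit: train on each node's partition in turn,
    initializing from the previous node's model and forwarding a random
    fraction of the previous partition to the next node."""
    rng = random.Random(seed)
    model, carry = None, ([], [])
    for X, y in partitions:
        Xc, yc = X + carry[0], y + carry[1]
        model = sgb_fit(Xc, yc, init=model, seed=seed)
        k = max(1, int(carry_frac * len(X)))
        idx = rng.sample(range(len(X)), k)
        carry = ([X[i] for i in idx], [y[i] for i in idx])
    return model
```

A short usage sketch: split a simulated data set across three "nodes" and fit the chained model, then compare its error to a mean-only baseline.

```python
rng = random.Random(1)
X = [rng.uniform(-3, 3) for _ in range(300)]
y = [xi ** 2 + rng.gauss(0, 0.1) for xi in X]
parts = [(X[i::3], y[i::3]) for i in range(3)]   # three equal partitions
model = chained_sgb(parts)
```

Only the final node ever sees the full chain of models, so no node needs the full data, which is the point of the approximation; the cost is that early partitions influence later boosting rounds only through the initial predictions and the small carried-forward sample.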