Abstract:
|
Modeling with high-dimensional data is both challenging and valuable. A variety of predictive models, e.g. CART, random forests, bagging, neural networks, and support vector machines, have been shown to provide useful out-of-sample predictions. An alternative approach, known as stochastic gradient boosting (Friedman, 2001; Friedman et al., 2000), has demonstrated remarkable results and is therefore often a preferred choice for predictive modeling. However, unlike random forests, which are well known for their scalability, stochastic gradient boosting is limited in both the speed at which it runs and the size of the data it can effectively handle; in other words, it does not "scale" well to big data. To address this issue we employ XGBoost (eXtreme Gradient Boosting), developed by Tianqi Chen and Carlos Guestrin at the University of Washington, which provides an efficient and scalable implementation of gradient boosting. Our study seeks to predict the on-time arrival behavior of flights using data from the RITA database. We show that XGBoost not only provides comparatively high predictive performance but also ensures scalability of the model.
|