Abstract Details

Activity Number: 274 - Random Forests in Big Data, Machine Learning and Statistics
Type: Invited
Date/Time: Tuesday, July 31, 2018 : 8:30 AM to 10:20 AM
Sponsor: Section on Statistical Learning and Data Science
Abstract #326504 Presentation
Title: Random Forests for Big Data
Author(s): Jean-Michel Poggi* and Robin Genuer and Nathalie Villa-Vialaneix and Christine Tuleau-Malot
Companies: LMO, University Paris Sud and ISPED, Univ. Bordeaux and MIA-T, INRA of Toulouse and University Nice, CNRS, LJAD
Keywords: Random forest; Parallel computing; Bag of little bootstraps; Big Data; On-line learning; R

Big Data is a major challenge for statistical science and has numerous algorithmic and theoretical consequences. Big Data always involves massive data and often also includes online data and data heterogeneity. Statistical methods such as linear regression models, clustering methods and bootstrapping schemes have recently been adapted to process Big Data. Based on decision trees combined with aggregation and bootstrap ideas, random forests (RF) are a powerful nonparametric statistical method that handles both regression and classification problems in a single, versatile framework. Focusing on classification problems, this talk reviews proposals for scaling random forests to Big Data problems. These proposals rely on parallel environments or on online adaptations of RF. We also describe how the out-of-bag error is addressed in these methods. Then, we formulate various remarks on RF in the Big Data context. Finally, we experiment with five variants on two massive datasets, one simulated and one real-world. These numerical experiments highlight the relative performance of the different variants, as well as some of their limitations.
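
As a rough illustration of the bootstrap, aggregation and out-of-bag (OOB) ideas mentioned in the abstract, the following Python sketch bags very simple classifiers and estimates the OOB error. It is not the authors' method or any of the five variants discussed in the talk: the one-split "stump" is a stand-in for a full decision tree, and the names `fit_stump`, `bagged_stumps_oob` and `make_toy_data` are hypothetical helpers introduced only for this example.

```python
import random
from collections import Counter


def fit_stump(X, y, rng):
    """Fit a one-split classifier on one randomly chosen feature.

    Assumption: this toy stump stands in for a full decision tree.
    """
    j = rng.randrange(len(X[0]))  # random feature, in the spirit of RF
    best = None
    for t in sorted({row[j] for row in X}):
        left = [lab for row, lab in zip(X, y) if row[j] <= t]
        right = [lab for row, lab in zip(X, y) if row[j] > t]
        l_lab = Counter(left).most_common(1)[0][0]
        r_lab = Counter(right).most_common(1)[0][0] if right else l_lab
        errors = sum(lab != l_lab for lab in left)
        errors += sum(lab != r_lab for lab in right)
        if best is None or errors < best[0]:
            best = (errors, t, l_lab, r_lab)
    _, t, l_lab, r_lab = best
    return lambda row: l_lab if row[j] <= t else r_lab


def bagged_stumps_oob(X, y, n_trees=50, seed=0):
    """Bag stumps on bootstrap samples and return (trees, OOB error)."""
    rng = random.Random(seed)
    n = len(X)
    votes = [Counter() for _ in range(n)]  # OOB votes per observation
    trees = []
    for _ in range(n_trees):
        # Bootstrap: draw n indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        in_bag = set(idx)
        tree = fit_stump([X[i] for i in idx], [y[i] for i in idx], rng)
        trees.append(tree)
        # Each tree votes only on observations it did not see.
        for i in range(n):
            if i not in in_bag:
                votes[i][tree(X[i])] += 1
    # Aggregate by majority vote over the observations that got OOB votes.
    scored = [i for i in range(n) if votes[i]]
    wrong = sum(votes[i].most_common(1)[0][0] != y[i] for i in scored)
    oob_error = wrong / len(scored) if scored else float("nan")
    return trees, oob_error


def make_toy_data(n, seed=0):
    """Hypothetical toy data: the label depends on the first feature only."""
    rng = random.Random(seed)
    X = [[rng.random(), rng.random()] for _ in range(n)]
    y = [int(row[0] > 0.5) for row in X]
    return X, y
```

Calling `bagged_stumps_oob(*make_toy_data(200))` returns the fitted stumps and an OOB error between 0 and 1. The sketch keeps every bootstrap sample and every OOB vote in memory on a single machine, which is the kind of cost that motivates the parallel and online adaptations the talk reviews.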

Authors who are presenting talks have a * after their name.

Back to the full JSM 2018 program