
Abstract Details

Activity Number: 526
Type: Topic Contributed
Date/Time: Wednesday, August 3, 2016 : 10:30 AM to 12:20 PM
Sponsor: Section on Statistical Computing
Abstract #319796
Title: Scaling Up Statistical Models to Hadoop Using Tessera
Author(s): Jim Harner*
Companies: West Virginia University
Keywords: divide and recombine ; big data ; trellis displays ; R and Hadoop ; Tessera ; logistic regression
Abstract:

Building statistical models for large, complex data in R is challenging because of R's design constraints. Bill Cleveland and his students built Tessera, a computational environment based on divide and recombine (D&R), to overcome R's big-data limitations. The components of this environment are illustrated using logistic regression to analyze web data. D&R, as implemented in Tessera's datadr R package, allows these models to be scaled from in-memory, single-core R, to local-disk, multicore R, to R on the Hadoop Distributed File System (HDFS) using Tessera's Rhipe package. Trellis displays, as implemented in Tessera's trelliscope R package, are used to gain insight into the web data through big-data visualizations. The analyses are run by provisioning Tessera on a single-node Vagrant VM. A new programming architecture, based on Linux, Mesos, and Docker containers, demonstrates the potential for running Tessera and other big-data platforms in a user-friendly but powerful way.
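
The divide-and-recombine idea in the abstract can be illustrated with a small, self-contained sketch. This is not Tessera's datadr or Rhipe API, and it is written in Python rather than R; the function names (simulate, divide, fit_logistic, recombine), the synthetic data, and the choice of a simple coefficient average as the recombination step are all assumptions made for illustration. The data are divided into subsets, a logistic regression is fitted to each subset independently, and the per-subset coefficients are averaged.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def simulate(n, b0, b1, seed=0):
    # Hypothetical synthetic data: one predictor x and a binary response y.
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = 1 if rng.random() < sigmoid(b0 + b1 * x) else 0
        rows.append((x, y))
    return rows


def fit_logistic(rows, steps=300, lr=0.5):
    # Fit an intercept and a slope by batch gradient ascent
    # on the mean log-likelihood.
    b0, b1 = 0.0, 0.0
    n = len(rows)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in rows:
            resid = y - sigmoid(b0 + b1 * x)
            g0 += resid
            g1 += resid * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


def divide(rows, n_subsets):
    # Divide step: deal the rows into n_subsets interleaved subsets.
    return [rows[i::n_subsets] for i in range(n_subsets)]


def recombine(fits):
    # Recombine step (assumed here): average the per-subset coefficients.
    k = len(fits)
    return (sum(f[0] for f in fits) / k, sum(f[1] for f in fits) / k)


rows = simulate(2000, b0=-0.5, b1=1.5)
subsets = divide(rows, 8)

# Each subset fit is independent of the others, which is what would let
# the work be spread across cores or cluster nodes.
dr_fit = recombine([fit_logistic(s) for s in subsets])
full_fit = fit_logistic(rows)

print("D&R estimate:      ", dr_fit)
print("Full-data estimate:", full_fit)
```

The recombined estimate is an approximation to the full-data fit, not an exact reproduction of it; how close the two are depends on the subset size and on how the data are divided.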


Authors who are presenting talks have a * after their name.

Back to the full JSM 2016 program

 
 
Copyright © American Statistical Association