JSM 2016 Online Program

Activity Number:	526
Type:	Topic Contributed
Date/Time:	Wednesday, August 3, 2016 : 10:30 AM to 12:20 PM
Sponsor:	Section on Statistical Computing
Abstract #319796
Title:	Scaling Up Statistical Models to Hadoop Using Tessera
Author(s):	Jim Harner*
Companies:	West Virginia University
Keywords:	divide and recombine ; big data ; trellis displays ; R and Hadoop ; Tessera ; logistic regression
Abstract:	Building statistical models for large, complex data in R is challenging because of R's design constraints. Bill Cleveland and his students built Tessera, a computational environment based on divide and recombine (D&R), to overcome R's big-data limitations. The components of this environment are illustrated using logistic regression to analyze web data. D&R (as implemented in Tessera's datadr R package) allows these models to be scaled from in-memory, single-core R, to local-disk, multicore R, to the Hadoop Distributed File System (HDFS) with R and Hadoop (using Tessera's Rhipe package). Trellis displays (as implemented in Tessera's trelliscope R package) are used to gain insight into the web data through big-data visualizations. The analyses are run by provisioning Tessera on a single-node Vagrant VM. A new programming architecture, based on Linux, Mesos, and Docker containers, demonstrates the potential for running Tessera and other big-data platforms in a user-friendly but powerful way.
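
The divide-and-recombine idea in the abstract can be sketched in a few lines. The example below is an illustration only, written in plain Python rather than R, and it does not use or imitate the datadr or Rhipe APIs. It assumes a random division of the rows into equal-sized subsets, a one-predictor logistic regression fitted to each subset by Newton-Raphson, and recombination by averaging the per-subset coefficients. The data are simulated, and every function name (`simulate`, `divide`, `fit_logistic`, `recombine`) is hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def simulate(n, b0, b1, seed=1):
    """Simulated (x, y) rows from a logistic model; stands in for real data."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = 1 if rng.random() < sigmoid(b0 + b1 * x) else 0
        rows.append((x, y))
    return rows


def divide(rows, k, seed=0):
    """Divide step: shuffle the rows and deal them into k subsets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def fit_logistic(rows, iters=25):
    """Apply step: Newton-Raphson for intercept b0 and slope b1."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in rows:
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p            # gradient of the log-likelihood
            g1 += (y - p) * x
            h00 += w               # entries of the information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det < 1e-12:            # singular information matrix: stop
            break
        # Newton step: solve the 2x2 system explicitly
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def recombine(estimates):
    """Recombine step: average the per-subset coefficient estimates."""
    n = len(estimates)
    return tuple(sum(e[j] for e in estimates) / n for j in range(2))


rows = simulate(4000, b0=-0.5, b1=1.2)
subsets = divide(rows, k=8)
per_subset = [fit_logistic(s) for s in subsets]
dr_fit = recombine(per_subset)
full_fit = fit_logistic(rows)

print("D&R estimate (b0, b1):      ", dr_fit)
print("Full-data estimate (b0, b1):", full_fit)
```

Because each subset fit depends only on its own rows, the apply step could run on separate cores or cluster nodes, which is the property that lets the same analysis move from a single machine to a distributed back end. The averaged estimate is an approximation to the full-data fit, not an exact reproduction of it.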

Authors who are presenting talks have a * after their name.