Abstract:
|
Building statistical models for large, complex data in R is challenging because base R keeps data in memory and runs on a single core. Bill Cleveland and his students built Tessera, a computational environment based on divide and recombine (D&R), to overcome these big-data limitations. The components of this environment are illustrated by fitting logistic regression models to web data. D&R (as implemented in Tessera's datadr R package) allows these models to be scaled from in-memory data with single-core R, to local-disk data with multicore R, to data on the Hadoop Distributed File System (HDFS) with R and Hadoop (using Tessera's Rhipe package). Trellis displays (as implemented in Tessera's trelliscope R package) provide big-data visualizations that give insight into the web data. The analyses are run on a single-node Vagrant VM provisioned with Tessera. A new programming architecture, based on Linux, Mesos, and Docker containers, demonstrates the potential for running Tessera and other big-data platforms in a user-friendly yet powerful way.
|