Online Program

All Times EDT

Thursday, June 4
Data Visualization
Divide and Recombine for Big Data Analysis and Visualization
Thu, Jun 4, 1:20 PM - 2:55 PM
TBD
 

Divide and Recombine (D&R) with R-RHIPE-Hadoop Software (308168)

*William S. Cleveland, Purdue University 

Keywords: key-value pairs, MapReduce, Hadoop Distributed File System, statistical theory, data visualization

D&R: The analyst divides the data into subsets. Each analytic method is applied to the subsets in parallel, with no communication among the subset processes. The subset outputs are then recombined in parallel, with communication among the processes. The goal can be a single result for all the data, for example a logistic regression. D&R research seeks division and recombination methods that optimize statistical accuracy, resulting in "statistical division". Often, the division is instead a "subject-matter division", carried out by conditioning on variables important to the analysis. R-RHIPE-Hadoop software implements D&R. The user programs analyses in R, the front end, with Hadoop as the back end. RHIPE (R & Hadoop Integrated Programming Environment) provides communication between front and back ends, and R functions to aid in programming D&R. The hardware can be a cluster or a multicore machine. Hadoop writes subsets and outputs, spreading them across the Hadoop Distributed File System. Hadoop Map does the parallel computation of the analytic method applied to the subsets, without process communication. Hadoop Reduce does the parallel recombination computation on the outputs, with process communication. WHAT YOU GET: (1) D&R with R-RHIPE-Hadoop enables deep analysis, which means analysis of the data in detail at their finest granularity. This includes a powerful framework for deep visualization. (2) D&R with R-RHIPE-Hadoop can dramatically increase both the computational complexity of the analytic methods and the data size that are feasible. (3) Data can have a memory size larger than the physical memory of the hardware because subsets and outputs are put in memory sequentially as they are analyzed, not all at once. (4) D&R with R-RHIPE-Hadoop provides a programming environment for D&R that is very efficient for users, who are protected from having to manage the details of parallel computing and database management. (5) All of this is illustrated by an application to 10,621,808,809 queries to the Spamhaus blacklisting service.
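
The divide, apply, and recombine steps described in the abstract can be illustrated with a minimal sketch. It is written in plain Python rather than R-RHIPE-Hadoop and does not use the RHIPE API. The function names (`divide`, `fit_line`, `recombine`), the round-robin division, the least-squares line fit (standing in for the logistic regression mentioned above), and the recombination by simple averaging are all assumptions chosen for illustration, not the methods of the talk. The per-subset fits run sequentially here; in the setup described above, Hadoop Map would run them in parallel and Hadoop Reduce would do the recombination.

```python
import random
from statistics import fmean


def divide(records, n_subsets):
    """Division step: assign records to subsets round-robin (illustrative choice)."""
    return [records[i::n_subsets] for i in range(n_subsets)]


def fit_line(subset):
    """Analytic method for one subset: least-squares (slope, intercept) of y on x."""
    xs = [x for x, _ in subset]
    ys = [y for _, y in subset]
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in subset)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def recombine(outputs):
    """Recombination step: average the subset estimates into one result."""
    slopes, intercepts = zip(*outputs)
    return fmean(slopes), fmean(intercepts)


def divide_and_recombine(records, n_subsets):
    subsets = divide(records, n_subsets)
    # Each fit uses only its own subset, so no communication is needed
    # between subset computations (the role of Hadoop Map in the abstract).
    outputs = [fit_line(subset) for subset in subsets]
    # Recombination needs all subset outputs (the role of Hadoop Reduce).
    return recombine(outputs)


# Synthetic data: y = 2x + 1 plus small Gaussian noise (hypothetical example data).
rng = random.Random(0)
data = []
for _ in range(1000):
    x = rng.uniform(0.0, 10.0)
    data.append((x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.1)))

slope, intercept = divide_and_recombine(data, n_subsets=10)
print(f"D&R estimate: slope={slope:.3f}, intercept={intercept:.3f}")
```

Only one subset needs to be held in memory while its fit is computed, which is the property behind point (3). How closely the averaged estimate matches a fit to all the data depends on the division and recombination methods, which is the question D&R research addresses.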