Online Program

Friday, May 18
Applications
Applications of Divide and Recombine to Big Data
Fri, May 18, 1:30 PM - 3:00 PM
Lake Fairfax A
 

Divide & Recombine (D&R) with DeltaRho for Big Data Analysis (304537)

Presentation

*William S. Cleveland, Purdue 

Keywords: big data, visualization, R language and environment, Hadoop, parallel computing

In D&R, the analyst divides the data into subsets by a D&R division method. Each analytic method is applied to each subset independently, with no communication between subsets. The outputs of each analytic method are recombined by a D&R recombination method. Sometimes the goal is one result for all of the data, such as a logistic regression; D&R theory and methods seek division and recombination methods that maximize statistical accuracy. In practice, division is commonly based on the subject matter: the data are divided by conditioning on variables important to the analysis, and the outputs can be the final result or can be analyzed further. Much of D&R computation is of the simplest kind: embarrassingly parallel. DeltaRho D&R software is open-source (www.deltarho.org). The front end is the DeltaRho R package datadr. The back end is a distributed database and parallel compute engine (DD-PCE) that spreads subsets and outputs across a database and executes the analyst's R and datadr code in parallel. The DeltaRho software component RHIPE integrates datadr with the widely used Hadoop DD-PCE. With D&R, we get deep analysis, which means analysis of the data at their finest granularity, including visualization. We get all of the tasks of data analysis, not just optimization. Through R we have access to the thousands of methods of statistics, machine learning, and visualization. DeltaRho makes it easy to program D&R, protecting the analyst from the details of parallel computation and database management. DeltaRho can dramatically increase the data size and analytic computational complexity that are feasible in practice, whether the available hardware power is small, medium, or large. This performance does not require that all of the data reside in memory at the same time, a requirement that would be a severe limitation for a large fraction of analyses in practice. In fact, the data can be larger than the physical memory.
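
The divide, apply, and recombine steps described in the abstract can be illustrated with a minimal sketch. It is written in plain Python rather than R, and it does not use datadr, RHIPE, or Hadoop. The function names (`divide`, `fit_line`, `recombine_mean`, `divide_and_recombine`), the toy records, and the choice of a size-weighted average as the recombination method are all assumptions made for illustration, not the DeltaRho API or a method taken from D&R theory.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from statistics import fmean


def divide(records, key):
    """Division: group records into subsets by a conditioning variable."""
    subsets = defaultdict(list)
    for rec in records:
        subsets[rec[key]].append(rec)
    return dict(subsets)


def fit_line(subset):
    """Analytic method for one subset: least-squares fit of y on x."""
    xs = [r["x"] for r in subset]
    ys = [r["y"] for r in subset]
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {"slope": slope, "intercept": y_bar - slope * x_bar, "n": len(subset)}


def recombine_mean(outputs):
    """Recombination (illustrative choice): size-weighted mean of coefficients."""
    total = sum(o["n"] for o in outputs)
    return {
        name: sum(o[name] * o["n"] for o in outputs) / total
        for name in ("slope", "intercept")
    }


def divide_and_recombine(records, key):
    subsets = divide(records, key)
    # Each subset is processed independently, with no communication between
    # subsets, so the work is embarrassingly parallel.
    with ThreadPoolExecutor() as pool:
        outputs = dict(zip(subsets, pool.map(fit_line, subsets.values())))
    return outputs, recombine_mean(list(outputs.values()))


# Hypothetical toy data: two sites with the same slope, different intercepts.
records = [
    {"site": site, "x": float(x), "y": 2.0 * x + offset}
    for site, offset in (("A", 1.0), ("B", 3.0))
    for x in range(5)
]

per_subset, overall = divide_and_recombine(records, "site")
print(per_subset)
print(overall)
```

Here `per_subset` corresponds to the case where the per-subset outputs are themselves the result of interest, and `overall` to the case where one result for all of the data is wanted. In a real D&R setting the subsets would live in a distributed back end rather than in one process's memory.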