Online Program

All Times ET

Friday, June 4
Data Visualization
Honoring the Data Science Accomplishments of William S. Cleveland and John W. Tukey
Fri, Jun 4, 3:20 PM - 4:55 PM
TBD
 

Bill Cleveland's Contributions to Analysis of Big Complex Data (309872)

*Wen-wen Tung, Purdue University 
Doug Crabill, Purdue University 

Keywords: big data, statistical theory, distributed parallel computing, MapReduce, Hadoop Distributed File System, climate data analysis

Data science focuses on data analysis, and research in its technical areas makes data analysis more effective. Computational performance depends on data size, the computational complexity of the analytic routines, and the available hardware power. Here, we discuss Bill Cleveland's contributions to analyzing big complex data through a statistical approach combined with distributed parallel computing. In Divide and Recombine (D&R), the data are divided into subsets; each analytic method is applied to the subsets in parallel, with no communication among the subset processes; and the subset outputs are then recombined in parallel, with communication among the processes. Cleveland and Guha (2010) developed the R & Hadoop Integrated Programming Environment (RHIPE) software to implement D&R, enabling users to program analyses in R at the front end with the Hadoop system at the back end. Hadoop distributes data across the Hadoop Distributed File System (HDFS), and Hadoop MapReduce executes D&R with distributed parallel computing. Harner and Cleveland had planned a hierarchy of deployments, from standalone to the cloud, augmented with rspark. We will demonstrate the research outcomes enabled by a YARN-Hadoop cluster deployment that integrated D&R and MapReduce using RHIPE while exploiting HDFS as a massive data lake. Specifically, we discuss the significance of Cleveland's collaborative research with subject-matter experts on tracking "atmospheric rivers" over the US and using that information to predict rain.
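
The D&R pattern described in the abstract can be illustrated with a minimal single-machine sketch. This is not RHIPE and does not use Hadoop: a Python thread pool stands in for the independent subset processes, the analytic method (a least-squares slope) and the recombination rule (a plain average of subset estimates) are assumptions chosen for illustration, and the recombination here runs serially rather than in parallel. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import fmean


def divide(records, n_subsets):
    """Divide step: deal the records round-robin into n_subsets subsets."""
    return [records[i::n_subsets] for i in range(n_subsets)]


def fit_slope(subset):
    """Analytic method for one subset: least-squares slope of y on x.

    Each call sees only its own subset, so no communication among
    subset computations is needed.
    """
    xs = [x for x, _ in subset]
    ys = [y for _, y in subset]
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in subset)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


def recombine(subset_outputs):
    """Recombine step (assumed rule): average the per-subset estimates."""
    return fmean(subset_outputs)


def divide_and_recombine(records, n_subsets=4):
    subsets = divide(records, n_subsets)
    # Threads stand in for the distributed subset processes.
    with ThreadPoolExecutor() as pool:
        outputs = list(pool.map(fit_slope, subsets))
    return recombine(outputs)


# Toy data lying exactly on the line y = 2x + 1.
records = [(float(x), 2.0 * x + 1.0) for x in range(100)]
estimate = divide_and_recombine(records)
print(f"D&R slope estimate: {estimate:.6f}")
```

In this sketch the division is a simple round-robin split; how the data are divided and which recombination rule is used both affect the statistical accuracy of the result, and the choices above are placeholders rather than recommendations.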