Online Program

Thursday, May 17
Data Science
Big Data Analytics Using R and Spark
Thu, May 17, 1:30 PM - 3:00 PM
Grand Ballroom G
 

Data Science Workflows (304377)

*Jim Harner, West Virginia University 

Keywords: dplyr, Spark, workflows, machine learning, Docker containers

The data science process is implemented using R as the programming language and Spark as the big-data platform. Powerful workflows are developed for data extraction, tidying, and transformation. The workflows use 'dplyr' verbs for data manipulation within R, within relational databases (PostgreSQL), and within Spark (using 'sparklyr').

These data-based workflows extend into machine learning algorithms, model evaluation, and data visualization. The machine learning algorithms include supervised techniques in which feature selection is done primarily by regularization. Models are evaluated using various metrics. Unsupervised techniques and TensorFlow are briefly discussed.
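To illustrate how regularization can act as feature selection, here is a minimal sketch using the lasso (an L1 penalty) fitted by cyclic coordinate descent. The abstract does not say which regularization method the talk uses, so the lasso is an assumption, as are the synthetic data, the penalty value, and every function and variable name below. The sketch is in plain Python rather than the R and Spark tooling the talk covers.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero; anything within the threshold becomes exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_iter=200):
    """Minimize (1/2n)*||y - Xw||^2 + penalty*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of feature j with the residual that excludes j's own contribution.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, penalty) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_w
    return w


# Synthetic data (hypothetical): only features 0 and 1 influence the response.
rng = random.Random(0)
n_rows, n_features = 200, 5
X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_rows)]
y = [3.0 * row[0] - 2.0 * row[1] + rng.gauss(0.0, 0.5) for row in X]

weights = lasso_coordinate_descent(X, y, penalty=0.3)
selected = [j for j, w_j in enumerate(weights) if w_j != 0.0]
print("coefficients:", [round(w_j, 3) for w_j in weights])
print("selected features:", selected)
```

The soft-thresholding step is what performs the selection: a feature whose covariance with the current residual stays below the penalty gets a coefficient of exactly zero, so it drops out of the model. Larger penalties remove more features, at the cost of shrinking the coefficients that remain.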

The underlying big-data architecture is discussed, including the Docker containers used to build the infrastructure, called 'rspark' (https://github.com/jharner/rspark). The Docker containers can be run on the desktop, run using Vagrant, or deployed to Amazon Web Services (AWS).