Data Science Workflows Using R and Spark (ADDED FEE) — Professional Development Continuing Education Course
ASA
This short course covers the data science process using R as the programming language and Spark as the big-data platform. Powerful workflows are developed for data extraction, data transformation and tidying, data modeling, and data visualization. The course is taught using a Dockerized virtual cluster with containers for R and RStudio, PostgreSQL, Hadoop, Spark, and various NoSQL databases. The interface to the computational environment is a modern web browser, whether the Docker deployment is local or remote. During the course, R-based and bash illustrations show how data is transported, using REST APIs, sockets, etc., into persistent data stores such as the Hadoop Distributed File System (HDFS), NoSQL databases, and relational databases, and in some cases sent directly to Spark's real-time compute engine. Workflows built on dplyr verbs handle data manipulation within R, within relational databases (PostgreSQL), and then within Spark using sparklyr. These data-based workflows extend into machine learning algorithms, model evaluation, and data visualization using sparklyr and ggplot2.
The machine learning algorithms are taught using Spark's distributed computational engine on data stored in HDFS or distributed in memory. Concepts of data locality and methods of avoiding data shuffling are discussed. The supervised techniques include linear regression, logistic regression, generalized linear regression, decision trees, gradient-boosted trees, and random forests. Feature selection is done primarily by regularization, and models are evaluated using various metrics. Unsupervised techniques include k-means clustering and dimension reduction. TensorFlow for deep learning is introduced. Big-data architectures are discussed, including the Docker containers used to build the course infrastructure, called rspark. See: https://github.com/jharner/rspark
The Docker containers can be run on the desktop, run using Vagrant, or deployed to Amazon Web Services (AWS). As a result, students have access to a full big-data computing platform and extensive course content.
Instructor(s): E. James Harner, West Virginia University