Keywords: dplyr, Spark, workflows, machine learning, Docker containers
The data science process is implemented using R as the programming language and Spark as the big-data platform. Powerful workflows are developed for data extraction, tidying, and transformation. These workflows use 'dplyr' verbs for data manipulation within R, within relational databases (PostgreSQL), and within Spark (via 'sparklyr').
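As a brief illustration of the approach, the same dplyr verbs can be run against a Spark table through sparklyr, which translates the pipeline to Spark SQL. This is a minimal sketch; the local master URL and the use of the nycflights13 data are illustrative assumptions, not details from the abstract.

```r
library(dplyr)
library(sparklyr)

# Illustrative: a local Spark session; a cluster master URL would be used in practice.
sc <- spark_connect(master = "local")
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")

# The same dplyr verbs work on a local data frame or a Spark table;
# sparklyr translates the pipeline to Spark SQL behind the scenes.
delay_summary <- flights_tbl %>%
  group_by(carrier) %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(mean_delay)) %>%
  collect()   # bring the summarized result back into R

spark_disconnect(sc)
```

Against a PostgreSQL backend, only the connection step changes; the verbs themselves are unchanged.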
These data workflows extend into machine learning, model evaluation, and data visualization. The machine learning algorithms include supervised techniques in which feature selection is done primarily by regularization, and models are evaluated using appropriate performance metrics. Unsupervised techniques and TensorFlow are discussed briefly.
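A sketch of the supervised-learning step using sparklyr's Spark ML bindings is shown below; the data set, formula, and penalty values are illustrative assumptions. An elastic-net penalty with `elastic_net_param = 1` is a pure lasso (L1) fit, so feature selection happens through regularization, and the held-out data are scored with a standard metric.

```r
library(dplyr)
library(sparklyr)

sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, "mtcars")   # illustrative data set

# Hold out a test set for evaluation.
partitions <- mtcars_tbl %>%
  sdf_random_split(training = 0.75, test = 0.25, seed = 42)

# Regularized logistic regression: reg_param sets the penalty strength,
# elastic_net_param = 1 selects the L1 (lasso) penalty.
fit <- ml_logistic_regression(
  partitions$training,
  am ~ mpg + wt + hp,
  reg_param = 0.01,
  elastic_net_param = 1
)

# Evaluate on the held-out partition with area under the ROC curve.
pred <- ml_predict(fit, partitions$test)
ml_binary_classification_evaluator(pred, metric_name = "areaUnderROC")

spark_disconnect(sc)
```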
The underlying big-data architecture is also discussed, including the Docker containers used to build the infrastructure, called 'rspark' (https://github.com/jharner/rspark). The Docker containers can be run on the desktop, run using Vagrant, or deployed to Amazon Web Services (AWS).
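The desktop and Vagrant workflows might look roughly like the following; the exact commands, compose files, and container names are assumptions, so the rspark repository's own documentation should be treated as authoritative.

```shell
# Fetch the infrastructure definition.
git clone https://github.com/jharner/rspark.git
cd rspark

# Desktop: bring up the linked containers (e.g., RStudio, PostgreSQL, Spark)
# locally with Docker Compose (compose file name assumed).
docker-compose up -d

# Vagrant: provision the same containers inside a virtual machine
# (assumes a Vagrantfile in the repository).
vagrant up
```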