Thursday, May 17

Data Science Platforms I

Thu, May 17, 5:15 PM - 6:15 PM
Grand Ballroom G

Building Data Science Platforms Using Docker (304907)

*Jim Harner, West Virginia University
Mark Lilback, Rc2ai
Will Foreman, Rc2ai

Keywords: Docker Containers, Rocker, Rspark, Data Science, Singularity Containers

The use of Docker for containerization has grown rapidly since its introduction in 2013 and is expected to keep growing. Docker usage in the cloud is already about 50%. So why is this happening?

VM hypervisors virtualize hardware, which makes virtual machines heavy. Containers (running instances of images created by developers), on the other hand, are much lighter because they share the kernel and resources of a single instance of Linux. In addition, Docker allows developers to continuously integrate (CI) their code in a shared repository and to continuously deploy (CD) it. This makes it easier for developers to ship their applications as lightweight, self-contained, portable images.

Perhaps the most important reason for statisticians and data scientists to use Docker is the complexity of their computing environments, e.g., aligning version numbers for R and its packages, drivers, SQL and NoSQL databases, Hadoop, Spark and its packages, etc. Use cases for R based on Rocker (https://github.com/rocker-org/rocker) and R + Spark based on Rspark (https://github.com/jharner/rspark) illustrate the difficulties encountered. Rocker provides Docker images for customized R environments. These images range from base R (r-base) to the Tidyverse (verse) and Shiny (shiny), all built on a Debian foundation.
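As a rough illustration of how a version-tagged image pins the R environment, the following Python sketch assembles a `docker run` command for a Rocker image. The helper name `rocker_run_command`, the tag `3.5.0`, and the placeholder password are assumptions made for this example and are not taken from the abstract; port 8787 and the `PASSWORD` variable follow the usual conventions of Rocker's RStudio-based images.

```python
import shlex


def rocker_run_command(image="verse", r_version="3.5.0",
                       password="changeme", host_port=8787):
    """Build a `docker run` argument list for a version-tagged Rocker image.

    The image name, tag, and password are illustrative assumptions.
    """
    # Pinning the tag fixes the R version (and the packages baked into the image).
    image_ref = f"rocker/{image}:{r_version}"
    return [
        "docker", "run", "-d",
        "-p", f"{host_port}:8787",     # RStudio Server listens on 8787 in the container
        "-e", f"PASSWORD={password}",  # login password for the rstudio user
        image_ref,
    ]


cmd = rocker_run_command()
print(shlex.join(cmd))  # a shell-ready command line
```

Because the tag is part of the image reference, everyone who runs the same command gets the same R version and package set, which is the alignment problem described above.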

Rspark's principal image, called rstudio, is built on top of Rocker's verse image, which contains RStudio, the Tidyverse, LaTeX, etc. Various data science R packages are added to this image, e.g., sparklyr, tensorflow, and rhipe, as well as database and distributed storage (HDFS) drivers and a single instance of Spark. Rspark has other images for PostgreSQL, Hadoop, Hive, and a cluster version of Spark. These Docker images can easily be converted to Singularity images (https://singularity.lbl.gov), which do not require root access to run and thus can be used in high-performance computing environments.
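The layering described above can be sketched as follows. This is not Rspark's actual Dockerfile; it is a minimal Python sketch, under stated assumptions, that generates a two-line Dockerfile extending `rocker/verse` with a couple of CRAN packages and builds the `singularity build` command for converting a Docker image. The function names, the image name `example/rspark-rstudio`, and the output file name are hypothetical.

```python
def rstudio_dockerfile(base="rocker/verse", packages=("sparklyr", "tensorflow")):
    """Return Dockerfile text that adds CRAN packages on top of a base image.

    A simplified stand-in: the real image also adds drivers and Spark itself.
    """
    pkgs = ", ".join(f"'{p}'" for p in packages)
    lines = [
        f"FROM {base}",
        f'RUN R -e "install.packages(c({pkgs}))"',
    ]
    return "\n".join(lines) + "\n"


def singularity_build_command(docker_image, output="rstudio.sif"):
    """Build the argument list that converts a Docker image to a Singularity image."""
    # The docker:// prefix tells Singularity to pull the image from a Docker registry.
    return ["singularity", "build", output, f"docker://{docker_image}"]


print(rstudio_dockerfile())
print(" ".join(singularity_build_command("example/rspark-rstudio")))
```

Generating the Dockerfile from a package list keeps the set of added packages in one place; the Singularity step reuses the published Docker image rather than rebuilding it.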

Rspark is designed primarily for teaching. However, a master-worker version of Spark is being developed, which, if orchestrated, will allow large-scale deployments. Typically, Rspark is deployed to AWS, and users clone their code from GitHub and push it back using RStudio. Persistent data, even if large, is stored in Docker volumes.
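To illustrate how a Docker volume keeps data across container restarts, here is a small Python sketch that assembles the two commands involved: creating a named volume and mounting it into a container. The volume name `rspark_home`, the mount point `/home/rstudio`, and the image name are assumptions chosen for the example, not details from the abstract.

```python
def volume_commands(image, volume="rspark_home", mount_point="/home/rstudio"):
    """Return the commands that create a named volume and run a container using it.

    All names here are illustrative; adjust them to the actual deployment.
    """
    create = ["docker", "volume", "create", volume]
    # -v <volume>:<path> mounts the named volume at <path> inside the container,
    # so files written there outlive any single container.
    run = ["docker", "run", "-d", "-v", f"{volume}:{mount_point}", image]
    return create, run


create_cmd, run_cmd = volume_commands("rocker/verse")
print(" ".join(create_cmd))
print(" ".join(run_cmd))
```

Since the volume is managed by Docker rather than tied to a container, the container can be removed and replaced with a newer image while the stored data stays in place.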
