Online Program

Saturday, June 1
Data Science Technologies
Data Science Platforms: Docker and Kubernetes
Sat, Jun 1, 2:45 PM - 3:50 PM
Grand Ballroom I

RsparkHub: Scaling Rspark with Kubernetes (305066)

*Jim Harner, West Virginia University 
Mark Lilback, Rc2ai 
Alex Harner, Rc2ai 

Keywords: R, Spark, Docker containers, Kubernetes, Ingress

Rspark is a collection of Docker containers for running R, Hadoop, and Spark with various persistent data stores, including PostgreSQL, HDFS, and Hive. Currently, Rspark uses an RStudio-based edge container, but other interfaces to R and Python could be used, e.g., RCloud. Rspark can be run in single-user mode (running on a local machine or server) or in a multi-user hub called RsparkHub, typically run on a scalable cloud service.
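A single-user Rspark stack of this kind might be wired together with Docker Compose along the following lines. This is a hedged sketch only: the image names (`rspark/rstudio`), ports, and credentials are illustrative assumptions, not the actual Rspark configuration.

```yaml
# Illustrative single-user Rspark-style stack: an RStudio edge container
# plus a PostgreSQL persistent data store. Image names are hypothetical.
version: "3.8"
services:
  rstudio:
    image: rspark/rstudio:latest     # hypothetical edge-container image
    ports:
      - "8787:8787"                  # RStudio Server's default port
    depends_on:
      - postgres
  postgres:
    image: postgres:14
    environment:
      POSTGRES_PASSWORD: example     # placeholder credential
    volumes:
      - pgdata:/var/lib/postgresql/data   # persist data across restarts
volumes:
  pgdata:
```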

Kubernetes orchestrates sets of running Docker containers and volumes, called pods, which are the smallest deployable unit within a cluster. Applications running in the same pod share the same IP address, the same port space, and the same namespace. Although deployment of pods can be done in different ways, e.g., as a traditional HPC environment, the focus here is on scaling the number of Rspark users.
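A minimal pod manifest makes the shared-namespace point concrete: the two containers below share one IP address and port space, so the R session can reach its sidecar on localhost. All names and images here are illustrative assumptions, not the actual Rspark manifests.

```yaml
# Sketch of a Kubernetes Pod grouping tightly coupled containers.
# Both containers share the pod's IP address and port space.
apiVersion: v1
kind: Pod
metadata:
  name: rspark-session            # hypothetical per-user session pod
spec:
  containers:
    - name: rstudio
      image: rspark/rstudio:latest   # hypothetical edge-container image
      ports:
        - containerPort: 8787
    - name: java-sidecar
      image: openjdk:11
      command: ["sleep", "infinity"] # placeholder for a coupled Java service
```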

RsparkHub uses Ingress for managing access by multiple users to the various services supported by Rspark. This involves user authentication and the spawning of requested Rspark services. The canonical Rspark service is a pod running R, RStudio, and other tightly coupled applications, e.g., Java. Other services, e.g., PostgreSQL and Spark, are shared, stateful services and are made available to the user depending on the login request and the user's permissions. Kubernetes not only ensures that the dependencies among and within the requested pods are enforced, but also that the number of available pods always exceeds the number of running pods by a specified number.
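An Ingress resource of the kind described might route authenticated users to their session service along these lines. The host, path, service name, and the NGINX auth annotation are assumptions for the sketch, not RsparkHub's actual configuration.

```yaml
# Illustrative Ingress routing users to an Rspark session service.
# The auth-url annotation (NGINX Ingress controller) delegates user
# authentication to an external endpoint; all names are hypothetical.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: rsparkhub
  annotations:
    nginx.ingress.kubernetes.io/auth-url: "https://auth.example.com/verify"
spec:
  rules:
    - host: rsparkhub.example.com
      http:
        paths:
          - path: /rstudio
            pathType: Prefix
            backend:
              service:
                name: rstudio-session   # per-user RStudio pod's service
                port:
                  number: 8787
```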

The power and flexibility of the Rspark architecture is illustrated with several machine learning examples.