Keywords: Apache Spark, R, Distributed Computing
SparkR is a new and evolving interface to Apache Spark. It offers a wide range of APIs and capabilities to data scientists and statisticians. Spark is a distributed system with a JVM core, and SparkR's interaction between R and the JVM relies on a custom RPC channel implemented in Spark. In this talk we will show what goes on under the hood when you use SparkR. We will look at SparkR's architecture and its API semantics. Equipped with those, we can go one layer deeper to understand performance bottlenecks and best practices when using SparkR.