Online Program

Thursday, May 17
Data Science
Big Data Analytics Using R and Spark
Thu, May 17, 1:30 PM - 3:00 PM
Grand Ballroom G
 

Interacting with Distributed Data from R using SparkR (304549)

Presentation

*Hossein Falaki, Databricks 

Keywords: Apache Spark, R, Distributed Computing

SparkR is a new and evolving R interface to Apache Spark that offers a wide range of APIs and capabilities to data scientists and statisticians. Spark is a distributed system with a JVM core, and SparkR’s interactions between R and the JVM are based on a custom RPC channel implemented in Spark. In this talk we will show what goes on under the hood when you use SparkR, looking at the SparkR architecture and its API semantics. Equipped with that background, we can go one layer deeper to understand performance bottlenecks and best practices when using SparkR.
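
To make the idea of an RPC channel between a front end and a JVM-style core more concrete, here is a minimal sketch in Python of the general pattern: the client holds only opaque handles, and each method call is serialized, sent to a backend that owns the real objects, and answered with either a value or a new handle. This is an illustration under stated assumptions, not SparkR's actual protocol; the names `Backend`, `Handle`, and the JSON message layout are hypothetical, and the "channel" is an in-process function call rather than a socket.

```python
import json


class Backend:
    """Hypothetical stand-in for the JVM side: owns the real objects, keyed by ID."""

    def __init__(self):
        self._objects = {}
        self._next_id = 0

    def register(self, obj):
        # Store the object and hand back an opaque ID the client can refer to.
        obj_id = str(self._next_id)
        self._next_id += 1
        self._objects[obj_id] = obj
        return obj_id

    def handle(self, request_bytes):
        # Decode the request, look up the target object, and invoke the method.
        request = json.loads(request_bytes.decode("utf-8"))
        target = self._objects[request["id"]]
        result = getattr(target, request["method"])(*request["args"])
        # Simple scalars travel back by value; anything else stays on the
        # backend and only a handle ID is returned.
        if isinstance(result, (int, float, str, bool, type(None))):
            reply = {"kind": "value", "value": result}
        else:
            reply = {"kind": "handle", "id": self.register(result)}
        return json.dumps(reply).encode("utf-8")


class Handle:
    """Hypothetical client-side proxy: holds an ID, never the object itself."""

    def __init__(self, backend, obj_id):
        self._backend = backend
        self._id = obj_id

    def call(self, method, *args):
        # Serialize the call, send it across the (simulated) channel, decode the reply.
        request = json.dumps({"id": self._id, "method": method, "args": list(args)})
        reply = json.loads(self._backend.handle(request.encode("utf-8")).decode("utf-8"))
        if reply["kind"] == "handle":
            return Handle(self._backend, reply["id"])
        return reply["value"]


backend = Backend()
numbers = Handle(backend, backend.register([3, 1, 2]))

sorted_copy = numbers.call("copy")         # the new list stays on the backend
sorted_copy.call("sort")                   # mutation happens on the backend
first = sorted_copy.call("__getitem__", 0) # a scalar comes back by value
```

In a design like this, every call is a round trip and every value returned to the client is serialized on the way, which suggests why the volume of data crossing such a boundary is a natural place to look for performance bottlenecks.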