Abstract:
|
Typical R workflows load an entire dataset into memory. When data are large, it is no longer feasible to load all of the data, or even to keep all of the data local to the R session. Instead, we need to push computation to the data, which might be located remotely and potentially spread across a computing cluster. We aim to hide this complexity from the user. A general approach is to capture ordinary R code and defer its evaluation until the code represents a reduction of the data to a manageable size. We have applied deferred evaluation to implement the base R API separately on top of Solr and on top of Spark. This talk will review those two interfaces and discuss the potential for generalization.
|