Abstract:
|
We introduce the Programming with Big Data in R (pbdR) project, which comprises several packages available at http://pbdr.org/ and on CRAN. The packages provide a broad parallel computing capability that spans multicore laptops, multi-node clusters, and supercomputers. Our philosophy is to learn from the high performance computing community and provide native R interfaces. pbdR aims to bring R and statistical computing to supercomputer architectures that combine shared memory, distributed memory, and co-processor hardware.
pbdR was initially developed for the message passing interface (MPI) environment. Later work focused on scalable numerical libraries (ScaLAPACK) to extend R's capability on high performance computing systems. Subsequently, several statistical applications were implemented and applied to terascale datasets. In addition to batch programming, we recently developed a client-server interface for interactive programming on distributed systems. Using an asynchronous messaging library (ZeroMQ), it is possible to interactively control a distributed set of R sessions that cooperate in single program, multiple data (SPMD) fashion.
|