Activity Number: 304
Type: Topic Contributed
Date/Time: Tuesday, August 5, 2008, 2:00 PM to 3:50 PM
Sponsor: Section on Statistical Computing

Abstract #302244
Title: High-Performance Processing of Large Data Sets via Memory Mapping: A Case Study in R and C++
Author(s): Daniel Adler*+, Jens Oehlschlägel, Oleg Nenadic, and Walter Zucchini
Companies: Georg-August University of Göttingen; Research Consultant; Georg-August University of Göttingen; Georg-August University of Göttingen
Address: Platz der Göttinger Sieben 5, 37085 Göttingen, Germany
Keywords: large dataset processing; C++; R; memory-mapping
Abstract:
We present the current status of a package (called 'ff') for processing large data sets that do not fit in memory. While database systems are effective for selecting subsets of complex-structured data, mass data processing in scientific contexts typically operates on flat structures (such as vectors and matrices) whose simplicity can be exploited for performance. For example, mirroring regions of persistent storage into main memory (memory mapping) makes it possible to process a data set transparently, as if it were held in RAM. We illustrate these concepts with new R container types that mimic R vectors and matrices, enabling users to work on large data sets with familiar functions. The underlying C++ framework allows new data types to be specified. Space-saving virtual storage modes, such as 1-bit logicals and single-precision reals, are implemented.
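As a minimal sketch of these ideas, the following R example assumes the CRAN release of the 'ff' package; the ff() constructor, the vmode names, and delete() are taken from that release and may differ in detail from the development version described in this abstract. The lengths are kept small here, but because the containers are file-backed and memory-mapped, the same code scales to data that exceed available RAM.

    library(ff)

    ## File-backed double vector: the data live on disk and are
    ## memory-mapped in chunks rather than held in memory.
    x <- ff(vmode = "double", length = 1e6)

    ## Familiar R idioms work on the ff container.
    x[1:5] <- rnorm(5)
    sum(x[1:5])

    ## Space-saving virtual storage modes: a 1-bit logical vector ...
    flags <- ff(vmode = "boolean", length = 1e6)
    flags[1:3] <- c(TRUE, FALSE, TRUE)

    ## ... and a single-precision (4-byte) matrix.
    m <- ff(vmode = "single", dim = c(1000, 1000))
    m[1, 1:3] <- c(0.5, 1.5, 2.5)

    ## Remove the backing files when done.
    delete(x); delete(flags); delete(m)

The vmode argument selects the packed on-disk representation, so a "boolean" vector occupies one bit per element and a "single" matrix four bytes per element, independently of R's in-memory types.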