Abstract:
|
Advances in computing (processors) have outpaced advances in data storage and network bandwidth. Computational scientists can now perform large-scale simulations at high resolution in both space and time, yet they cannot examine all of the generated data at once, and interactive visualization and queries are prohibitively slow. Data partitioning, or sub-selection, becomes necessary to reduce the data size. Our task is to partition the data so that every element (point, cell, row, etc.) of the raw data belongs to one and only one partition. We then store summary information about each partition, such as a representative value plus an error estimate, or a distribution, rather than the raw data itself, reducing the data size while, most importantly, preserving the interesting data characteristics. Creating these partitions involves many design decisions. We present a metric for evaluating data partitioning quality, inspired by model comparison techniques and designed to balance the tradeoffs among raw data reproducibility, accuracy, and storage cost. We explore and evaluate the metric's performance on partitioning data from real-world, large-scale simulations.
|
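
As a loose illustration of the idea, and not the paper's actual method, the sketch below shows one way the pieces could fit together in Python: each partition is reduced to a representative value plus an error, and a whole partitioning is scored with a model-comparison-style criterion that trades reconstruction error against storage cost. The helpers `summarize` and `score`, the per-summary byte count, and the BIC-like form of the score are all assumptions for illustration.

```python
# Hypothetical sketch, not the paper's implementation: replace each
# partition's raw values with a (mean, error) summary, then score a
# partitioning with a model-comparison-style criterion that balances
# reconstruction error against storage cost.
import numpy as np

def summarize(partition: np.ndarray) -> tuple[float, float]:
    """Summarize one partition as (representative value, RMS error)."""
    mean = float(partition.mean())
    err = float(np.sqrt(((partition - mean) ** 2).mean()))
    return mean, err

def score(partitions: list[np.ndarray], bytes_per_summary: int = 16) -> float:
    """Lower is better: a BIC-like fit term (log mean squared error of
    reconstructing each element from its partition's representative)
    plus an assumed storage penalty per stored summary."""
    sse = sum(((p - p.mean()) ** 2).sum() for p in partitions)
    n = sum(p.size for p in partitions)
    storage = bytes_per_summary * len(partitions)
    return n * np.log(sse / n) + storage

rng = np.random.default_rng(0)
data = rng.normal(size=4096)
coarse = np.array_split(data, 8)    # few partitions: cheap to store, lossy
fine = np.array_split(data, 512)    # many partitions: accurate, costly

print("first coarse summary (mean, err):", summarize(coarse[0]))
print("coarse score:", score(coarse), "fine score:", score(fine))
```

Comparing the two scores shows the intended tradeoff: refining the partitioning shrinks the error term but inflates the storage term, and the metric picks out the balance point between the two.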