Abstract:
Analysts at large web firms are often tasked with analyzing and processing very large amounts of data in a quick, iterative fashion. This often involves formulating a sequence of hypotheses to test, each of which queries the same data but analyzes a different stratum, so the same data is queried repeatedly. This presents challenges when each query must be answered quickly and computational resources are constrained. Subsampling the data is one way to reduce both the time and the computational cost while still providing statistical insight into the data. Furthermore, when the subsample is small enough to store in memory, subsampling greatly increases the set of software packages that can be used for analysis. However, drawing a useful subsample can be difficult when the data is severely skewed, with a few strata dominating the others in size. This work proposes a novel method for drawing a stratified sample from a stream when the memory budget is constrained, the data may be highly skewed, and the number of strata of interest is potentially very large.
ASA Meetings Department
732 North Washington Street, Alexandria, VA 22314
(703) 684-1221 • meetings@amstat.org
Copyright © American Statistical Association.