Abstract:
|
The count-min sketch is a time- and memory-efficient randomized data structure that provides a point estimate of the number of times an item has appeared in a data stream. The count-min sketch and related hash-based data structures are ubiquitous in systems that must track frequencies of data such as URLs, IP addresses, and language n-grams. We present a Bayesian view on the count-min sketch, using the same data structure, but providing a posterior distribution over the frequencies that characterizes the uncertainty arising from the hash-based approximation. In particular, we take a nonparametric approach and consider tokens generated from a Dirichlet process (DP) random measure, which allows for an unbounded number of unique tokens. Using properties of the DP, we show that we can straightforwardly compute posterior marginals of the unknown true counts and that the modes of these marginals recover the classical count-min sketch estimator, inheriting the associated probabilistic guarantees. Additionally, the posterior distribution leads to natural shrinkage estimators for the count, and we demonstrate this on text data. Joint work with Michael Mitzenmacher and Ryan P. Adams.
|