Abstract:
|
Dramatic increases in the size of datasets have made traditional statistical inference techniques computationally prohibitive. Surprisingly, very little attention has been given to developing inferential algorithms for data whose volume exceeds the capacity of a single machine. A question of immediate concern is how to design a data-intensive statistical inference architecture without changing the fundamental data-modeling principles developed over the last century. To address this challenge, we present MetaLP, a flexible, distributed statistical modeling paradigm suitable for large-scale data analysis, where statistical inference meets big data technology. This generic statistical approach addresses two main challenges of large datasets: (1) massive volume and (2) data variety, or mixed data. We also present an application of this general theory in the context of a nonparametric two-sample inference algorithm for Expedia personalized hotel recommendations, based on 10 million search result records.
|