Abstract:
|
Massive data often consists of a growing number of potentially heterogeneous sub-populations. This violates the i.i.d. assumption underlying popular large-scale methods such as divide-and-conquer. In this paper, we propose a testing procedure for detecting heterogeneity in the framework of high-dimensional linear models. A new type of sparsity arises in this problem, which we term heterogeneity sparsity. In theory, we prove that our large-scale test procedure is asymptotically consistent and minimax optimal, in the sense that it consistently detects departures from the null of a magnitude that no other test can improve upon. In addition, the test is adaptive to the unknown heterogeneity sparsity. We reveal an interesting phenomenon: to ensure consistent heterogeneity detection, if either the model dimensionality or the number of sub-populations is large, the other should not be too small relative to it. We call this phenomenon the "bless of massive data". These theoretical results hold, in particular, when the model dimensionality grows exponentially fast and the number of sub-populations diverges. As a by-product, we propose a consistent estimator of the heterogeneity sparsity.
|