Abstract:
|
Massive data generally consist of numerous heterogeneous datasets, some of which may be similar enough to be regarded as drawn from the same sub-population. In this paper, we characterize this subtle data structure by proposing a Two-layer HEterogeneity Model (THEM) framework that accounts for heterogeneity both among sub-populations and within each sub-population. Under this framework, we propose a confidence distribution fusion approach that discovers the underlying sub-population structure and achieves the same statistical inferential accuracy as if the true sub-population structure were known. The approach can be implemented efficiently in parallel using the alternating direction method of multipliers. Finally, the proposed methodology is applied to a large climate dataset, where it reveals a possible association with the El Niño-Southern Oscillation.
|