Abstract:
|
The integrative analysis of disparate datasets is an important strategy in data analysis. It is increasingly popular in genomics, which enjoys a wealth of publicly available datasets that can be compared and contrasted, or combined with new data, to extract novel scientific insights. This paper studies a simple but non-trivial example of data integration: leveraging an auxiliary sequence of side information for the simultaneous estimation of a vector of normal means. The task is formulated as a compound decision problem, an oracle integrative decision rule is derived, and a data-driven estimate of this rule, obtained by minimizing a Stein's unbiased risk estimate (SURE) of the oracle risk, is proposed. The data-driven rule is shown to asymptotically achieve the minimum possible risk among all separable decision rules, and its good performance is demonstrated in numerical experiments. The proposed method leads naturally to an integrative high-dimensional classification procedure, which is shown to be capable of outperforming non-integrative methods in problems in genomics.
|