Abstract:
|
The explosive growth of high-throughput genomic data brings both challenges and opportunities to statistics. Data at this scale make it possible to predict one high-throughput genomic data type from another. This task can be formulated as a challenging big data regression problem in which millions of high-dimensional regression models are fitted simultaneously. Here, we introduce BIRD, a big data regression model, to handle such high dimensionality and heavy computation. BIRD utilizes the correlation structure within and between data types to make fast and accurate predictions. We applied BIRD to predict DNase I hypersensitivity (DH) based on gene expression and found that gene expression to a large extent predicts DH. We show that (1) the predicted DH predicts transcription factor binding sites (TFBSs), (2) BIRD can be applied to gene expression samples in Gene Expression Omnibus (GEO) to predict the regulome for various biological contexts, and (3) the predicted DH can be used as pseudo-replicates to improve the analysis of high-throughput regulome profiling data.
|