Abstract:
|
Cross-study validation (CSV) of prediction models is an alternative to traditional cross-validation (CV) in domains where multiple comparable datasets are available. Although many studies have noted potential sources of heterogeneity in genomic studies, to our knowledge none have systematically investigated their intertwined impacts on prediction accuracy across studies. We employ a hybrid parametric/non-parametric bootstrap method to realistically simulate publicly available compendia of microarray, RNA-seq, and whole metagenome shotgun (WMS) microbiome studies of health outcomes. We assess CSV accuracy while manipulating the following types of heterogeneity, alone and in combination: 1) differences in the prevalence of clinical and pathological covariates, 2) differences in predictor covariance, as could arise from batch effects, and 3) differences in the ``true'' model predicting outcome. The most easily identifiable sources of study heterogeneity are consistently not the ones that most undermine the ability to replicate the accuracy of omics prediction models in new studies. Unidentified heterogeneity, such as could arise from unmeasured confounding, may be more important.
|