Keywords: Genomics, Batch effect adjustment, Data harmonization, Ensemble learning, Binary classification
Genomic data are often produced in batches due to practical restrictions, which can introduce batch effects: unwanted variation in the data caused by discrepancies across processing batches. Batch effects often have a negative impact on downstream biological analysis and therefore need to be addressed effectively. In practice, batch effects are usually corrected by specifically designed software that merges the batches, then estimates the batch effects and removes them from the integrated data. We propose a different harmonization strategy that uses ensemble learning to address batch effects in genomic studies. In this framework, we first develop prediction models within each batch, then integrate the models through various ensemble methods. In contrast to the typical approach of removing batch effects from the merged data, our method integrates learners rather than data. We provide a systematic comparison of these two harmonization strategies, using RNA-Seq studies targeting diverse populations that are infected with tuberculosis. In one study, we simulate increasing levels of heterogeneity across random subsets of the study, which we treat as simulated batches. We then use the two harmonization methods to address the simulated batch effects and develop a genomic classifier for a binary indicator of tuberculosis progression status. We evaluate prediction accuracy in an independent study targeting a different population cohort. We observe a turning point in the level of heterogeneity, beyond which our strategy of integrating the learners yields better prediction accuracy in independent validation than the traditional method of integrating the data. These observations provide practical guidelines for data harmonization in the development and evaluation of genomic classifiers.
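The contrast between the two strategies can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not name the learners or the ensemble methods, so a nearest-centroid scorer stands in for the per-batch prediction model, an (optionally weighted) majority vote stands in for the ensemble step, and the "integrate the data" path simply pools the batches without the batch-effect removal that dedicated software would perform. All function names are hypothetical.

```python
from statistics import fmean


def fit_centroid_learner(X, y):
    """Fit a nearest-centroid linear scorer on one batch (stand-in learner).

    Returns (w, b) such that w.x + b > 0 when x is closer to the
    class-1 centroid than to the class-0 centroid.
    """
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]
    mu1 = [fmean(col) for col in zip(*pos)]
    mu0 = [fmean(col) for col in zip(*neg)]
    w = [a - c for a, c in zip(mu1, mu0)]
    mid = [(a + c) / 2 for a, c in zip(mu1, mu0)]
    b = -sum(wi * mi for wi, mi in zip(w, mid))
    return w, b


def score(learner, x):
    """Signed score of sample x under a fitted (w, b) learner."""
    w, b = learner
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def ensemble_predict(learners, x, weights=None):
    """Integrate learners: weighted majority vote over per-batch models."""
    if weights is None:
        weights = [1.0] * len(learners)
    votes = [1.0 if score(m, x) > 0 else 0.0 for m in learners]
    avg = sum(wt * v for wt, v in zip(weights, votes)) / sum(weights)
    return 1 if avg > 0.5 else 0  # an exact tie falls back to class 0


def merged_predict(batches, x):
    """Integrate data: pool all batches (no adjustment here), fit one model."""
    X = [row for Xb, _ in batches for row in Xb]
    y = [label for _, yb in batches for label in yb]
    return 1 if score(fit_centroid_learner(X, y), x) > 0 else 0


# Toy batches: feature 0 carries the class signal, feature 1 carries a
# batch-specific shift (a crude stand-in for a batch effect).
batches = [
    ([[1.0, 0.0], [3.0, 0.0]], [0, 1]),
    ([[1.0, 5.0], [3.0, 5.0]], [0, 1]),
    ([[1.0, -5.0], [3.0, -5.0]], [0, 1]),
]
learners = [fit_centroid_learner(Xb, yb) for Xb, yb in batches]
new_sample = [2.5, 100.0]
print(ensemble_predict(learners, new_sample), merged_predict(batches, new_sample))
```

In this sketch each learner only ever sees samples from its own batch, so a batch-level shift is shared by both classes within that batch and does not enter the per-batch decision rule. The toy data are not meant to reproduce the comparison reported here; they only show where the two strategies differ in what gets combined.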