Online Program

Thursday, May 30
Data Science Technologies
Practice and Applications
Data Science Applications E-Posters, I
Thu, May 30, 3:00 PM - 4:00 PM
Grand Ballroom Foyer

Batch effect adjustment via ensemble learning in the validation of genomic classifiers (306374)

W. Evan Johnson, Boston University 
Giovanni Parmigiani, Dana-Farber Cancer Institute 
*Yuqing Zhang, Boston University 

Keywords: Genomics, Batch effect adjustment, Data harmonization, Ensemble learning, Binary classification

Genomic data are often produced in batches because of practical restrictions, which can introduce batch effects: unwanted variation in the data caused by discrepancies across processing batches. Batch effects often have negative impacts on downstream biological analysis and therefore need to be harmonized effectively. In practice, batch effects are usually corrected by specifically designed software that merges the batches, then estimates the batch effects and removes them from the integrated data. We propose a different harmonization strategy that addresses batch effects in genomic studies through ensemble learning. In this framework, we first develop prediction models within each batch, then integrate the models through various ensemble methods. In contrast to the typical approach of removing batch effects from the merged data, our method integrates learners rather than data. We provide a systematic comparison between these two harmonization strategies, using RNA-Seq studies targeting diverse populations infected with tuberculosis. In one study, we simulate increasing levels of heterogeneity across random subsets of the study, which we treat as simulated batches. We then use the two harmonization methods to address the simulated batch effects and develop a genomic classifier for a binary indicator of tuberculosis progression status. We evaluate prediction accuracy in an independent study targeting a different population cohort. We observe a turning point in the level of heterogeneity, after which our strategy of integrating the learners yields better prediction accuracy in independent validation than the traditional method of integrating the data. These observations provide practical guidelines for data harmonization in the development and evaluation of genomic classifiers.
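
The two strategies contrasted in the abstract can be illustrated with a small sketch. This is not the authors' implementation, and everything in it is an assumption made for illustration: the batch effect is simulated as a simple mean shift, per-batch centering stands in for dedicated batch-correction software, the learner is a hand-written logistic regression, and the ensemble is a batch-size-weighted average of predicted probabilities (the abstract mentions "various ensemble methods" without naming them). The helper names `simulate`, `fit_logistic`, `merge_then_train`, and `ensemble_predict` are hypothetical.

```python
import numpy as np

def simulate(rng, n, p, shift=0.0):
    """Toy data: binary labels, p features, and an additive batch shift (assumed form)."""
    y = rng.integers(0, 2, size=n).astype(float)
    X = rng.normal(size=(n, p)) + shift
    X[:, :5] += 1.5 * y[:, None]  # the first 5 features carry the class signal
    return X, y

def add_intercept(X):
    return np.hstack([np.ones((X.shape[0], 1)), X])

def fit_logistic(X, y, lr=0.1, n_iter=500, l2=1e-2):
    """L2-penalized logistic regression fitted by plain gradient descent."""
    Xi = add_intercept(X)
    w = np.zeros(Xi.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xi @ w))
        w -= lr * (Xi.T @ (p - y) / len(y) + l2 * w)
    return w

def predict_proba(w, X):
    return 1.0 / (1.0 + np.exp(-add_intercept(X) @ w))

def merge_then_train(batches):
    """Integrate the data: center each batch, pool the batches, fit one model."""
    X = np.vstack([Xk - Xk.mean(axis=0) for Xk, _ in batches])
    y = np.concatenate([yk for _, yk in batches])
    return fit_logistic(X, y)

def ensemble_predict(batches, X_new):
    """Integrate the learners: fit one model per batch, then average their
    predicted probabilities with weights proportional to batch size."""
    weights = np.array([len(yk) for _, yk in batches], dtype=float)
    weights /= weights.sum()
    probs = np.stack([predict_proba(fit_logistic(Xk, yk), X_new)
                      for Xk, yk in batches])
    return weights @ probs

# Three simulated batches with increasing mean shift, plus an independent test set.
rng = np.random.default_rng(0)
batches = [simulate(rng, 60, 20, shift=s) for s in (0.0, 0.5, 1.0)]
X_test, y_test = simulate(rng, 100, 20)

# The merged-data model was trained on centered features, so center the test set too.
p_merged = predict_proba(merge_then_train(batches), X_test - X_test.mean(axis=0))
p_ensemble = ensemble_predict(batches, X_test)

for name, p in [("merged data", p_merged), ("ensemble of learners", p_ensemble)]:
    acc = np.mean((p > 0.5) == (y_test == 1))
    print(f"{name}: accuracy = {acc:.2f}")
```

The accuracies printed by this toy example say nothing about the study's findings; the sketch only shows where the two strategies differ, namely whether harmonization happens before model fitting (on the pooled data) or after it (on the fitted models' predictions).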