Online Program

Saturday, May 19
Computational Statistics
Bioinformatics
Sat, May 19, 10:30 AM - 12:00 PM
Regency Ballroom A
 

Big Data Distributed System for Phenome and Genome Management and Analysis in a Large Health System (304605)

Aaron Black, Inova Translational Medicine Institute 
John F Deeken, Inova Translational Medicine Institute 
Shan Gao, Inova Translational Medicine Institute 
Henry Hunter, Inova Translational Medicine Institute 
Prachi Kothiyal, Inova Translational Medicine Institute 
Xinyue Liu, Inova Translational Medicine Institute 
Sakthi Madhappan, Inova Translational Medicine Institute 
John E Niederhuber, Inova Translational Medicine Institute 
Lin Smith, Inova Translational Medicine Institute 
*Wendy S.W. Wong, Inova Translational Medicine Institute 
Fang Zhou, Inova Translational Medicine Institute 
Wei Zhu, Inova Translational Medicine Institute 

Keywords: big data, health systems, genomics, Hadoop, Spark, Cloudera

The continuous influx of high-throughput sequencing data quickly overwhelms a bioinformatics analysis paradigm based on traditional clusters and relational databases. "Big data" solutions built on the open-source Apache Hadoop and Spark cluster technologies have been employed to address this challenge, and ADAM and Hail are two of the leading projects in big data genomics. To leverage these tools while keeping in view the practical applications that support Inova Health System's translational genomic research, we are building an integrated system composed of a Hadoop data warehouse (DW) with Cloudera Impala as the backend, an ETL (Extraction, Transformation, Loading) workflow using ADAM and Spark, an analysis platform middle tier powered by Spark and Hail, and a web front-end for ad hoc query and interactive data analysis. Example use cases are presented to demonstrate the capacity of our integrated big data genomic system to handle petabyte-scale data.
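The extract, transform, load, and query flow described above can be illustrated with a toy, single-machine sketch. This is not the authors' pipeline: it uses only the Python standard library, with an in-memory SQLite table standing in for the Hadoop/Impala warehouse and plain functions standing in for the ADAM and Spark ETL steps. The input records, table name, and function names are all hypothetical.

```python
import sqlite3

# Hypothetical, minimal VCF-like input (tab-separated: CHROM POS ID REF ALT).
VCF_TEXT = """\
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT
1\t1000\trs1\tA\tG
1\t2000\trs2\tC\tT,G
2\t1500\t.\tG\tA
"""


def extract(vcf_text):
    """Extract: yield the fields of each data line, skipping '#' header lines."""
    for line in vcf_text.splitlines():
        if line and not line.startswith("#"):
            yield line.split("\t")


def transform(rows):
    """Transform: emit one flat record per alternate allele."""
    for chrom, pos, vid, ref, alts in rows:
        for alt in alts.split(","):
            yield (chrom, int(pos), vid, ref, alt)


def load(records):
    """Load: store records in an in-memory SQLite table (warehouse stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE variants "
        "(chrom TEXT, pos INTEGER, id TEXT, ref TEXT, alt TEXT)"
    )
    conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?)", records)
    conn.commit()
    return conn


def variants_per_chromosome(conn):
    """Ad hoc query: count variant alleles per chromosome."""
    cur = conn.execute(
        "SELECT chrom, COUNT(*) FROM variants GROUP BY chrom ORDER BY chrom"
    )
    return cur.fetchall()


conn = load(transform(extract(VCF_TEXT)))
print(variants_per_chromosome(conn))  # [('1', 3), ('2', 1)]
```

Splitting multi-allelic sites into one row per allele keeps the table flat, which is the kind of layout that makes SQL-style ad hoc queries straightforward; a distributed system would apply the same idea across many files and nodes.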