Keywords: big data, health systems, genomics, Hadoop, Spark, Cloudera
The continuous influx of high-throughput sequencing data quickly overwhelms a bioinformatics analysis paradigm based on traditional clusters and relational databases. Innovative "big data" solutions built on the open-source Apache Hadoop and Spark cluster technologies have been employed to address this challenge. ADAM and Hail are two of the cutting-edge projects in big data genomics. To leverage these powerful new tools while keeping in view the practical applications needed to support Inova Health System's translational genomic research, we are building an integrated system composed of a Hadoop data warehouse (DW) with Cloudera Impala as the backend, an ETL (extract, transform, load) workflow using ADAM and Spark, a middle-tier analysis platform powered by Spark and Hail, and a web front end for ad hoc queries and interactive data analysis. Example use cases are presented to demonstrate the power of our integrative big data genomic system for handling petabyte-scale data.