Keywords: Genomics, RNAseq, Bayesian, Mixture Models, Probabilistic Programming
With the drop in nucleotide sequencing costs, many researchers are studying transcriptional activity using RNA sequencing (RNAseq) experiments. However, our current understanding of technical noise, bias and measurement error in such data is only preliminary. The availability of a large number of public RNAseq datasets makes it possible to develop unifying models of baseline transcriptomic activity across different tissue types that are robust across sequencing technologies. In this work, I leverage data from a large publicly available study, the Genotype-Tissue Expression project (GTEx), to model transcriptional activity using hierarchical Bayesian mixture models, with the goal of better distinguishing noise from biological signal in a uniform manner. I will describe the excitement and challenges of doing genomic data science, from acquiring, cleaning and wrangling transcriptomic data to reasoning with statistical models to draw meaningful conclusions. I will also describe how a lack of adequate data curation prevents the larger genomic data science community from utilizing public data repositories that hold massive numbers of smaller studies.