Abstract:
|
The haplotype phasing problem is to determine the true combination of genetic variants on a single chromosome of an individual. Furthermore, the haplotypes of a gene can be expressed unequally, a phenomenon known as allele-specific expression (ASE). Haplotype phasing and quantification of ASE are essential for studying the association between genotype and disease. No existing method solves these two intrinsically linked problems together. Rather, most current strategies depend heavily on known haplotypes or family data. Herein, we present a novel method, IDP-ASE, which uses a Bernoulli mixture model for RNA-seq data and Markov chain Monte Carlo (MCMC) to derive the most likely set of haplotypes, phase each read to a haplotype, and estimate ASE. Our model leverages the strengths of both Second Generation Sequencing (SGS) and Third Generation Sequencing (TGS). The long read length of TGS data facilitates phasing, while the accuracy and depth of SGS data facilitate estimation of ASE. Moreover, IDP-ASE is capable of estimating ASE at both the gene and isoform level. We evaluate the performance of IDP-ASE on simulated data and apply it to several real data sets that harbor extensive ASE events.
|
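
To make the read-phasing idea concrete, the following is a minimal sketch of how a two-component Bernoulli mixture could assign a read to a haplotype once candidate haplotypes are fixed. It is not the IDP-ASE implementation: the function name `read_posterior`, the parameters `rho` (fraction of expression from haplotype 1) and `err` (per-base error rate), and the choice of a single shared error rate are all assumptions made for illustration. In an MCMC scheme, a quantity of this kind would typically be recomputed at each iteration as the haplotypes and `rho` are resampled.

```python
import math


def read_posterior(read, hap1, hap2, rho=0.5, err=0.01):
    """Posterior probability that `read` originates from hap1.

    Illustrative assumptions (not taken from the paper):
      read  -- dict mapping heterozygous-site index -> observed allele (0 or 1)
      hap1, hap2 -- lists of 0/1 alleles, one entry per heterozygous site
      rho   -- assumed fraction of transcripts expressed from hap1, in (0, 1)
      err   -- assumed per-base sequencing error rate, in (0, 1)
    """

    def log_lik(hap):
        # Each covered site is a Bernoulli observation: the read matches
        # the haplotype allele with probability 1 - err, else err.
        total = 0.0
        for site, allele in read.items():
            p = 1.0 - err if hap[site] == allele else err
            total += math.log(p)
        return total

    a = math.log(rho) + log_lik(hap1)
    b = math.log(1.0 - rho) + log_lik(hap2)
    m = max(a, b)  # subtract the max for numerical stability
    wa, wb = math.exp(a - m), math.exp(b - m)
    return wa / (wa + wb)


if __name__ == "__main__":
    hap1, hap2 = [0, 1, 0], [1, 0, 1]
    # A read covering sites 0 and 1 that agrees with hap1 at both.
    print(read_posterior({0: 0, 1: 1}, hap1, hap2))
```

A read that covers more heterozygous sites contributes more Bernoulli terms and is therefore assigned more decisively, which is one way to see why long reads help phasing. A read covering no heterozygous site is assigned by the prior `rho` alone.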