Abstract:
|
The combination of variants present on a single chromosome is known as a haplotype. Knowledge of haplotypes is useful for identifying associations between genotype and disease. The haplotype phasing problem is to recover the true combination of variants on each chromosome of an individual. Existing methods based on second generation sequencing (SGS) data call variants at single sites but cannot link multiple variants, because SGS reads are short. In addition, the two haplotypes of a gene can be expressed unequally, a phenomenon called allele-specific expression (ASE). We extend the model of Bansal et al., which infers haplotypes from whole-genome sequencing data, to transcriptome sequencing data. We use Markov chain Monte Carlo (MCMC) to derive the most likely haplotype, phase each read to a haplotype, and estimate ASE. Our model has the flexibility to incorporate data from various platforms, including third generation sequencing (TGS) data. Incorporating TGS data, whose reads are longer than SGS reads, improves our model's accuracy. Our model correctly identifies haplotypes and estimates ASE in simulated data. Finally, we apply our model to data from human embryonic stem cells, which harbor extensive ASE events.
|
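As a rough illustration of the three tasks named in the abstract (deriving a likely haplotype, phasing reads, and estimating ASE), the following is a minimal Gibbs-sampling sketch in Python. It is not the model of Bansal et al. or the authors' extension. It assumes biallelic heterozygous sites coded 0/1, a fixed per-base error rate `eps`, a uniform prior on the haplotype, and a Beta(1, 1) prior on the expression fraction `theta`. The function names (`gibbs_phase`, `read_likelihood`, and so on) and the read representation (a dict mapping site index to observed allele) are hypothetical choices made for this example.

```python
import math
import random


def complement(hap):
    """Return the other haplotype at biallelic heterozygous sites."""
    return [1 - a for a in hap]


def read_likelihood(read, hap, eps):
    """P(read | hap): each covered site matches with probability 1 - eps."""
    p = 1.0
    for site, allele in read.items():
        p *= (1.0 - eps) if hap[site] == allele else eps
    return p


def log_likelihood(reads, hap, theta, eps):
    """Log-likelihood of all reads under a two-haplotype mixture."""
    comp = complement(hap)
    return sum(
        math.log(theta * read_likelihood(r, hap, eps)
                 + (1.0 - theta) * read_likelihood(r, comp, eps))
        for r in reads
    )


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gibbs_phase(reads, n_sites, eps=0.01, n_iter=200, seed=0):
    """Toy Gibbs sampler; returns (haplotype, theta, read phases).

    theta is the fraction of reads drawn from the returned haplotype,
    so theta far from 0.5 suggests allele-specific expression.
    """
    rng = random.Random(seed)
    hap = [rng.randint(0, 1) for _ in range(n_sites)]
    theta = 0.5
    w = math.log((1.0 - eps) / eps)  # evidence weight of one observed base
    best_ll, best_hap, best_theta = -math.inf, hap[:], theta
    for _ in range(n_iter):
        comp = complement(hap)
        # Step 1: sample each read's origin (0 = hap, 1 = complement).
        z = []
        for r in reads:
            p0 = theta * read_likelihood(r, hap, eps)
            p1 = (1.0 - theta) * read_likelihood(r, comp, eps)
            z.append(0 if rng.random() < p0 / (p0 + p1) else 1)
        # Step 2: sample theta from its Beta posterior given the origins.
        n0 = z.count(0)
        theta = rng.betavariate(1 + n0, 1 + len(reads) - n0)
        # Step 3: sample each site of hap given the read origins.
        for j in range(n_sites):
            log_odds = 0.0  # log P(hap[j] = 1) - log P(hap[j] = 0)
            for r, zr in zip(reads, z):
                if j in r:
                    implied = r[j] if zr == 0 else 1 - r[j]
                    log_odds += w if implied == 1 else -w
            hap[j] = 1 if rng.random() < sigmoid(log_odds) else 0
        # Keep the highest-likelihood state visited so far.
        ll = log_likelihood(reads, hap, theta, eps)
        if ll > best_ll:
            best_ll, best_hap, best_theta = ll, hap[:], theta
    # Phase each read to its more probable haplotype under the best state.
    comp = complement(best_hap)
    phases = [
        0 if best_theta * read_likelihood(r, best_hap, eps)
        >= (1.0 - best_theta) * read_likelihood(r, comp, eps) else 1
        for r in reads
    ]
    return best_hap, best_theta, phases
```

A haplotype and its complement describe the same pair of chromosomes, so the sampler may return either labeling, with `theta` or `1 - theta` as the matching expression fraction. In this sketch, reads that span more sites tie more sites together in Step 1, which is one way to see why longer TGS reads can help phasing.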
Copyright © American Statistical Association.