Abstract:
|
Genotyping and single nucleotide polymorphism (SNP) discovery via short-read technology are vital for selection and breeding in crop species. These tasks are challenging in allotetraploids, such as cotton and peanut, where alignment ambiguities lead to an excess of heterozygous calls. We propose a model-based method to infer the genomic origin of each read and to genotype allotetraploids in targeted resequencing projects. We model sequencing errors with multinomial logistic regression, penalize indels with a Poisson process, estimate the parameters via an expectation-maximization (EM) algorithm, and determine optimal haplotypes by maximizing the likelihood. The method is implemented in Rcpp and takes a SAM alignment file as input. Peanut resequencing data from several inbred lines, where heterozygosity is not expected, were used to validate our method. When applied to 12 target locations, each about 400 bp, for one inbred line, our method demonstrates high accuracy in calling SNPs. Only 2 heterozygous calls were made in two targets. This compares favorably to samtools-based genotyping, which provides the typical input for SNP calling in allotetraploids and called around 40 heterozygous positions in one target location.
|