Abstract:
|
Public genetic data enables efficient and more equitable access, transforming genetic and medical research. Due to privacy concerns, data is often released as group-level genotype frequencies rather than individual-level genotypes. Grouping can mask important information, such as fine-scale ancestry, and imprecise ancestry information may lead to misdiagnoses and incorrect genetic associations. We present a method to estimate hidden ancestry proportions in genotype frequency data. As the number of ancestries, and therefore the dimensionality of the problem, grows, estimating these proportions quickly and precisely becomes difficult. We employ Sequential Least Squares Quadratic Programming (SLSQP), an iterative minimization algorithm for constrained, nonlinear problems. Grid search took >1 hour to produce estimates for 6 ancestries at 1% precision, whereas SLSQP returns results in seconds at <0.1% precision. We apply our method to open databases, including the genome Aggregation Database (gnomAD) v2.1 African sample (N = 12,487), where we find only ~85% African ancestry, with the remaining ancestry mostly from Europe. Our method and accompanying R and Python packages provide precise ancestry information for growing open genetic resources.
|
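The constrained estimation described above can be illustrated with a minimal sketch using SciPy's SLSQP solver. This is not the accompanying package: the function name `estimate_ancestry_proportions`, the least-squares objective, and the simulated reference frequencies are all assumptions made for illustration. The sketch treats the observed allele frequencies as a mixture of reference-ancestry allele frequencies and searches for proportions that are non-negative and sum to one.

```python
import numpy as np
from scipy.optimize import minimize


def estimate_ancestry_proportions(ref_freqs, obs_freqs):
    """Estimate mixture proportions with SLSQP (illustrative sketch).

    ref_freqs: (n_snps, k) array of reference allele frequencies,
               one column per ancestry (hypothetical input layout).
    obs_freqs: (n_snps,) array of observed group allele frequencies.
    """
    n_snps, k = ref_freqs.shape

    # Assumed objective: squared error between observed and mixed frequencies.
    def objective(pi):
        resid = ref_freqs @ pi - obs_freqs
        return resid @ resid

    # Analytic gradient of the objective above.
    def gradient(pi):
        return 2.0 * ref_freqs.T @ (ref_freqs @ pi - obs_freqs)

    # Proportions must sum to 1 and each lie in [0, 1].
    constraints = [{"type": "eq", "fun": lambda pi: np.sum(pi) - 1.0}]
    bounds = [(0.0, 1.0)] * k
    start = np.full(k, 1.0 / k)  # uniform starting guess

    result = minimize(
        objective,
        start,
        jac=gradient,
        method="SLSQP",
        bounds=bounds,
        constraints=constraints,
        options={"ftol": 1e-12, "maxiter": 500},
    )
    return result.x


# Toy data only: simulated reference frequencies and made-up proportions.
rng = np.random.default_rng(0)
toy_ref = rng.uniform(0.05, 0.95, size=(2000, 3))
toy_true = np.array([0.6, 0.3, 0.1])
toy_obs = toy_ref @ toy_true  # noise-free mixture

toy_est = estimate_ancestry_proportions(toy_ref, toy_obs)
print(np.round(toy_est, 3))
```

Because the toy mixture is noise-free, the estimate should land close to the simulated proportions. Real frequency data would include sampling noise, and the objective used by the published packages may differ from the one assumed here.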