Abstract:
|
Predictive models of gene expression, or transcriptomics, built from genotyping or sequencing data are now widely used to identify genes involved in complex traits. One of the most popular transcriptomic prediction methods is PrediXcan, in which predictions are derived from supervised machine learning algorithms such as LASSO and elastic net. PrediXcan models were trained on data from subjects with European ancestry; however, many genetic studies include samples from ancestrally diverse populations. Using transcriptomic data from the GEUVADIS (Genetic European Variation in Health and Disease) RNA sequencing project and whole genome sequencing data from the 1000 Genomes project, we evaluate and compare the predictive performance of PrediXcan in diverse populations. We show that predictive performance varies across populations, with the Yoruban (YRI) sample from Nigeria having, on average, the lowest correlations between observed and predicted gene expression values. We also propose a new approach for modeling gene expression that incorporates ancestry to improve prediction accuracy in diverse populations, including populations with admixed ancestry.
|