Abstract:
|
Ancestry inference is routinely performed in genetic studies for quality control and association analyses. We present a support-vector-machine (SVM)-based method that identifies the most likely ancestral group(s) for an individual by leveraging known ancestry in a reference dataset (e.g., the 1000 Genomes Project data). The method first projects each study sample onto the principal component (PC) space of the reference dataset; an SVM is then trained on the reference samples, whose ancestry is known, and used to classify the ancestry of each study sample. The algorithm has been integrated into the computationally efficient tool KING, and the implementation scales to large datasets containing over one million individuals. We assessed the performance of our algorithm using 13,181 subjects who were genotyped with both the Illumina HumanCoreExome Beadarray and the Illumina ImmunoChip Beadarray. We also predicted ancestry for 469,660 subjects in the UK Biobank. Of 441,441 subjects reporting white ethnicity, 99.9% were classified as European; of 10,971 reporting Asian ethnicity, 97.0% were classified as South or East Asian; and of 7,637 reporting black ethnicity, 97.2% were classified as African.
|
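The two-step procedure described in the abstract (project study samples onto reference PCs, then classify with an SVM trained on the reference) can be illustrated with a minimal, dependency-free sketch. This is not KING's implementation. Everything here is an assumption made for illustration: the genotypes are simulated, the two groups `POP_A` and `POP_B` are hypothetical, the PCs are computed by simple power iteration, and a linear SVM fitted by batch subgradient descent stands in for whatever SVM variant and kernel the actual tool uses.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def simulate(rng, n, freqs):
    """Draw n genotype vectors (0/1/2 allele counts) for the given allele frequencies."""
    return [[(rng.random() < p) + (rng.random() < p) for p in freqs] for _ in range(n)]

def top_pcs(X, k, rng, iters=50):
    """Top-k PC loadings of a centered matrix X via power iteration with deflation."""
    X = [row[:] for row in X]
    loadings = []
    for _ in range(k):
        v = [rng.gauss(0, 1) for _ in X[0]]
        for _ in range(iters):
            scores = [dot(row, v) for row in X]                      # X v
            w = [sum(s * row[j] for s, row in zip(scores, X))        # X^T (X v)
                 for j in range(len(v))]
            norm = dot(w, w) ** 0.5
            v = [a / norm for a in w]
        loadings.append(v)
        scores = [dot(row, v) for row in X]
        X = [[x - s * vj for x, vj in zip(row, v)] for row, s in zip(X, scores)]
    return loadings

def project(rows, means, loadings):
    """Center rows with the REFERENCE means, then project onto the reference loadings."""
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    return [[dot(row, v) for v in loadings] for row in centered]

def train_linear_svm(points, labels, lam=0.01, lr=0.1, steps=200):
    """Linear SVM (hinge loss + L2 penalty) fitted by batch subgradient descent."""
    dim, n = len(points[0]), len(points)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        gw, gb = [lam * wj for wj in w], 0.0
        for x, y in zip(points, labels):
            if y * (dot(w, x) + b) < 1:          # margin violated
                for j in range(dim):
                    gw[j] -= y * x[j] / n
                gb -= y / n
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

rng = random.Random(0)
# Hypothetical allele frequencies for two well-separated groups over 50 SNPs.
freq_a = [0.9] * 25 + [0.1] * 25
freq_b = [0.1] * 25 + [0.9] * 25
reference = simulate(rng, 20, freq_a) + simulate(rng, 20, freq_b)
ref_labels = [1] * 20 + [-1] * 20                # +1 = "POP_A", -1 = "POP_B"
study = simulate(rng, 5, freq_a) + simulate(rng, 5, freq_b)

# Step 1: build the PC space from the reference only, then project both datasets.
means = [sum(col) / len(reference) for col in zip(*reference)]
centered_ref = [[x - m for x, m in zip(row, means)] for row in reference]
loadings = top_pcs(centered_ref, 2, rng)
ref_scores = project(reference, means, loadings)
study_scores = project(study, means, loadings)

# Step 2: train on reference PCs with known labels, classify study samples.
w, b = train_linear_svm(ref_scores, ref_labels)
predicted = ["POP_A" if dot(w, s) + b > 0 else "POP_B" for s in study_scores]
print(predicted)
```

The point the sketch is meant to convey is that the PC space and the classifier are both built from the reference data alone; study samples are only projected and classified, never used to define the axes.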