Genome-wide association studies for predicting hypertension: Comparing Support Vector Machines and Permanental Classification

*Hsin-Hsiung Huang, Mr. 
Jie Yang, Prof. 

Keywords: hypertension, SNP, genotype, phenotype, SVM, permanental classification

Background: Genetic Analysis Workshop 18 (GAW18) provides genome-wide association study (GWAS) data on 1043 individuals from 20 Mexican American pedigrees from San Antonio, Texas, enriched for type 2 diabetes. The data are longitudinal, with three measurements at four time intervals (1981 to 1996, 1997 to 2000, 1998 to 2006 and 2009 to 2011). Because the original phenotype data contain missing observations, GAW18 also provides 200 replicates of simulated data covering 849 individuals, each with hypertension status, GWAS genotypes (SNPs), age, sex, smoking status, parents and pedigree information.

Methods: Our analysis uses the GWAS data for Chromosome 3, which contain 65519 single-nucleotide polymorphisms (SNPs), together with the simulated phenotype data. The goal is to predict whether a person will have hypertension. The analysis has two steps. In the first step, we select significant SNPs by logistic regression. In the second step, we use the selected SNPs as predictors in support vector machines (SVM) and the newly developed permanental classification (PC) method. We vary the number of significant SNPs and compare the corresponding prediction error rates of SVM and PC. We also identify rare variants and investigate their impact on prediction.
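The SNP selection step can be illustrated with a minimal sketch. The abstract does not say how the logistic regression screen was run, so the following assumes one univariate logistic regression per SNP, with SNPs ranked by the absolute Wald statistic of the genotype coefficient. The function names, the 0/1/2 genotype coding and the toy simulation are all hypothetical, and the covariates and the SVM/PC classifiers are left out.

```python
import math
import random


def fit_single_snp_logistic(genotypes, labels, n_iter=25):
    """Fit logit P(y=1) = b0 + b1*g by Newton-Raphson; return (b1, Wald z)."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for g, y in zip(genotypes, labels):
            eta = max(-30.0, min(30.0, b0 + b1 * g))  # clamp to avoid overflow
            p = 1.0 / (1.0 + math.exp(-eta))
            w = p * (1.0 - p)
            g0 += y - p          # score for the intercept
            g1 += (y - p) * g    # score for the slope
            h00 += w             # information matrix entries
            h01 += w * g
            h11 += w * g * g
        det = h00 * h11 - h01 * h01
        if det <= 1e-9 * h00 * h11:
            return 0.0, 0.0      # degenerate fit, e.g. a monomorphic SNP
        # Newton step: beta += information^{-1} * score
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    # Var(b1) is the slope entry of the inverse information matrix
    return b1, b1 / math.sqrt(h00 / det)


def select_top_snps(genotype_matrix, labels, k):
    """Rank SNP columns by |Wald z| and return the indices of the k strongest."""
    scores = []
    for j in range(len(genotype_matrix[0])):
        column = [row[j] for row in genotype_matrix]
        _, z = fit_single_snp_logistic(column, labels)
        scores.append((abs(z), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]


def simulate_toy_data(n=400, n_snps=20, causal=3, maf=0.3, seed=0):
    """Toy genotypes (0/1/2) where only SNP `causal` affects the binary label."""
    rng = random.Random(seed)
    X = [[(rng.random() < maf) + (rng.random() < maf) for _ in range(n_snps)]
         for _ in range(n)]
    y = []
    for row in X:
        p = 1.0 / (1.0 + math.exp(-(-1.5 + 2.0 * row[causal])))
        y.append(1 if rng.random() < p else 0)
    return X, y


if __name__ == "__main__":
    X, y = simulate_toy_data()
    print(select_top_snps(X, y, k=5))
```

In a full analysis the selected columns, together with the covariates, would then be passed to the classifiers, with the screen fitted on training data only so that the test error is not biased by the selection.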

Results: We use the SNPs and the covariates Smoke, Age, Sex, the interaction of Age and Sex, Mother, Father and Pedigree, chosen by logistic regression, as predictors for SVM and PC. The numbers of SNPs used for SVM and PC are 0, 5, 10, 15, 20, 50, 100 and 200. Without SNPs, the error rates of SVM and PC are 24% and 26% respectively; with the 100 most significant SNPs, both prediction error rates drop to 12%. Rare variants provide only small improvements (1 to 2%) in prediction. Our results show that SVM and PC both outperform logistic regression and are comparable to each other in predicting hypertension status.

Conclusion: The significant common SNP variants are highly correlated with hypertension, so they help predict it. Because rare variants occur only in specific families, they may help identify which SNPs may cause hypertension in those families. Collapsing methods, which create dummy variables indicating the presence of rare variants in a gene, can be more powerful, and many such approaches exist in the literature. The error rates of SVM and PC are both close to 12% when the 100 or 200 most significant SNPs are used. When the 200 most significant SNPs are used as predictors, the testing error rate increases slightly for SVM and decreases for PC, which suggests that SVM overfits in this setting.
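As an illustration of the collapsing idea, the sketch below builds one 0/1 dummy variable per gene that marks individuals carrying at least one rare variant in that gene. It is only one possible reading of "collapsing" and not necessarily a method used in this study. It assumes genotypes are coded as minor-allele counts (0/1/2); the SNP-to-gene map, the gene names and the frequency threshold are hypothetical.

```python
def collapse_rare_variants(genotypes, snp_to_gene, maf_threshold=0.01):
    """Return {gene: [0/1 per individual]} marking carriers of any rare variant.

    genotypes   : list of rows, one per individual, of minor-allele counts
    snp_to_gene : list giving the gene label of each SNP column (assumed known)
    """
    n_ind = len(genotypes)
    burden = {}
    for j, gene in enumerate(snp_to_gene):
        column = [row[j] for row in genotypes]
        freq = sum(column) / (2.0 * n_ind)  # two alleles per individual
        if freq == 0.0 or freq >= maf_threshold:
            continue  # skip monomorphic and common variants
        indicator = burden.setdefault(gene, [0] * n_ind)
        for i, g in enumerate(column):
            if g > 0:
                indicator[i] = 1  # carries at least one copy of a rare allele
    return burden


# Toy data: 5 individuals, 4 SNPs in two hypothetical genes.
toy_genotypes = [
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 2, 0, 0],
    [0, 0, 0, 1],
    [0, 1, 1, 0],
]
toy_genes = ["GENE_A", "GENE_A", "GENE_B", "GENE_B"]

# A large threshold is used only because the toy sample is tiny.
toy_burden = collapse_rare_variants(toy_genotypes, toy_genes, maf_threshold=0.25)
```

Each resulting indicator could be added as an extra predictor alongside the common SNPs, which pools the signal of variants that are individually too rare to test.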