Keywords: false discovery rate, generalized linear models, GWAS, knockoff variables
In genetics, researchers attempt to identify a subset of single-nucleotide polymorphisms (SNPs) associated with a certain type of disease in order to better understand the disease mechanism. The problem can be modeled as a classification problem, in which we try to explain a binary response by a group of potential explanatory variables (SNPs). Generalized linear models (GLMs) are widely used tools in these cases. However, there are still several open problems concerning how to perform controlled variable selection for GLMs, especially in the high-dimensional setting. In this work, we introduce a variable selection approach for probit and logistic regression models. Built on the knockoffs framework (Barber and Candès, 2015), our procedure starts by constructing a group of knockoff variables geometrically and then calculates the test statistics based on a Bayesian model. We show that the approach achieves false discovery rate (FDR) control asymptotically, without imposing normality assumptions on the regression matrix. We conduct a range of numerical experiments to demonstrate the FDR control and the power of the proposed method. When applied to the dataset collected in the Swedish schizophrenia population-based case-control exome sequencing experiment (Purcell et al., 2014), with a target FDR of 0.2, our method identifies about 10 genes associated with the disease within a candidate gene set of size 1796, whereas in the previous study no individual gene-based test achieved significance after a Bonferroni correction.
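For readers unfamiliar with the knockoffs framework, the following is a minimal sketch of the final selection step of the knockoff+ filter of Barber and Candès (2015), on which the procedure is built. It assumes that a vector of test statistics W has already been computed, with large positive values indicating evidence for a variable and the signs of null statistics being symmetric. The names `knockoff_plus_threshold`, `knockoff_select`, `W`, and `q` are illustrative assumptions, as is the toy input; the abstract does not state which thresholding rule the proposed method uses, and the geometric knockoff construction and the Bayesian test statistics are not reproduced here.

```python
import numpy as np


def knockoff_plus_threshold(W, q):
    """Return the knockoff+ data-dependent threshold for target FDR level q.

    The threshold is the smallest t among the nonzero |W_j| such that
    (1 + #{j : W_j <= -t}) / max(1, #{j : W_j >= t}) <= q.
    Returns np.inf if no such t exists (nothing is selected).
    """
    W = np.asarray(W, dtype=float)
    # Candidate thresholds: the distinct nonzero magnitudes, in increasing order.
    candidates = np.sort(np.unique(np.abs(W[W != 0])))
    for t in candidates:
        # Negative statistics estimate the number of false positives above t.
        fdp_estimate = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_estimate <= q:
            return float(t)
    return np.inf


def knockoff_select(W, q):
    """Return the indices of variables whose statistic reaches the threshold."""
    W = np.asarray(W, dtype=float)
    t = knockoff_plus_threshold(W, q)
    return np.flatnonzero(W >= t)


# Toy statistics, made up purely for illustration (not from any dataset).
W_toy = [5.0, 4.0, 3.0, 2.5, 2.0, -1.0, 1.5, 0.5, -0.3, 6.0]
selected = knockoff_select(W_toy, q=0.2)
```

With the toy input above and q = 0.2 (the target level used in the schizophrenia application), the variables whose statistics are at least the returned threshold are reported as discoveries; the "+1" in the numerator is what distinguishes knockoff+ from the less conservative knockoff threshold.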