Keywords: determination index, variable selection, ultra-high dimensional data, machine learning
In this presentation we introduce and discuss a new concept, the Genomic Determination Index (GeDI), to address two questions in model selection for large-scale statistical genomics: (1) how much of the variability in a phenotype can be explained by large sets of diverse genomic factors, which may total up to a few million; and (2) which specific genomic factors are largely responsible for the explained phenotypic variation? Like heritability in quantitative genetics, GeDI measures the proportion of the phenotypic variance attributable to variation in a set of genomic factors under an assumed population model. No existing large-scale sparse regression or machine-learning method can effectively address these questions. A method to estimate GeDI is presented. It consists of three steps: initial variable screening, regression modeling with forward variable selection driven by increments in GeDI, and a permutation analysis to correct for selection bias. The entire development will be illustrated and evaluated using a diverse dataset from a study of the ex vivo sensitivity of acute lymphoblastic leukemia cells to glucocorticoid treatment. The genomic factors consist of mRNA and microRNA expression, DNA methylation markers, SNPs, and copy number variations. Some simulation results will be presented if time permits.
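The three-step structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the GeDI estimator itself: the abstract does not give the estimator, so ordinary least-squares R^2 stands in for the index, marginal correlation stands in for the screening rule, and the bias correction simply subtracts the average index obtained on permuted phenotypes. All function names, parameters, and the simulated data are hypothetical.

```python
import numpy as np

def r_squared(X, y):
    """Proportion of variance in y explained by an OLS fit on X (with intercept)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid.var() / y.var()

def screen(X, y, keep):
    """Step 1 (assumed rule): keep the features with the largest |marginal correlation|."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    return np.argsort(-np.abs(corr))[:keep]

def forward_select(X, y, candidates, max_terms, min_gain):
    """Step 2: greedily add the candidate that increases R^2 the most."""
    selected, best = [], 0.0
    remaining = [int(j) for j in candidates]
    while remaining and len(selected) < max_terms:
        scores = [r_squared(X[:, selected + [j]], y) for j in remaining]
        k = int(np.argmax(scores))
        if scores[k] - best < min_gain:
            break  # stop when the increment is too small
        best = scores[k]
        selected.append(remaining.pop(k))
    return selected, best

def estimate_index(X, y, keep=20, max_terms=5, min_gain=0.01, n_perm=20, seed=0):
    """Steps 1-3: screen, select, then correct with a permutation null (assumed correction)."""
    rng = np.random.default_rng(seed)
    selected, raw = forward_select(X, y, screen(X, y, keep), max_terms, min_gain)
    null = []
    for _ in range(n_perm):
        yp = rng.permutation(y)  # break any real association with the phenotype
        _, r = forward_select(X, yp, screen(X, yp, keep), max_terms, min_gain)
        null.append(r)
    corrected = max(0.0, raw - float(np.mean(null)))
    return selected, raw, corrected

# Simulated example (hypothetical data): two true factors among 500.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 500))
y = 2.0 * X[:, 3] - 1.5 * X[:, 10] + rng.standard_normal(200)
selected, raw, corrected = estimate_index(X, y)
print(selected, round(raw, 3), round(corrected, 3))
```

Repeating screening and selection inside each permutation is the design point: the null then reflects the optimism introduced by the whole selection procedure, not just by the final regression fit.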