Abstract:
Genome-wide data comprising millions of single nucleotide polymorphisms (SNPs) can be highly correlated because of linkage disequilibrium (LD). The ultra-high dimensionality of such data poses serious challenges for statistical modeling, including noise accumulation, the curse of dimensionality, computational burden, spurious correlation, and processing and storage bottlenecks. Traditional statistical approaches lose power because p >> n (the number of SNPs far exceeds the number of individuals) and because of the complex correlation structure among SNPs. We propose an integrated DC-RR approach that accommodates both the ultra-high dimensionality and the complex correlation structure. First, a distance correlation (DC) based feature screening step selects the most important candidate SNPs and removes noise. Second, a ridge-penalized multiple logistic regression model addresses the correlation structure among the retained SNPs. Several simulations verified a gain in power, a stable type I error rate, and an especially dramatic decrease in computational time. We analyzed an Arabidopsis data set with 84 individuals and 216,100 SNPs and detected significant SNPs.
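The two stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`distance_correlation`, `dc_screen`, `ridge_logistic`), the 0/1/2 genotype coding, the number of SNPs kept (`keep`), the penalty weight (`lam`), and the simulated data are all assumptions made for illustration, and the sketch stops at coefficient estimation without the significance testing the abstract refers to.

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation between two 1-D arrays."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Pairwise absolute distances within each variable.
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    # Double-center each distance matrix (subtract row and column means, add grand mean).
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    dcov2 = (A * B).mean()
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return 0.0 if denom == 0 else float(np.sqrt(max(dcov2, 0.0) / denom))

def dc_screen(X, y, keep):
    """Stage 1: rank SNPs by distance correlation with y and keep the top `keep`."""
    scores = np.array([distance_correlation(X[:, j], y) for j in range(X.shape[1])])
    top = np.argsort(scores)[::-1][:keep]
    return top, scores

def ridge_logistic(X, y, lam=1.0, n_iter=50, tol=1e-8):
    """Stage 2: ridge-penalized logistic regression fitted by Newton's method."""
    n, p = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])      # prepend an intercept column
    penalty = lam * np.eye(p + 1)
    penalty[0, 0] = 0.0                        # assumption: intercept is not penalized
    beta = np.zeros(p + 1)
    for _ in range(n_iter):
        eta = np.clip(Xb @ beta, -30.0, 30.0)  # clip to avoid overflow in exp
        mu = 1.0 / (1.0 + np.exp(-eta))
        w = mu * (1.0 - mu)
        grad = Xb.T @ (y - mu) - penalty @ beta
        hess = Xb.T @ (Xb * w[:, None]) + penalty
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Simulated example (hypothetical data): 80 individuals, 200 SNPs coded 0/1/2,
# with a binary phenotype driven by SNP index 5.
rng = np.random.default_rng(0)
n, p, keep = 80, 200, 10
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)
prob = 1.0 / (1.0 + np.exp(-(-2.0 + 2.0 * X[:, 5])))
y = (rng.random(n) < prob).astype(float)

top, scores = dc_screen(X, y, keep)          # screening on all p SNPs
beta = ridge_logistic(X[:, top], y, lam=1.0)  # joint model on the retained SNPs
```

Screening each SNP marginally keeps the first stage cheap, while the ridge penalty in the second stage keeps the joint fit stable when the retained SNPs are correlated with one another. In practice `keep` and `lam` would need to be chosen by a data-driven rule; the values here are placeholders.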
Copyright © American Statistical Association.