Abstract:
|
Due to the decreasing cost of high-throughput sequencing and genotyping, large-scale biobank datasets with hundreds of thousands of sequenced samples are becoming available. When linked with electronic health records (EHRs), these datasets can also include thousands of traits. Together, biobank-scale datasets may contain up to 10^16 data entries, about 10,000 times more than a GWAS dataset with 10,000 samples and 1 million genotyped variants. It may take 2 CPU years to complete the standard association analysis for all traits in a biobank-scale dataset. Biobank-scale datasets are quickly outpacing existing software packages. There is a compelling need to develop more efficient tools that scale well with ultra-large datasets from modern genetic studies. To address this need, we develop a novel statistical method that uses sufficient statistics to maximize dimension reduction and eliminate redundant computation while retaining all information necessary for association analysis. The method can be hundreds of times faster than the fastest available tools, such as PLINK2. We expect that the new tool will play an important role in next-generation sequencing and EHR-based genetic studies.
|