Abstract:
|
High-dimensional hypothesis testing is challenging in high-throughput genomic data settings, which are characterized by large p and small n. To boost statistical power and reduce the cost of biological experiments, we propose a new statistical strategy that leverages public controls in case-control studies with limited sample sizes. In our two-stage epigenome-wide association study (EWAS) of 44 pancreatic cancer cases and 20 controls at the MD Anderson Cancer Center (MDACC), we increased the number of controls from 20 to 556 by integrating public data from the Framingham Heart Study with the MDACC controls. We successfully removed the batch effects between the two datasets from different sources, as shown by visualization with unsupervised learning. In the validation stage, we replicated 6 significantly differentially methylated CpG probes (DMPs) and 3 differentially methylated regions (DMRs). Causal inference using Mendelian randomization provided evidence for the direction of the associations between DMPs and pancreatic cancer. RNA-sequencing analysis further illustrates the functional consequences of the DMPs/DMRs for cancer risk.
|