Abstract:
|
Electronic health records (EHRs) linked to blood samples form a powerful data resource for testing associations between genotypes and phenotypes, provided that accurate phenotype information can be extracted from the health records. Some existing strategies require validation sets with "gold standard" phenotypes, but these can be time-consuming to create, which is especially prohibitive when many phenotypes are of interest, as in phenome-wide association studies (PheWASs). Other strategies identify cases by thresholding counts of billing codes related to each disease; these strategies are rapid but phenotype inaccurately, which may compromise statistical power. We propose a new method for performing genetic association tests in this setting that better leverages the information in the billing code counts. The method uses unsupervised clustering to separate patients into two groups based on their diagnosis codes, and each subject is assigned a probability of being a disease case based on that clustering. The method is rapid and can improve power to detect known associations relative to the standard thresholding-based methods.
|
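To make the clustering idea concrete, the following is a minimal sketch of one way to turn billing code counts into case probabilities. The abstract does not specify the clustering model, so everything here is an assumption: a two-component Poisson mixture fitted by EM on a single count per patient, the hypothetical function names `poisson_logpmf` and `case_probabilities`, and the toy counts, which are illustrative and not data from the study.

```python
import math


def poisson_logpmf(k, lam):
    """Log of the Poisson probability mass function at count k with rate lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def case_probabilities(counts, n_iter=200):
    """Fit a two-component Poisson mixture by EM (an assumed model, not
    necessarily the one used in the paper) and return, for each patient,
    the posterior probability of belonging to the high-count cluster."""
    n = len(counts)
    mean = sum(counts) / n
    # Start the two rates on either side of the overall mean.
    lam_lo, lam_hi = max(0.5 * mean, 1e-3), max(2.0 * mean, 1.0)
    pi_hi = 0.5  # mixing weight of the high-count ("case") cluster
    post = [0.5] * n
    for _ in range(n_iter):
        # E-step: posterior probability of the high-count cluster per patient.
        post = []
        for k in counts:
            a = math.log(pi_hi) + poisson_logpmf(k, lam_hi)
            b = math.log(1.0 - pi_hi) + poisson_logpmf(k, lam_lo)
            m = max(a, b)  # subtract the max for numerical stability
            ea, eb = math.exp(a - m), math.exp(b - m)
            post.append(ea / (ea + eb))
        # M-step: update the mixing weight and the two rates.
        w_hi = sum(post)
        w_lo = n - w_hi
        pi_hi = min(max(w_hi / n, 1e-6), 1.0 - 1e-6)
        lam_hi = max(sum(p * k for p, k in zip(post, counts)) / max(w_hi, 1e-12), 1e-6)
        lam_lo = max(sum((1.0 - p) * k for p, k in zip(post, counts)) / max(w_lo, 1e-12), 1e-6)
    return post


# Toy example: billing code counts for one disease across 14 patients.
counts = [0, 0, 1, 0, 2, 1, 0, 0, 1, 0, 12, 15, 9, 14]
probs = case_probabilities(counts)
for k, p in zip(counts, probs):
    print(f"count={k:2d}  P(case)={p:.3f}")
```

Unlike a hard threshold on the counts, this yields a graded value per patient, which could then stand in for a binary case/control label as the phenotype in an association test.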