
All Times EDT

Abstract Details

Activity Number: 522 - Life Science Applications of Data Science
Type: Contributed
Date/Time: Thursday, August 11, 2022 : 8:30 AM to 10:20 AM
Sponsor: Section on Statistical Learning and Data Science
Abstract #323207
Title: Evaluating Model Diagnostics Tools for Non-Genetic Associations in Large Scale Data Sets
Author(s): Stella Aslibekyan* and Christophe Toukam Tchakoute and Teresa Filshtein Sonmez and Robert Gentleman
Companies: 23andMe, Inc. and Stanford University and 23andMe, Inc. and Center for Computational Biomedicine, Harvard Medical School
Keywords: Big Data; Model Diagnostics; Residual confounding; Goodness of fit
Abstract:

As digital health cohorts and electronic medical record systems proliferate, participation in them is pushing sample sizes in clinical research to unprecedented levels (N > 500K). At this scale, it is not known how general misspecification of a model’s form affects the validity of the findings. Environment-Wide Association Studies (EnWAS) on large-scale datasets, such as those collected by 23andMe, Inc. or UK Biobank, have enabled new hypothesis generation regarding the contribution of different variables to specific conditions. However, in such large-scale analyses, important continuous variables such as age and BMI are usually adjusted for using linear terms, without any in-depth evaluation of goodness of fit. In this study, guided by results from the 23andMe EnWAS (~1000 outcomes fit to ~1000 predictors plus covariates, i.e., 1M+ models), we provide a set of best practices for checking assumptions, assessing goodness of fit, and evaluating effects in large-scale analyses. We also show that modeling continuous variables such as age or BMI in their proper functional form reduces spurious associations due to residual confounding. Finally, we validate all our results in UK Biobank.
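
The residual-confounding point can be illustrated with a small simulation. This is a minimal sketch on synthetic data, not the authors' analysis. The variable names, the U-shaped age effect, the sample size, and the piecewise-linear (hinge) spline basis are all illustrative assumptions. The exposure has no direct effect on the outcome, yet adjusting for age with only a linear term leaves a sizeable exposure coefficient, while a flexible age term largely removes it.

```python
import numpy as np


def fit_exposure_coef(y, exposure, age_basis):
    """OLS of y on [intercept, exposure, age terms]; return the exposure coefficient."""
    X = np.column_stack([np.ones_like(y), exposure, age_basis])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]


def hinge_basis(x, knots):
    """Linear term plus hinge terms max(x - k, 0), i.e. a piecewise-linear spline."""
    return np.column_stack([x] + [np.maximum(x - k, 0.0) for k in knots])


# Synthetic cohort (assumption: all values below are made up for illustration).
rng = np.random.default_rng(0)
n = 50_000
age = rng.uniform(20, 80, n)
age_c = (age - 50) / 10  # centered and scaled age

# Exposure and outcome both depend on age through a U-shape;
# the exposure has NO direct effect on the outcome.
exposure = age_c**2 + rng.normal(0, 1, n)
outcome = 0.5 * age_c**2 + rng.normal(0, 1, n)

# Adjustment 1: age entered as a single linear term (misspecified).
coef_linear = fit_exposure_coef(outcome, exposure, age_c[:, None])

# Adjustment 2: age entered through a hinge spline with knots at quantiles.
knots = np.quantile(age_c, [0.1, 0.3, 0.5, 0.7, 0.9])
coef_spline = fit_exposure_coef(outcome, exposure, hinge_basis(age_c, knots))

print(f"exposure coefficient, linear age adjustment: {coef_linear:.3f}")
print(f"exposure coefficient, spline age adjustment: {coef_spline:.3f}")
```

In an actual analysis, the functional form would be chosen and checked with goodness-of-fit diagnostics rather than fixed in advance, and a smoother basis (for example, natural cubic splines) may be preferable to the hinge terms used here.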


Authors who are presenting talks have a * after their name.

Back to the full JSM 2022 program