Abstract:
|
As digital health cohorts and electronic medical records grow in number and in participation, sample sizes in clinical research are reaching unprecedented levels (N > 500K). At this scale, it remains unclear how misspecifying a model's functional form affects the validity of the findings. Environment-Wide Association Studies (EnWAS) on large-scale datasets, such as those collected by 23andMe, Inc. or UK Biobank, have enabled new hypothesis generation about the contribution of different variables to specific conditions. However, in such large-scale analyses, important continuous variables such as age and BMI are usually adjusted for using linear terms, without any in-depth evaluation of goodness of fit. In this study, guided by results from the 23andMe EnWAS (~1,000 outcomes each fit to ~1,000 predictors plus covariates, for more than 1M models), we provide a set of best practices for checking assumptions, assessing goodness of fit, and evaluating effects in large-scale analyses. We also show that modeling continuous variables such as age or BMI in their proper functional form reduces spurious associations due to residual confounding. Finally, we validate all our results using UK Biobank.
|