Abstract:
|
Observational health data, such as electronic health records and insurance claims, make it possible to study many questions, including fitting prediction models for health outcomes of interest that may draw on tens of thousands of covariates. Traditionally, these analyses are completely data-driven, for example using L1-regularized regression with identical priors on all covariates. Here we propose integrating existing knowledge into the priors. We automatically extract known risk factors for thousands of diseases from Wikipedia using page-to-page links and page-to-code links (e.g., to ICD-10 codes). Priors for risk factors remain centered on 0, but their variance is governed by a second hyperparameter, separate from the one that sets the variance for all other covariates. Cross-validation is used to select both hyperparameters. As a proof of concept, we fit predictive models for cardiovascular events in diabetes populations using large claims databases. The results show that the selected variance for risk factors is much larger than that for non-risk factors, and that predictive accuracy is comparable between informed and uninformed models, while the informed models are more parsimonious and more likely to include the known risk factors.
|
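The two-variance prior described in the abstract can be illustrated with a minimal sketch. It assumes a zero-centered Laplace prior on each coefficient (the Bayesian reading of L1 regularization), so a prior variance v corresponds to an L1 penalty of sqrt(2 / v). The function names, the absence of an intercept, and the proximal-gradient (ISTA) solver are illustrative choices made here, not the authors' implementation.

```python
import numpy as np


def penalty_weights(is_risk_factor, var_risk, var_other):
    """Per-covariate L1 penalties implied by zero-centered Laplace priors.

    A Laplace prior with variance v has scale b = sqrt(v / 2), and its
    negative log-density adds |beta| / b = sqrt(2 / v) * |beta| to the
    objective. Known risk factors get var_risk; all others get var_other.
    """
    is_risk_factor = np.asarray(is_risk_factor, dtype=bool)
    variances = np.where(is_risk_factor, var_risk, var_other)
    return np.sqrt(2.0 / variances)


def soft_threshold(z, t):
    """Proximal operator of the (weighted) L1 norm."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def fit_l1_logistic(X, y, penalties, n_iter=2000):
    """Logistic regression with per-covariate L1 penalties, fitted by ISTA.

    Minimizes the summed negative log-likelihood plus
    sum_j penalties[j] * |beta[j]|. No intercept, to keep the sketch short.
    """
    n_features = X.shape[1]
    # The gradient of the summed logistic loss is Lipschitz with constant
    # at most ||X||_2^2 / 4, so 4 / ||X||_2^2 is a safe step size.
    step = 4.0 / (np.linalg.norm(X, 2) ** 2)
    beta = np.zeros(n_features)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (prob - y)
        beta = soft_threshold(beta - step * grad, step * penalties)
    return beta
```

Under these assumptions, selecting the two hyperparameters amounts to scoring a grid of `(var_risk, var_other)` pairs by cross-validated predictive performance and keeping the best pair. A larger selected `var_risk` means a weaker penalty on known risk factors, which makes them more likely to stay in the model.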