Abstract:
|
There is growing evidence for the importance of the social environment (i.e., the neighborhood in which one lives) in health outcomes and disparities. However, it is unclear which social-environmental measures are most relevant. Census data is often used to quantify neighborhood factors, as it includes many relevant area-level measures such as demographics, income/poverty, education, transportation, and housing. Using this data is challenging because of its high dimensionality (>14,000 variables at the census tract level) and its complex correlation structure. In this work, we adapt empiric computing approaches to variable selection for use with census data. Methods include penalized regression (lasso, elastic net), random forests, and cluster-based analysis. Using simulations, we test which methods are most effective at identifying variables truly associated with binary and continuous outcomes. We then apply the most promising methods to an analysis of newly diagnosed prostate cancer cases in Pennsylvania, demonstrating how empiric models could be used to identify social-environmental characteristics of high-risk neighborhoods and to improve risk stratification.
|