Abstract:
|
County-level studies of variation in disease prevalence rely on small area estimation (SAE) techniques when survey sample sizes are too small to provide adequate direct estimates. For some small counties, the sample size may even be zero. SAE models borrow strength from neighboring counties using individual-level information on the outcomes of interest and auxiliary variables. However, heterogeneity in the auxiliary variables across counties can introduce large variation and bias into SAE and thus reduce the accuracy of the estimates. To address this issue, we developed a two-stage SAE approach. First, we used a Gaussian mixture model fitted by expectation-maximization (a machine learning technique) to cluster U.S. counties into groups of nearest neighbors based on county-level population characteristics (e.g., age, sex, and race) and socioeconomic factors (e.g., income and unemployment). Second, we applied Bayesian hierarchical models to estimate county-level prevalence by borrowing strength within each nearest-neighbor cluster. We evaluated the new approach with both empirical and simulated data. On average, it reduced the mean squared error by 23.8% and the bias by a factor of more than 1.5.
|
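The second stage, borrowing strength within a cluster, can be illustrated with a minimal sketch. It is not the Bayesian hierarchical model used in the study: it swaps in a much simpler beta-binomial shrinkage estimator, and it assumes that cluster labels from the first stage are already available. The function name `cluster_shrunk_prevalence`, the `prior_strength` parameter, and the toy counts are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def cluster_shrunk_prevalence(counties, prior_strength=20.0):
    """Shrink each county's direct estimate toward its cluster's pooled prevalence.

    counties: list of dicts with keys 'name', 'cluster', 'cases', 'n'.
    prior_strength: hypothetical tuning value; the prior counts as this many
    pseudo-respondents at the cluster's pooled prevalence.
    """
    # Pool cases and sample sizes over all counties in the same cluster.
    totals = defaultdict(lambda: [0, 0])
    for c in counties:
        totals[c["cluster"]][0] += c["cases"]
        totals[c["cluster"]][1] += c["n"]

    estimates = {}
    for c in counties:
        cases, n = totals[c["cluster"]]
        pooled = cases / n if n > 0 else 0.0
        # Beta(a, b) prior centred on the pooled cluster prevalence.
        a = prior_strength * pooled
        # Posterior mean of the beta-binomial model: (cases + a) / (n + a + b),
        # where a + b equals prior_strength.
        estimates[c["name"]] = (c["cases"] + a) / (c["n"] + prior_strength)
    return estimates


# Toy data (invented): counties C and E have no survey respondents at all.
counties = [
    {"name": "A", "cluster": 0, "cases": 30, "n": 300},
    {"name": "B", "cluster": 0, "cases": 2, "n": 10},
    {"name": "C", "cluster": 0, "cases": 0, "n": 0},
    {"name": "D", "cluster": 1, "cases": 45, "n": 150},
    {"name": "E", "cluster": 1, "cases": 0, "n": 0},
]
estimates = cluster_shrunk_prevalence(counties)
for name in sorted(estimates):
    print(name, round(estimates[name], 4))
```

In this sketch, a county with zero respondents receives its cluster's pooled prevalence, and a county with a small sample lands between its direct estimate and the cluster value. Restricting the pooling to counties with similar characteristics is what keeps dissimilar counties from pulling one another's estimates.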