Keywords: air pollution, dimension reduction, principal component analysis, missing data, latent variable model, spatial misalignment, universal kriging
Environmental studies often focus on the health effects of long-term air pollution exposure in human subjects. Pollutant concentrations are measured at regulatory monitoring sites, which usually differ from the locations of the study subjects. This spatial misalignment motivates a two-stage modeling approach with an exposure model and a health regression model. In addition, air pollution is often a mixture of many components with different health implications. Conventional approaches incorporate techniques such as principal component analysis (PCA) to obtain a lower-dimensional representation of the data. The recently developed predictive PCA modifies the optimization criterion to improve the exposure model. However, these approaches require complete data. Real-world data tend to have complex missingness patterns, including pollutants that are measured at relatively few locations and locations with many missing measurements. We propose a probabilistic version of predictive PCA that allows for flexible imputation and thus makes use of all available monitoring data. We demonstrate the performance of probabilistic predictive PCA with simulations and an analysis of multivariate air pollution data.