Supervised machine learning will be central to the analysis of upcoming large-scale sky surveys. However, selection bias for astronomical objects yields labelled training data that are not representative of the unlabelled target data distribution. This mismatch degrades predictive performance and makes predictions on the target data unreliable.
We propose a novel and statistically principled method to improve supervised learning under such covariate shift conditions, based on propensity score stratification, a well-established methodology in causal inference. Here the propensity score is the probability that an object is included in the labelled training set given its covariates. We train learners on subgroups ("strata") conditional on the propensity scores, leading to improved covariate balance and substantially reduced bias in the model fit.
We demonstrate that our general-purpose method has promising applications in observational cosmology, by improving upon existing conditional density estimation of galaxy redshift from Sloan Digital Sky Survey (SDSS) data, as well as improving classification of type Ia supernovae (SNe Ia), obtaining the best reported AUC (0.977) on the “SNe photometric classification challenge”. We discuss the embedding of such a classification into a full analysis of SNe data to estimate cosmological parameters.
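To make the stratification idea concrete, the following is a minimal sketch on synthetic data, not the implementation used in this work. Everything in it is an illustrative assumption: the one-dimensional covariate, the Gaussian source and target samples, the hand-written logistic regression used as the propensity model, the choice of five quantile-based strata, and the constant-per-stratum "learner". All function and variable names are hypothetical. With a single covariate and a monotone propensity model, stratifying on the score amounts to binning on the covariate itself; the approach is more interesting when the covariates are multi-dimensional.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ss, lr=0.5, steps=500):
    """Fit P(s=1 | x) = sigmoid(a + b*x) by gradient ascent on the mean log-likelihood."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, s in zip(xs, ss):
            resid = s - sigmoid(a + b * x)
            grad_a += resid
            grad_b += resid * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b


def assign_strata(scores, k):
    """Assign each score to one of k equal-sized strata (0..k-1) by rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    strata = [0] * len(scores)
    for rank, i in enumerate(order):
        strata[i] = k * rank // len(scores)
    return strata


def mean(values):
    return sum(values) / len(values)


def mse(pred, truth):
    return mean([(p - t) ** 2 for p, t in zip(pred, truth)])


# Synthetic covariate shift (assumed): labelled objects favour low x.
random.seed(0)
n, k = 300, 5
x_src = [random.gauss(-0.5, 1.0) for _ in range(n)]  # labelled source
x_tgt = [random.gauss(0.5, 1.0) for _ in range(n)]   # unlabelled target
y_src = [2.0 * x + random.gauss(0.0, 0.1) for x in x_src]
y_tgt = [2.0 * x + random.gauss(0.0, 0.1) for x in x_tgt]  # evaluation only

# Step 1: estimate propensity scores P(labelled | x) on the pooled data.
xs = x_src + x_tgt
ss = [1] * n + [0] * n
a, b = fit_logistic(xs, ss)
scores = [sigmoid(a + b * x) for x in xs]

# Step 2: stratify the pooled data on the estimated scores.
strata = assign_strata(scores, k)

# Step 3: per stratum, check covariate balance and fit a toy learner
# (here simply the mean label of the labelled objects in that stratum).
overall_gap = abs(mean(x_src) - mean(x_tgt))
stratum_gaps, stratum_models = [], {}
for j in range(k):
    src_idx = [i for i in range(n) if strata[i] == j]
    tgt_idx = [i for i in range(n) if strata[n + i] == j]
    gap = mean([x_src[i] for i in src_idx]) - mean([x_tgt[i] for i in tgt_idx])
    stratum_gaps.append(abs(gap))
    stratum_models[j] = mean([y_src[i] for i in src_idx])

# Step 4: predict each target object with the learner from its own stratum.
pred_strat = [stratum_models[strata[n + i]] for i in range(n)]
pred_naive = [mean(y_src)] * n  # same toy learner without stratification
mse_strat = mse(pred_strat, y_tgt)
mse_naive = mse(pred_naive, y_tgt)
print("overall covariate gap:", overall_gap)
print("largest within-stratum gap:", max(stratum_gaps))
print("target MSE, stratified vs. unstratified:", mse_strat, mse_naive)
```

In this toy setting, the gap between labelled and unlabelled covariate means within each stratum is smaller than the gap in the full sample, which is the covariate-balance property that stratification aims for. In practice the constant learner would be replaced by whatever regressor, classifier or density estimator the application calls for.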