Abstract:
|
Our science goal is to identify very faint galaxies at low redshift (where redshift serves as a proxy for distance from the observer). Given candidate galaxies, astronomers then aim to follow up with higher-resolution spectroscopy. Because these follow-up studies are expensive and limited in size by experimental design, there is a tradeoff between cost and recall. Beyond this cost-recall tradeoff, there are two statistical challenges: correctly classifying objects in imbalanced data with very few actual positives, and quantifying uncertainty in settings with low signal-to-noise ratio and degenerate solutions. In this work, we develop algorithm-agnostic, budget-aware strategies for selecting follow-up candidates, as well as strategies for data augmentation and nonparametric conditional density estimation for classification with imbalanced data. Although our main application is in astronomy, the proposed methods apply generally to detection problems, such as credit card fraud and medical diagnosis, that involve few actual positives and a limited budget (in money or time) for collecting new data.
|