Machine learning is now being used across the entire scientific enterprise. Researchers commonly use the predictions from random forests or deep neural networks in downstream statistical analysis as if they were observed data. We show that this approach can lead to extreme bias and uncontrolled variance in downstream statistical models. We propose a statistical adjustment to correct biased inference in regression models using predicted outcomes—regardless of the machine-learning model used to make those predictions.
This is also the first crack at a big open problem in statistics: what do we do with machine-learned outcomes? Covariates? Both? I think there is a ton here for (bio)statistics students to sink their teeth into as well!
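To make the bias claim concrete, here is a small simulation sketch. It is not the adjustment proposed in the work; it only illustrates the problem. The data-generating model, the shrinkage factor of 0.6, and the helper `ols_slope` are all assumptions made up for this illustration, with the "prediction" standing in for whatever a random forest or neural network might return.

```python
import numpy as np


def ols_slope(x, y):
    """Return the least-squares slope and its standard error for y ~ 1 + x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - 2)      # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)      # classical OLS covariance
    return beta[1], np.sqrt(cov[1, 1])


rng = np.random.default_rng(0)
n = 5000
true_slope = 2.0

# Simulated "truth": the outcome depends linearly on the covariate x.
x = rng.normal(size=n)
y = 1.0 + true_slope * x + rng.normal(size=n)

# Hypothetical stand-in for an ML prediction of y (an assumption, not a
# real model): shrunk toward the mean, plus some prediction noise.
y_pred = y.mean() + 0.6 * (y - y.mean()) + rng.normal(scale=0.3, size=n)

slope_obs, se_obs = ols_slope(x, y)
slope_pred, se_pred = ols_slope(x, y_pred)

print(f"observed outcomes:  slope={slope_obs:.3f}, se={se_obs:.4f}")
print(f"predicted outcomes: slope={slope_pred:.3f}, se={se_pred:.4f}")
```

Under these assumptions, the regression on predicted outcomes recovers a slope near 0.6 times the true one, and its reported standard error does not reflect that error at all, so the downstream analysis looks precise while being wrong.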