All Times EDT
Virtual
A Tailored Machine Learning Oracle to Guide Statistical Data Processing (308519)
Michael Frey, National Institute of Standards and Technology - Statistical Engineering Division
*Mary Gregg, National Institute of Standards and Technology - Statistical Engineering Division
Lucas Koepke, University of Colorado/National Institute of Standards and Technology
Keywords: Machine learning, data processing, Monte Carlo resampling, isotonic regression
Processing a data set before formal inference can be helpful. But more than one processing method may be available for the given data, requiring the analyst to choose the most appropriate procedure. The best choice may be unclear, because the relative performance of different methods can depend on unknown, circumstance-specific factors. We propose a novel methodology that combines machine learning (ML) with resampling to develop a statistical oracle to guide researchers. By Monte Carlo resampling from the given data set, we create training data for an ML algorithm that predicts, for each processing method, the error associated with the planned statistical analysis. These error predictions, tailored to the setting in question, allow the oracle to assess the efficacy of competing processing methods and to give a data-specific recommendation of the method most likely to reduce analysis error. We present preliminary results on the proposed oracle’s effectiveness in the setting of change-point estimation in isotonic regression, where the pool-adjacent-violators algorithm is available as a preparatory step prior to model inference.
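The workflow described above (resample, score each processing method, learn to predict the error, recommend) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation; the abstract does not specify the resampling scheme, the learner, the features, or the error measure. The sketch assumes a residual bootstrap around a pilot single-step fit, uses the absolute error in the change-point index as the analysis error, takes the spread of first differences as the only feature, and substitutes a one-feature least-squares line for the ML algorithm. The names `oracle`, `feature`, `fit_line`, and `METHODS` are hypothetical.

```python
import random
import statistics

def pava(y):
    """Pool-adjacent-violators: least-squares nondecreasing fit to y."""
    blocks = []  # each block is [sum, count]
    for v in y:
        blocks.append([v, 1])
        # Merge while the previous block mean exceeds the current block mean.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fit = []
    for s, n in blocks:
        fit.extend([s / n] * n)
    return fit

def change_point(y):
    """Index k in 1..n-1 of the least-squares single-step fit (y[:k] vs y[k:])."""
    n = len(y)
    best_k, best_sse = 1, float("inf")
    for k in range(1, n):
        left, right = y[:k], y[k:]
        m_left, m_right = sum(left) / k, sum(right) / (n - k)
        sse = (sum((v - m_left) ** 2 for v in left)
               + sum((v - m_right) ** 2 for v in right))
        if sse < best_sse:
            best_k, best_sse = k, sse
    return best_k

# Competing preparatory steps: no processing versus PAVA.
METHODS = {"raw": lambda y: list(y), "pava": pava}

def feature(y):
    """Assumed feature: spread of first differences (needs no known truth)."""
    return statistics.pstdev([b - a for a, b in zip(y, y[1:])])

def fit_line(xs, ts):
    """One-feature least-squares line; a stand-in for the ML learner."""
    mx, mt = statistics.fmean(xs), statistics.fmean(ts)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (t - mt) for x, t in zip(xs, ts)) / sxx if sxx > 0 else 0.0
    return lambda x: mt + slope * (x - mx)

def oracle(y, n_resamples=200, seed=0):
    """Recommend the method with the smallest predicted change-point error."""
    rng = random.Random(seed)
    n = len(y)
    k0 = change_point(y)  # pilot estimate, treated as the truth in resamples
    m_left, m_right = statistics.fmean(y[:k0]), statistics.fmean(y[k0:])
    signal = [m_left] * k0 + [m_right] * (n - k0)
    resid = [v - s for v, s in zip(y, signal)]
    feats, errs = [], {name: [] for name in METHODS}
    for _ in range(n_resamples):
        scale = rng.uniform(0.5, 1.5)  # assumed: vary noise so the feature varies
        y_star = [s + scale * rng.choice(resid) for s in signal]
        feats.append(feature(y_star))
        for name, prep in METHODS.items():
            errs[name].append(abs(change_point(prep(y_star)) - k0))
    x_obs = feature(y)
    predicted = {name: max(0.0, fit_line(feats, errs[name])(x_obs))
                 for name in METHODS}
    return min(predicted, key=predicted.get), predicted
```

Calling `oracle(y)` on a list of at least two observations returns the recommended method name together with the predicted error for each method. A richer feature set and a more flexible learner would be natural replacements for the stand-ins used here.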
