Online Program

Friday, May 18
Data Science
Data Science Platforms II
Fri, May 18, 3:30 PM - 5:00 PM
Grand Ballroom G

Using Microsoft ML Server and Spark for Distributed Computation of Massive Computational Experiments in Data Science and Statistical Inference (304697)

*Ali Zaidi, Microsoft AI and Research 

Keywords: distributed systems, spark, parallel computing, MCMC, mixture models, inference

The availability of faster processors, cheaper cloud servers, and ever-increasing amounts of data has enabled practitioners of data analysis to expand the capacity of their models and the speed of training and evaluation. However, notably absent from the recent surge of data tools is the ability to conduct massive experiments specifically for statistical inference in support of causal analysis and decision making. Indeed, a reader of a statistics textbook in decades past would likely find a passage stating: "[f]or statistics the central theme is statistical inference leading to experimental design and the determination of cause-effect relations" (Fraser, 1976), but a modern practitioner of the field may be puzzled as to how to ask inferential questions of their deep neural network. In this presentation, we examine modern tools for conducting experiments in the cloud for the purpose of statistical inference. We also examine how R and Python can be used to conduct massive distributed experiments, and how their reach can be extended with the distributed capabilities of ML Server and the Azure cloud infrastructure. Scalable and distributed implementations of the bootstrap, probabilistic inference using gradient-based samplers, and interpretable methods for deep neural networks will be presented in hands-on examples in both Python and R.

Fraser, D.A.S. 1976. "Probability and Statistics: Theory and Applications." Institute of Theoretical Statistics.
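
As an illustration of the "scalable and distributed implementations of the bootstrap" mentioned in the abstract, the following is a minimal sketch of how bootstrap replicates can be split across workers. It is not the presenter's code and does not use ML Server, Spark, or Azure APIs: it relies only on the Python standard library, and the function names (bootstrap_chunk, parallel_bootstrap, percentile_interval) and the seeding scheme are assumptions made for this example. Each worker receives its own seed, so the chunks are independent and the result is reproducible.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor


def bootstrap_chunk(data, n_reps, seed):
    """Draw n_reps bootstrap resamples of data and return the mean of each."""
    rng = random.Random(seed)  # separate, reproducible random stream per chunk
    n = len(data)
    # Each resample draws n observations from data with replacement.
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_reps)]


def parallel_bootstrap(data, n_reps, n_workers=4,
                       executor_cls=ThreadPoolExecutor, base_seed=0):
    """Split n_reps bootstrap replicates across n_workers and collect the means.

    executor_cls is a hypothetical hook for this sketch: any
    concurrent.futures executor class can be passed in.
    """
    # Divide the replicates as evenly as possible among the workers.
    sizes = [n_reps // n_workers + (1 if i < n_reps % n_workers else 0)
             for i in range(n_workers)]
    with executor_cls(max_workers=n_workers) as pool:
        futures = [pool.submit(bootstrap_chunk, data, size, base_seed + i)
                   for i, size in enumerate(sizes)]
        # Collect results in submission order so the output is deterministic.
        return [m for f in futures for m in f.result()]


def percentile_interval(values, alpha=0.05):
    """Return a simple percentile interval covering roughly 1 - alpha."""
    ordered = sorted(values)
    lower = ordered[int((alpha / 2) * len(ordered))]
    upper = ordered[int((1 - alpha / 2) * len(ordered)) - 1]
    return lower, upper


if __name__ == "__main__":
    sample = [random.Random(42).gauss(10.0, 2.0) for _ in range(500)]
    means = parallel_bootstrap(sample, n_reps=2000, n_workers=4)
    print(percentile_interval(means))
```

Because the replicates are independent, the same map-then-collect pattern carries over to process pools or cluster back ends; threads are used here only to keep the sketch self-contained, and for CPU-bound resampling a process-based executor would usually be the better choice.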