Online Program

Friday, May 18
Data Science
Data Science Foundations
Fri, May 18, 1:30 PM - 3:00 PM
Lake Fairfax B
 

A Paradigm for Research in Data Science (304648)

Presentation

David Donoho, Stanford University  
XY Han, Stanford 
Hatef Monajemi, Stanford University 
*Vardan Papyan, Stanford 
Qingyun Sun, Stanford 

Keywords: Deep learning; Computational experimentation; Data science

How does one do experimental research in data science? This is an important question for which there are, as yet, many scattered examples but no settled answer. We propose a standard experimental approach, which we refer to as XYZ studies, that makes it easy to think up, conduct, analyze, report, and share results.

In this new paradigm, a researcher first archives a set of datasets Y considered canonical for a certain task in a certain field and implements all relevant methods X. The same experiment is then run on every XY combination while varying some control parameters Z. The researcher collects some observables W at each parameter setting, analyzes their behavior, and reports the findings.
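The grid described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the toy datasets, the two methods, the control parameter (number of points used), and the function name `run_xyz` are all hypothetical stand-ins chosen so the example runs with the standard library alone.

```python
import itertools
import statistics

# Hypothetical stand-ins for canonical datasets Y.
datasets = {
    "toy_a": [1.0, 2.0, 3.0, 4.0],
    "toy_b": [10.0, 20.0, 30.0, 40.0],
}

# Hypothetical stand-ins for methods X. Each takes a dataset and a
# control value z (here: how many leading points to use).
def mean_method(data, z):
    return statistics.mean(data[:z])

def median_method(data, z):
    return statistics.median(data[:z])

methods = {"mean": mean_method, "median": median_method}

# Control parameter Z (assumed for illustration).
control_values = [2, 4]

def run_xyz(methods, datasets, control_values):
    """Run the same experiment on every (X, Y, Z) cell and record W."""
    results = []
    for (x_name, method), (y_name, data), z in itertools.product(
        methods.items(), datasets.items(), control_values
    ):
        w = method(data, z)  # observable W for this cell
        results.append({"X": x_name, "Y": y_name, "Z": z, "W": w})
    return results

results = run_xyz(methods, datasets, control_values)
for row in results:
    print(row)
```

Storing one flat record per (X, Y, Z) cell keeps the later analysis and reporting steps independent of how the experiments were run.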

The idea is motivated by scientific communities that are organized around certain XY combinations. Deep learning is one example: it is concerned with datasets Y={CIFAR, ImageNet, …} associated with competitions such as ILSVRC; models that have won such competitions X={AlexNet, ResNet, …}; control parameters Z={depth, learning rate, …} that one can tweak to squeeze out the best performance or to study theoretical questions; and observables W={accuracy, activations, …}.

Following the XYZ experiments, the researcher observes some phenomenon and proposes a formal hypothesis to explain its causes. The hypothesis is then studied in sandboxes, synthetic and possibly analytic in nature, that create either artificial datasets Y or methods X under parametric control Z. The researcher demonstrates that some control parameter Z of the sandbox correlates substantially with some observable W, possibly proving this theoretically, and then argues that the same explanation holds for the real data as well. The reproducible results of such experiments are then uploaded to an interactive website, which may lead to new findings, hypotheses, and further exploration.
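The sandbox step can be illustrated with a small synthetic example. This is a sketch under assumptions, not the procedure used by the authors: the artificial dataset is a noisy line, the control parameter Z is taken to be the noise level, the method is a least-squares slope fit through the origin, and the observable W is the root-mean-square residual. All function names are hypothetical.

```python
import math
import random

def make_synthetic(z_noise, n=200, slope=2.0, seed=0):
    """Artificial dataset Y: a line plus Gaussian noise of scale z_noise."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    xs = [i / n for i in range(1, n + 1)]
    ys = [slope * x + rng.gauss(0.0, z_noise) for x in xs]
    return xs, ys

def fit_slope(xs, ys):
    """Method X: least-squares slope for a line through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def rms_residual(xs, ys, slope):
    """Observable W: root-mean-square residual of the fitted line."""
    sq = sum((y - slope * x) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sq / len(xs))

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

# Sweep the control parameter Z and record the observable W.
z_values = [0.1, 0.5, 1.0, 2.0, 4.0]
w_values = []
for z in z_values:
    xs, ys = make_synthetic(z)
    w_values.append(rms_residual(xs, ys, fit_slope(xs, ys)))

r = pearson(z_values, w_values)
print("correlation between Z and W:", r)
```

In this toy sandbox the link between Z and W can also be argued analytically, since the residual scales with the noise level; that mirrors the abstract's suggestion that a sandbox correlation may be backed by a theoretical argument before it is carried over to real data.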

Ultimately, we deliver an understanding and a discipline of how to do feasible, focused and effective research in the new era of data science.