Online Program

Thursday, May 30
Data Science Technologies
Practice and Applications
Data Science Applications E-Posters, I
Thu, May 30, 3:00 PM - 4:00 PM
Grand Ballroom Foyer

ClusterJob, an Experiment Management System For Ambitious Data Science (306303)

*Bekk Blando, Clemson University 
David Donoho, Stanford University Department of Statistics  
Hatef Monajemi, Stanford University Department of Statistics  

Keywords: Ambitious Data Science, Painless Computing Stacks, Cloud Computing, Experiment Management System, Massive Computational Experiments

Massive Computational Experiments (MCEs) are emerging as a new avenue for solving complex problems in science and engineering. They often require running experiments in the cloud at a historically unprecedented scale, which can seem daunting to many researchers, largely because of the complexity of cloud computing and the difficulty of managing a large number of computational jobs. To address these issues, we present ClusterJob (CJ), an open-source Experiment Management System (EMS) developed at Stanford University to facilitate ambitious data science. Today, hundreds of computational researchers use CJ to conduct MCEs involving tens of thousands of computational jobs and millions of CPU hours. CJ abstracts away almost all the details of an experiment’s computation and storage requirements, allowing researchers to be ambitious with experimentation and to focus on developing bold research questions that can be settled empirically. Experiment management systems are fundamental tools in modern data science, making massive experimentation accessible to everyone in the scientific community. CJ provides a Command Line Interface (CLI) that lets users run their experiments on a remote compute cluster painlessly. Users define their experiment precisely in a main script. CJ then parallelizes this script, executes all the jobs, harvests all the results, and finally creates a reproducible package that contains the complete code and data of the experiment. This package is immutable and tied to a unique ID that we call a Package ID (PID). We will also present CJHub, which automatically archives the experiments run by CJ. CJHub relieves users of the burden of organizing and storing large amounts of data on their own. In addition, it lets users easily share their experiments with other CJ users via the unique PIDs of their computational packages.
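
The workflow described in the abstract (split a main script into jobs, run them, harvest the results, and seal everything into a package with a unique ID) can be illustrated with a toy Python sketch. This is not CJ's code or interface: the function names, the parameter grid, and the choice of a SHA-1 content hash as the package ID are all assumptions made for illustration, and the jobs run in a plain serial loop in place of a remote cluster.

```python
import hashlib
import itertools
import json


def expand_jobs(grid):
    """Turn a parameter grid into one parameter dict per job."""
    names = sorted(grid)
    return [dict(zip(names, values))
            for values in itertools.product(*(grid[n] for n in names))]


def run_job(params):
    """Hypothetical stand-in for one unit of work from the main script."""
    return params["n"] * params["scale"]


def run_experiment(grid, code_text):
    """Run every job, harvest the results, and bundle them into a package."""
    jobs = expand_jobs(grid)
    # Serial loop here; a real system would dispatch these to a cluster.
    results = [{"params": p, "result": run_job(p)} for p in jobs]
    package = {"code": code_text, "results": results}
    # Assumed ID scheme: hash of the canonical package contents, so the
    # same code and data always map to the same identifier.
    blob = json.dumps(package, sort_keys=True).encode("utf-8")
    package_id = hashlib.sha1(blob).hexdigest()
    return package_id, package
```

Calling `run_experiment({"n": [1, 2, 3], "scale": [10, 100]}, "main script text")` expands the grid into six jobs and returns a 40-character hexadecimal ID together with the package. Under this assumed scheme, any change to the code or the results yields a different ID, which is one simple way to make a package effectively immutable and shareable by ID.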