Keywords: Massive Computational Experiments, Experiment Management System, Big Data, Cloud Computing
Ambitious data science requires massive computational experimentation; in some fields, the entry ticket for a solid PhD is now a set of experiments consuming several million CPU hours. Traditional computing models, in which researchers use their laptops or campus-wide shared HPC clusters with limited compute resources and restrictive sharing policies, are often inadequate and poorly configured for experiments at the massive scale and varied scope that we see in modern data science. The advent of cloud computing, on the other hand, offers virtually unlimited computational resources that can be custom configured, providing a powerful medium for data-driven science.
The prospect of actually conducting massive computational experiments in the cloud -- as promising as it may seem in the abstract -- confronts the potential user with daunting challenges. A user schooled in interactive laptop-style computing sees a cloud experiment as a massive collection of moving parts, seemingly requiring a lengthy manual process that would simply wear the experimenter down, sapping the ability to think clearly and the desire to persevere. This is largely due to (i) the apparent complexity of today's cloud computing interfaces, (ii) the difficulty of executing and managing an overwhelmingly large number of computational jobs, and (iii) the difficulty of ensuring that the results of the experiments can be understood and/or reproduced by other independent scientists. Starting a massive experiment `bare-handed' is thus highly problematic. Users who try it often `burn out' and, despite the knowledge and skill gained, lose any desire to conduct further such experiments. In this article, we present several painless computing models that abstract away many of these difficulties, thereby allowing data scientists to effortlessly conduct massive and decisive scientific experiments in the cloud.