This short course covers the data science process using R as a programming language and Spark as a big-data platform. Powerful workflows are developed using the tidyr, dplyr, ggplot2, and sparklyr packages. Examples show how data is moved to and extracted from persistent data stores such as the Hadoop Distributed File System (HDFS), NoSQL databases, and relational databases. These data workflows extend to machine learning algorithms, model evaluation, and data visualization. TensorFlow for deep learning is introduced. Big-data architectures are discussed, including the Docker containers used to build the course infrastructure, rspark (https://github.com/jharner/rspark). Attendees can optionally install the Docker containers on their desktops or deploy them to Amazon Web Services (AWS) prior to the course (see the rspark repo).
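As a flavor of the kind of workflow the course develops, the sketch below connects R to Spark through sparklyr and runs a dplyr pipeline on a Spark-backed table. It is a minimal illustration, assuming the sparklyr and dplyr packages are installed and a local Spark installation is available (sparklyr::spark_install() can fetch one); the dataset used is R's built-in mtcars.

```r
# Minimal sparklyr/dplyr sketch; assumes sparklyr, dplyr, and a local
# Spark installation (run sparklyr::spark_install() once if needed).
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")    # connect to a local Spark instance

# Copy a built-in R data frame into Spark as a Spark DataFrame
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# dplyr verbs are translated to Spark SQL and executed in Spark;
# collect() brings the small summary result back into R
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()

spark_disconnect(sc)
```

The same pipeline syntax works whether the table lives in local memory, HDFS, or a relational store, which is what makes the dplyr/sparklyr combination attractive for big-data work.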
In recent years, the demand for machine learning experts has outpaced the supply, despite the surge of people entering the field. To address this gap, there have been big strides in the development of user-friendly machine learning software that can be used by non-experts. Although H2O has made it easier for practitioners to train and deploy machine learning models at scale, a fair bit of knowledge and background in data science is still required to produce high-performing machine learning models. Deep neural networks, in particular, are notoriously difficult for a non-expert to tune properly. In this course, we provide an overview of the field of "Automatic Machine Learning" and introduce the new AutoML functionality in H2O. H2O's AutoML provides an easy-to-use interface that automates the process of training a large, comprehensive selection of candidate models and a stacked ensemble model which, in most cases, will be the top-performing model on the AutoML Leaderboard. H2O AutoML is available in all the H2O interfaces, including the h2o R package, the Python module, and the Flow web GUI. We will also provide code examples to get you started using AutoML.
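A short sketch of the AutoML interface from R is given below. It is illustrative only: the file path "train.csv" and the column name "response" are placeholders, and running it assumes the h2o package is installed and that h2o.init() can start (or attach to) a local H2O cluster.

```r
# Minimal H2O AutoML sketch; "train.csv" and "response" are placeholders.
library(h2o)
h2o.init()                               # start or connect to a local H2O cluster

train <- h2o.importFile("train.csv")     # load training data into H2O
y <- "response"                          # target column (placeholder name)
x <- setdiff(names(train), y)            # all remaining columns as predictors

# Train up to 20 base models plus stacked ensembles
aml <- h2o.automl(x = x, y = y, training_frame = train, max_models = 20)

print(aml@leaderboard)                   # ranked models; ensembles usually lead
pred <- h2o.predict(aml@leader, train)   # predict with the best model
```

The leaderboard ranks all candidate models by a default cross-validated metric, and aml@leader gives direct access to the best one for prediction or deployment.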
Join your peers at a Cloudera-hosted short course to discuss data science needs across your organization. Machine learning and data science are all about the data, but the data is often out of reach for analytics teams working at scale.
Together we'll explore how to leverage powerful open-source tools to create a machine learning environment that balances data scientists' need for data access and flexible tooling with IT's need for security and governance. Cloudera Data Science Workbench enables fast, easy, and secure self-service data science in a collaborative environment.
Ultimately, you'll walk away prepared to find new value in your data and deliver it to your organization.
Shiny is an R package that makes it easy to build interactive web apps straight from R. You can host standalone apps on a webpage, embed them in R Markdown documents, or build dashboards. This short course will introduce you to the basics of building web applications with Shiny, the essentials of reactive programming, and how to customize and deploy your apps for others to use. Please bring a laptop with you to the course.
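To give a sense of what "building web apps straight from R" looks like, here is a minimal complete Shiny app, assuming only that the shiny package is installed; it uses R's built-in faithful dataset.

```r
# Minimal complete Shiny app: a slider drives a histogram reactively.
library(shiny)

ui <- fluidPage(
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 30),
  plotOutput("hist")
)

server <- function(input, output, session) {
  # renderPlot() is reactive: it re-executes whenever input$bins changes
  output$hist <- renderPlot({
    hist(faithful$eruptions, breaks = input$bins,
         main = "Old Faithful eruption durations")
  })
}

shinyApp(ui, server)   # launches the app in a browser when run interactively
```

The ui object declares the layout and inputs, while the server function describes how outputs react to those inputs; this separation is the core pattern the course builds on.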
The E-Poster session will take place from 5:45 p.m. to 6:45 p.m.