Short Courses
June 3–4, 3:00 p.m. – 6:30 p.m.
SC1 - Big Data, Data Science, and Deep Learning for Statisticians
Instructor(s): Ming Li, Amazon
With the recent big data, data science, and deep learning revolution, companies across the world are hungry for data scientists and machine learning scientists to bring actionable insight from the vast amount of data collected. In the past couple of years, deep learning has gained traction in many application areas and become an essential tool in the data scientist’s toolbox.
In this course, participants will develop a clear understanding of the big data cloud platform and technical skills in data science and machine learning. They will use hands-on exercises to understand deep learning. We will also cover the “art” part of data science and machine learning so participants learn the typical data science project flow, general pitfalls in data science and machine learning, and soft skills to effectively communicate with business stakeholders.
The big data platform, data science, and deep learning overviews are specifically designed for an audience with a statistics education background. This course will prepare statisticians to be successful data scientists and machine learning scientists in various industries and business sectors with deep learning as a focus. Please have a laptop available for hands-on sessions. No software download or installation is needed.
June 3, 3:00 p.m. – 6:30 p.m.
SC2 - Introduction to Programming Quantum Computers
Instructor(s): Mark Fingerhuth, Quantum Open Source Foundation and ProteinQure
Quantum computing isn’t science fiction anymore. IBM, D-Wave, and Rigetti all provide cloud access to their quantum processing units (QPUs). We will cover the basics of quantum computing and how to implement an algorithm on actual quantum hardware. The focus is on practical quantum computing rather than theory, using Rigetti’s Forest SDK, a set of Python libraries designed to interact with QPUs. Please have a laptop available. Participants will learn about the following:
- The notion of a quantum bit
- Different quantum computing architectures
- Various quantum logic operations and how to implement them in code
- Rigetti’s Python API to interact with the quantum device
- How to write and execute a quantum program
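As a rough illustration of the first and third topics above, the sketch below simulates a single quantum bit and two standard gates in plain Python. It is a classical simulation under stated assumptions, not the Forest SDK and not course material: the helper names `apply_gate` and `probabilities` are hypothetical and chosen only for this example.

```python
import math

# A single-qubit state is a pair of complex amplitudes for |0> and |1>.
ZERO = (1 + 0j, 0 + 0j)

S = 1 / math.sqrt(2)
HADAMARD = ((S, S), (S, -S))   # maps |0> to an equal superposition
PAULI_X = ((0, 1), (1, 0))     # the quantum analogue of a NOT gate


def apply_gate(gate, state):
    """Multiply a 2x2 gate matrix by a 2-component state vector (hypothetical helper)."""
    (g00, g01), (g10, g11) = gate
    a, b = state
    return (g00 * a + g01 * b, g10 * a + g11 * b)


def probabilities(state):
    """Born rule: each outcome's probability is its squared amplitude magnitude."""
    return tuple(abs(amp) ** 2 for amp in state)


superposition = apply_gate(HADAMARD, ZERO)          # roughly 50/50 between 0 and 1
flipped = apply_gate(PAULI_X, ZERO)                 # |0> becomes |1>
back_to_zero = apply_gate(HADAMARD, superposition)  # Hadamard undoes itself

print("H|0>  :", probabilities(superposition))
print("X|0>  :", probabilities(flipped))
print("HH|0> :", probabilities(back_to_zero))
```

On real hardware the amplitudes are never read directly; a program is run many times and the measured 0/1 counts estimate these probabilities.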
SC3 - How to Create a Development Environment for Reproducible Research
Instructor(s): Brian Lee Yung Rowe, Pez.AI
Winston Churchill observed that “we shape our buildings, and afterwards our buildings shape us.” The same is true of our development environment, which shapes our development process. Ad hoc and unstructured environments lead to unstructured processes that are difficult to reproduce. This short course uses the instructor’s crant toolkit and shows how to use Docker, git, make, and other tools to create a development environment optimized for reproducible research. At the end of the course, you’ll be able to create a reusable environment that automates testing, packaging, report generation, and more. You’ll also learn how to incorporate notebooks into your development process without sacrificing reproducibility.
SC4 - Recommendation Systems and Reinforcement Learning for Data Scientists
Instructor(s): Ying Lu, Google; Wutao Wei, Twitter
We all hear about data science technology. What is data science? How does data science change the world around us? This short course introduces a set of practical data science technologies, with a focus on experimentation, recommendation systems, and reinforcement learning. We will talk about how these core technologies help build a great product. By the end of the course, participants should have a clear understanding of various data science technologies and their applications. Both lectures and lab exercises will be offered.
June 4, 3:00 p.m. – 6:30 p.m.
SC5 - Building Advanced Computer Vision Models Using SAS Software
Instructor(s): Robert Winston Blanchard, SAS
Deep learning is an area of machine learning that has become ubiquitous in computer vision. The complex, brain-like structure of deep learning models is used to find intricate patterns in large volumes of data. These models have greatly improved performance in general supervised learning, time series analysis, speech recognition, object detection and classification, and sentiment analysis.
Computer vision technologies are being used in new applications across industries to solve both familiar and unfamiliar problems. For some tasks, computer vision models have surpassed human accuracy.
In this workshop, participants will learn the pivotal aspects of a deep learning model (it’s not all about the hidden layers), learn the building blocks of a convolutional neural network, and discover how to apply a computer vision model to solve image classification tasks and an object detection task. The importance of recent advancements in the field of computer vision will be discussed during the model-building process.
Demonstrations will use SAS Cloud Analytic Services (CAS) to take advantage of the in-memory distributed environment. CAS provides a fast and scalable environment to build complex models and analyze big data by using algorithms designed for multithreaded parallel processing. Graphics processing units (GPUs) are leveraged for larger models demonstrated in this session.
SC6 - Data Science Workflows Using R and Spark
Instructor(s): Jim Harner, West Virginia University
R is a flexible, extensible statistical computing environment, but by default it is limited to single-core execution. Spark is a distributed computing environment that treats R as a first-class programming language. This course introduces data structures in R and their use in functional programming workflows relevant to data science.
The course covers the initial steps in the data science process: extracting data from source systems; transforming data into a tidy form; and loading data into distributed file systems, distributed data warehouses, and NoSQL databases (i.e., ETL).
These R-based workflows are illustrated by using dplyr directly and as a frontend to SQL databases. The sparklyr package with its dplyr interface to Spark is then used for modeling big data using regression and classification supervised learning methods. Unsupervised learning methods, such as clustering and dimension reduction, are also covered. Finally, methods for analyzing streaming data are presented.
Student accounts are provided to allow attendees to interactively run the R Markdown content in Amazon’s cloud (AWS). The computing infrastructure and content are containerized, which allows the complete course environment to be downloaded and run on Docker-supported laptops.
June 4, 3:00 p.m. – 5:00 p.m.
SC7 - Visualizing Big Data
Instructor(s): Leland Wilkinson, H2O.ai and University of Illinois at Chicago
Big data sets (many rows, many columns, many items, ...) present special problems for visualization. Even when trying to plot simple rectangular data sets, we encounter complexity (many functions are polynomial or exponential in rows or columns), the curse of dimensionality (distances approach a constant as dimensionality heads toward infinity), choke points (data bus or network bandwidth), and limited display resolution (even with megapixel displays). This workshop covers recent strategies that exploit aggregation and projection to reduce data sets to manageable proportions. It also covers graphic representations most suitable for exploring multivariate data.
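Two of the ideas above can be illustrated with a minimal sketch, assuming uniformly random points in the unit cube. It is not taken from the workshop, and the function names `distance_spread` and `bin_2d` are hypothetical. The first function shows distances concentrating as dimensionality grows; the second shows aggregation, replacing individual rows with counts on a coarse grid that fits the display.

```python
import math
import random
from collections import Counter


def distance_spread(n_points, n_dims, rng):
    """Relative contrast (max - min) / min of distances from one point to all others."""
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    query = points[0]
    dists = [math.dist(query, p) for p in points[1:]]
    return (max(dists) - min(dists)) / min(dists)


def bin_2d(points, n_bins):
    """Aggregate (x, y) points in the unit square into an n_bins x n_bins grid of counts."""
    counts = Counter()
    for x, y in points:
        col = min(int(x * n_bins), n_bins - 1)  # clamp so x near 1.0 stays in range
        row = min(int(y * n_bins), n_bins - 1)
        counts[(row, col)] += 1
    return counts


rng = random.Random(0)

# Curse of dimensionality: the contrast between near and far neighbors shrinks.
for dims in (2, 20, 200, 2000):
    print(f"dims={dims:5d}  relative contrast={distance_spread(200, dims, rng):.3f}")

# Aggregation: 100,000 rows reduce to at most 50 * 50 = 2,500 cells to draw.
cloud = [(rng.random(), rng.random()) for _ in range(100_000)]
grid = bin_2d(cloud, 50)
print("rows:", len(cloud), "non-empty cells:", len(grid))
```

The number of cells depends on the chosen grid resolution rather than on the number of rows, which is why binning scales to displays with a fixed pixel budget.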