Short Courses
SDSS 2024 offered full-day and half-day short courses. Short courses are ticketed events that require an additional fee.
8:30 a.m. – 5:30 p.m.
SC1 – Probabilistic Programming and Bayesian Computing with PyMC (Beginner to Intermediate)
Instructor(s): Chris Fonnesbeck, Philadelphia Phillies
Bayesian statistical methods offer a powerful set of tools to tackle a wide variety of data science problems. In addition, the Bayesian approach generates results that are easy to interpret and automatically account for uncertainty in quantities we wish to estimate or predict. Historically, computational challenges have been a barrier, particularly to new users, but there now exists a mature set of probabilistic programming tools that is both capable and easy to learn. We will use the newest release of PyMC (version 5), but the concepts and approaches you will learn are portable to any probabilistic programming framework.
This course is intended for practicing and aspiring data scientists and analysts who want to learn how to apply Bayesian statistics and probabilistic programming to their work. It will provide learners with a high-level understanding of Bayesian statistical methods and their potential for use in a variety of applications. Learners will also gain hands-on experience with applying these methods using PyMC, specifically including the specification, fitting, and checking of models applied to a couple of real-world data sets.
This is an introductory course, so no direct experience with PyMC or Bayesian statistics is expected. However, to get the most from the tutorial, learners should have some familiarity with basic statistical modeling (e.g., regression and estimation) and core components of the scientific Python stack (e.g., NumPy, pandas, and Jupyter). For those with no Python experience, pre-conference tutorials will be posted to get you up and running.
This tutorial will be presented with Jupyter notebooks, allowing participants to run examples and exercises on their own computers. A GitHub repository will be available two weeks prior to SDSS with instructions for setting up the Python environment to run the tutorial locally.
As the goal of the tutorial is to get new users up and running with Bayesian methods, the content will be light on theory and focus on the implementation of models, though some statistical background will be provided for context and clarity. Since PyMC is a high-level statistical package, it is easy to gloss over important details of the underlying algorithms. Therefore, we will begin by solving a simple model using only NumPy and SciPy functions. As a capstone to the tutorial, learners will be introduced to “The Bayesian Workflow” to reiterate the important steps in the process, along with useful tips and tricks.
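To give a sense of what fitting a simple model with only NumPy and SciPy might look like, here is a minimal sketch of a grid approximation to a posterior distribution. It is not taken from the course materials: the data (7 successes in 20 trials), the flat prior, the grid resolution, and all variable names are assumptions chosen purely for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical data, made up for illustration: 7 successes in 20 trials.
successes, trials = 7, 20

# Grid of candidate values for the unknown success probability.
theta = np.linspace(0.001, 0.999, 999)

# Flat prior (an assumption) and binomial likelihood evaluated on the grid.
prior = np.ones_like(theta)
likelihood = stats.binom.pmf(successes, trials, theta)

# Unnormalized posterior, then normalize so the grid weights sum to 1.
unnormalized = prior * likelihood
posterior = unnormalized / unnormalized.sum()

# Posterior mean and a 94% equal-tailed interval read off the grid CDF.
posterior_mean = float(np.sum(theta * posterior))
cdf = np.cumsum(posterior)
lower = float(theta[np.searchsorted(cdf, 0.03)])
upper = float(theta[np.searchsorted(cdf, 0.97)])

print(f"posterior mean: {posterior_mean:.3f}")
print(f"94% interval: ({lower:.3f}, {upper:.3f})")
```

With a flat prior, this particular model has a known closed-form posterior, Beta(8, 14), so the grid result can be checked against the exact mean of 8/22. A probabilistic programming tool such as PyMC automates this kind of computation for models where a grid is impractical.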
SC2 – Effective Graphics for Visual Communication with Data (Beginner to Intermediate)
Instructor(s): Susan VanderPlas, University of Nebraska-Lincoln; Kelly Bodwin, California Polytechnic State University; Emily Robinson, California Polytechnic State University
This course will focus on strategies for creating data visualizations that make it easy for collaborators to gain insight from data. We will discuss different ways graphics are used during the analysis process but primarily focus on graphics used to communicate with nonstatisticians: managers, stakeholders, and collaborators who may need to use graphics to make decisions and/or motivate changes. This course will also touch on topics such as accessibility and alt-text, which are essential to ensuring graphics meet regulatory requirements.
Learners should be familiar with plotting packages such as base R graphics, ggplot2, seaborn, and/or matplotlib, but this is not a “how to make graphics” course. Code for different plotting libraries will be provided and modified during the course.
This course is intended for practicing statisticians in industry, government, or academia who are responsible for communicating results of statistical analyses and data to nonstatisticians. Attendees should be able to read data in, clean it, and visualize it using the language of their choice.
Examples will be provided using ggplot2 code in R and seaborn or matplotlib in Python. We can assist students with R and Python code and will attempt to help with other languages; however, we do not promise familiarity with every programming language in common use for data science.
8:30 a.m. – 12:30 p.m.
SC3 – Generative AI Fundamentals and Implementation of LLMs from the Ground Up (Intermediate to Advanced)
Instructor(s): Ginger Holt, Databricks
This course provides an introduction to generative AI (GAI), including terminology, applications, essential considerations for adopting GAI in an organization, and the evaluation of potential risks and challenges associated with using or adopting GAI.
The course will also cover the foundations of large language models (LLMs) in detail. You will learn the innovations that led to the proliferation of transformer-based architectures, from encoder models (BERT) to decoder models (GPT) to encoder-decoder models (T5). You will also learn about the recent breakthroughs that led to applications like ChatGPT. You will gain an understanding of the latest advances that continue to improve LLM functionality, including FlashAttention, ALiBi, and parameter-efficient fine-tuning (PEFT) methods such as LoRA.
Experience with Python will be assumed.
Note: This course will include many demos with an option to execute code on your own computer.
1:30 p.m. – 5:30 p.m.
SC4 – Julia for Data Science (Beginner to Intermediate)
Instructor(s): Josh Day, Julia Computing
Julia is a relatively new language for technical and scientific computing. Its design offers greater speed and productivity than the more established languages in the data science space, R and Python.
In this course, tailored to those with a familiarity with data science and programming in a high-level language such as R, Python, or MATLAB, you’ll learn Julia from the ground up, starting with the basics and ending with the core packages in Julia’s data science ecosystem. Importantly, you’ll learn how Julia’s language features can change the way you approach solving problems. You’ll also learn how to ingest and clean tabular data, fit models, visualize data, and perform other analytics tasks using a variety of Julia packages.