Short Courses
SDSS 2025 will offer one full-day and four half-day courses on Tuesday. Short courses are ticketed events that require an additional fee.
Full-Day Course
8:30 a.m. – 5:30 p.m.
Time Series Applications with a Machine Learning Framework (Beginner to Intermediate)
Instructor(s): Sean McMannamy, Graceland University
This course is an introduction to time series and its uses with a machine learning framework, including terminology, applications, and how they combine. The course will begin with a theoretical introduction to time series analysis and machine learning, presented for a general audience. The course will then dive into the applications of time series analysis using machine learning models and their evaluation methods.
This course is intended for practicing and aspiring data scientists and analysts who want to learn how to apply time series and machine learning to their work. It will provide learners with an understanding of time series analysis, basic machine learning methods, and their potential for use in a variety of applications. Learners will also gain hands-on experience with applying these methods using R, specifically including the components of time series, machine learning models, feature selection, and model evaluation.
No experience with the topic will be expected. However, to benefit maximally from the tutorial, learners should have some familiarity with basic statistical modeling in R.
This tutorial will be presented with R markdown files, allowing participants to run examples and exercises on their own computers. A GitHub repository will be available two weeks prior to SDSS with instructions for setting up the R environment to run the tutorial locally.
Morning Half-Day Courses
8:30 a.m. – 12:30 p.m.
Building Containerized Applications for Data Science (Beginner to Intermediate)
Instructor(s): J. Alex Hurt, University of Missouri
Ensuring data science applications and libraries can be used in a wide variety of computing environments is crucial. Containerization offers a standardized and reproducible way to deploy data science applications. From individual laptops and workstations to public cloud computing infrastructure to on-premise compute clusters, containerized applications can be deployed reliably.
In this course, researchers and practitioners will learn to leverage containerization to build portable container images of their data science applications that can be shared and deployed around the world. While this course will use the Docker container runtime, the concepts are applicable to all container runtimes.
Each step of the deployment will be covered, including an introduction to containerization, building container images, building and pushing custom containers to image registries, and sharing custom container images from open-source container registries.
This is a beginner course, and therefore no previous experience with Docker or containerization is needed. Some familiarity with basic Linux commands and Git would be helpful.
This course will be taught using a combination of lecture slides for introducing concepts and Jupyter Notebooks for hands-on applications of the concepts. Attendees will only need a laptop with a web browser. All materials used will be published to GitHub following the course so attendees can refer to them later.
Introduction to Interpretable Machine Learning Using SHAP, GINI, and LIME (Beginner to Intermediate)
Instructor(s): Debarshi Datta, Florida Atlantic University
Humans are challenged to understand and retrace the decision-making process of artificial intelligence solutions as AI models advance across industries and research organizations. Interpretable AI, which involves techniques and approaches designed to make AI models’ decision-making processes comprehensible to humans, is useful for tackling such a challenge. Data scientists and machine learning experts can develop more transparent and reliable models by incorporating explanation mechanisms into these systems. This enhanced clarity benefits various stakeholders, including developers, regulatory bodies, and end users. Interpretable ML methods are crucial for understanding and explaining complex models used in data science.
In this course, I will explore the definition of an explainable AI/ML model, highlight its significance, and illustrate its objectives and benefits. Following that, I will look at SHAP (SHapley Additive exPlanations), GINI, and LIME (Local Interpretable Model-agnostic Explanations)—three powerful tools that help practitioners interpret the inner workings of ML models and understand the impact of individual features on model predictions.
Explainable AI principles enhance transparency, allow stakeholders to grasp how models make decisions, and foster fairness and trust. These models must treat everyone equally, including individuals from protected groups (defined by race, religion, gender, disability, or ethnicity). The models must be confident and robust, capable of handling noise, uncertainty, and unforeseen circumstances. By leveraging interpretable ML techniques, we can identify biases in our models and work toward improving their fairness.
This course is intended for data scientists and analysts who want to understand how to interpret black-box models such as ensemble models, decision trees, and random forests. Participants will grasp how these tools can decipher model behavior and enhance transparency, especially in applications where decision-making must be justified.
SHAP values provide consistent and accurate feature attribution. GINI helps assess feature importance (often used in decision trees), and LIME helps interpret individual predictions by approximating a complex model locally with a simpler one. Through the course, participants will acquire hands-on experience by using these methods to work through real-world examples.
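As a rough illustration of the Gini idea mentioned above (not taken from the course materials), the sketch below computes Gini impurity and the impurity decrease produced by a single split, which is the quantity tree-based feature importance is typically built from. The toy data, the feature names, and the threshold of 5.0 are all assumptions made for this example.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def impurity_decrease(values, labels, threshold):
    """Drop in Gini impurity when splitting samples on values <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted_children = (
        (len(left) / n) * gini_impurity(left)
        + (len(right) / n) * gini_impurity(right)
    )
    return gini_impurity(labels) - weighted_children


# Hypothetical toy data: two features and a binary label.
labels = [0, 0, 0, 1, 1, 1]
feature_a = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]  # separates the classes at 5.0
feature_b = [1.0, 7.0, 2.0, 8.0, 3.0, 9.0]  # only weakly related to the label

decrease_a = impurity_decrease(feature_a, labels, 5.0)
decrease_b = impurity_decrease(feature_b, labels, 5.0)
print(f"feature_a: {decrease_a:.3f}, feature_b: {decrease_b:.3f}")
```

A feature whose splits remove more impurity, like feature_a here, would rank as more important under this measure. SHAP and LIME take different approaches and are not shown in this sketch.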
Experience with SHAP, GINI, or LIME is not necessary; however, attendees must possess basic knowledge of ML models to maximize the tutorial’s benefits. Familiarity with core Python data science tools such as NumPy, pandas, and Jupyter Notebook is essential.
The tutorial will be presented in Jupyter Notebook, enabling participants to follow along, execute examples, and finish exercises independently. A GitHub repository will be available after completion of the workshop, providing instructions for setting up the Python environment and required packages.
This tutorial focuses on practical, hands-on learning rather than in-depth theory. I intend to start with basic examples to understand how each method works, then proceed to more sophisticated real-world models where interpretation is crucial. Upon completing the tutorial, participants will understand when and how to use SHAP, GINI, and LIME to build interpretable machine-learning models and gain valuable insights into the efficient application of these techniques.
Afternoon Half-Day Courses
1:30 p.m. – 5:30 p.m.
Accelerating Data Science Workflows with Kubernetes (Intermediate to Advanced)
Instructor(s): J. Alex Hurt, University of Missouri
As the age of big data continues to mature, the amount of compute required to process the vast amounts of data collected and analyzed each day continues to grow. Data scientists, therefore, must adapt their workflows to meet the demands of these larger data sets and big data problems. To that end, both the National Science Foundation and commercial cloud vendors have invested millions of dollars into building large-scale compute clusters for data processing. To use these clusters for data science, applications and libraries can be containerized and then deployed via Kubernetes.
In this course, attendees will be introduced to Kubernetes, including its architecture and design, core concepts, and the fundamentals of the Kubernetes client. Additionally, attendees will perform hands-on activities on an NSF-funded Kubernetes cluster, including creating persistent storage, spawning a pod, and creating a job. Finally, we will discuss automating the job creation process using Python to further accelerate data science workflows.
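To give a sense of what automating job creation with Python might look like, here is a minimal sketch that generates Kubernetes Job manifests as plain dictionaries. It is not taken from the course materials: the helper make_job_manifest, the image name, the command, and the persistent volume claim name are all hypothetical, and only the Python standard library is used.

```python
import json


def make_job_manifest(name, image, command, pvc_name):
    """Build a batch/v1 Job manifest as a plain dict (hypothetical helper)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,  # do not retry failed pods in this sketch
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "worker",
                            "image": image,
                            "command": command,
                            "volumeMounts": [
                                {"name": "data", "mountPath": "/data"}
                            ],
                        }
                    ],
                    "volumes": [
                        {
                            "name": "data",
                            "persistentVolumeClaim": {"claimName": pvc_name},
                        }
                    ],
                }
            },
        },
    }


# One Job per data chunk; every name below is an assumption for illustration.
manifests = [
    make_job_manifest(
        name=f"process-chunk-{i}",
        image="example.org/my-analysis:latest",
        command=["python", "process.py", "--chunk", str(i)],
        pvc_name="my-data-pvc",
    )
    for i in range(3)
]

# Show one manifest as JSON to inspect the generated structure.
print(json.dumps(manifests[0], indent=2))
```

Each manifest could then be written to a file and submitted to a cluster, for example with kubectl apply -f, which accepts JSON as well as YAML.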
This is an intermediate course, and prior experience with containerization is required. Additionally, some familiarity with Python 3, basic Linux commands (ls, cd, etc.), and Git would be helpful.
This course will be taught using a combination of lecture slides for introducing concepts and Jupyter Notebooks for hands-on applications of the concepts. Attendees will only need a laptop with a web browser. All materials used for the course will be published to GitHub following the course so attendees can refer to them later.
Integrating Large Language Models in Introductory Data Science Courses (Beginner to Intermediate)
Instructor(s): Jeanne McClure, North Carolina State University; Sunghwan Byun, North Carolina State University; Joe Faith, Harding University; Zarifa Zakaria, North Carolina State University; Matthew Ferrell, North Carolina State University
In this short course, we invite participants who would like to enhance their pedagogical content knowledge around applying large language models in introductory data science courses for novice programmers from a wide range of disciplinary backgrounds. We will discuss real-world experiences of instructors incorporating LLMs such as ChatGPT, Copilot, and Gemini to support student learning. This session will focus on designing student support through prompting techniques in chatbots and using GenAI assistant tools embedded in IDEs. We will also discuss ways to engage students in critically examining LLM outputs for accuracy and efficiency.
This half-day course is suited for educators who teach data science or statistics courses with a coding element and are curious about integrating LLMs but may feel hesitant due to potential complexities. This course aims to offer a balanced view, highlighting both successes and obstacles.
Whether you’re looking to start using these tools or just want to learn more about their application in education, this course will provide nuanced insights and concrete examples to incorporate LLMs into introductory courses.
Some familiarity with IDEs and LLMs is assumed.