All Times ET
Tuesday, February 1
Tue, Feb 1
10:00 AM - 5:30 PM
Virtual
SC01 - Essential Communication and Collaboration
Short Course (full day)
Instructor(s): Ilana A. Trumble, LISA-University of Colorado Boulder and CU Anschutz; Eric Vance, LISA-University of Colorado Boulder
Statisticians and data scientists must communicate and collaborate with domain experts from many different fields in academia, business, and government. Learning more effective communication and collaboration skills will enable us to maximize our professional impact in these areas. In this short course, participants will learn and practice essential skills that will enable them to improve their communication and collaboration to add more value to their projects, customers, and organizations. We introduce the ASCCR framework that describes our current best practices for five aspects of statistical consulting and collaboration (Attitude-Structure-Content-Communication-Relationship). We will focus especially on the communication skills of asking great questions; listening, paraphrasing, and summarizing; and explaining statistics to non-statisticians to create shared understanding with our clients and collaborators. Participants will practice these skills via team exercises, role-plays, video coaching, and individual reflections to become more effective communicators and collaborators, enabling them to have greater impact in their roles as statisticians and data scientists.
Outline & Objectives
Our objective is to help participants improve their communication and collaboration skills so they can achieve greater impact. This short course will be useful for all levels from beginning to advanced. Prerequisites are a desire to improve one’s personal effectiveness and openness to try new methods and ways of thinking in the practice of statistics and data science.
1 Welcome, team assignments, and warm-up exercises
2 Introduction to ASCCR Frame
3 Attitude of effective collaboration (checklist and exercise)
4 POWER structure (Prepare-Open-Work-End-Reflect) produces effective meetings
5 Best practices for opening meetings (Eric and Heather mock role play, video review, then participants role play)
6 Q1Q2Q3 approach (reflection exercise)
7 Triangle of Statistical Communication
a. Asking Great Questions (participant role play)
b. Listening, Paraphrasing, Summarizing (video clip review)
c. Explaining Statistics to Non-statisticians (video clip and role play)
d. Creating Shared Understanding
8 Strengthening Relationships (reflection exercise)
9 Best practices for ending meetings (participants role play)
10 Individual plans for improving communication and collaboration.
About the Instructor
For the past 13 years, Dr. Eric Vance has been the director of LISA (Laboratory for Interdisciplinary Statistical Analysis) where he has trained 285 statisticians and data scientists to move between theory and practice to collaborate with 9700+ domain experts to apply statistics and data science to answer their research, business, or policy questions. He has taught workshops and webinars on collaboration in nine countries, including several in collaboration with Heather Smith at CSP and JSM. This workshop gets better every time they teach it.
Heather Smith has 30 years of experience consulting with academic, industrial, service, and government clients in the United States, Europe, and Asia. She began this work as a statistical consultant at Westat, Inc. For 23 years she has been a faculty member in the Statistics Department at Cal Poly San Luis Obispo where she consults with academic and private sector researchers and teaches a wide variety of applied statistics courses, including courses in statistical communication and consulting. She has offered over a dozen workshops, short courses, and webinars on these topics, and has trained hundreds of statistical collaborators.
Relevance to Conference Goals
This short course is relevant to Themes 1 and 4. Participants will learn new skills and practical tips to apply whenever they interact with other people. Participants will explicitly learn how to better communicate and collaborate with their clients and customers. Skills learned in the course will equip participants to have a positive impact on their organization and an upward career trajectory. Participants will return to their jobs with new ideas, techniques, and strategies to improve their ability to communicate and collaborate effectively, resulting in a greater impact on their organizations and increasing the overall impact of statistics and data science.
A version of this course was taught at the 2018 CSP, where it received a high average rating of 4.63 out of 5 (n=8 responding out of 22 participants). The official qualitative feedback we received: “This course is essential for any statistician who needs to collaborate with people in other disciplines, or sell their business to clients. I very strongly recommend it.” Unofficial feedback was very positive as well. A version of this course was also taught at the 2020 CSP, but we don’t recall receiving any official feedback.
Tue, Feb 1
10:00 AM - 5:30 PM
Virtual
SC02 - Hands-On Introduction to Python in Predictive Analytics and Machine Learning
Short Course (full day)
Instructor(s): Mei Najim, The University of Chicago
This introductory course provides a hands-on introduction to Python, the well-known open-source programming language for analytics. We will start with an introduction to Jupyter Notebook and Python basics, then cover the most popular data science libraries (NumPy and Pandas), data visualization libraries (Matplotlib and Seaborn), and the machine learning library scikit-learn.
We will introduce a Predictive Analytics Life Cycle Process through a case study to methodically expose attendees to best practices and Python’s rich set of data science libraries, providing hands-on experience and know-how. Lastly, we will use the course material to develop a predictive model from raw data (data TBD). Python code will be provided.
Outline & Objectives
1. Introduction to Jupyter Notebook and Python Basics
2. Introduction to Data Science Libraries: NumPy and Pandas
3. Introduction to Data Visualization and Interactive Data Visualization Libraries: Matplotlib, Seaborn, Plotly, and Cufflinks
4. Introduction to the Predictive Analytics Life Cycle Through a Case Study, Using the Machine Learning Library scikit-learn
a). Exploratory Data Analysis and Data Pre-Processing
b). Supervised Learning: Regression (Linear, Multiple Linear, Polynomial Regression, Decision Tree, and Random Forest)
c). Supervised Learning: Classification (KNN, Logistic Regression, Decision Tree, and Random Forest)
d). Unsupervised Learning: K-Means Clustering and Principal Component Analysis (Dimensionality Reduction)
5. Using all of the above to develop a predictive model (data TBD): starting from exploratory analysis of the raw data through data visualization, data preparation, feature engineering, and model building (using Logistic Regression, Decision Tree, and Random Forest), with model performance evaluation
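The classifiers in item 4c can be illustrated without any libraries. As a minimal sketch (not course material — the course itself uses scikit-learn's implementations), here is k-nearest neighbors in plain Python on hypothetical toy data:

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = sorted(
        (math.dist(xi, x), yi) for xi, yi in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy data: two well-separated clusters with labels "a" and "b"
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(train_X, train_y, (0.5, 0.5)))  # → a
print(knn_predict(train_X, train_y, (5.5, 5.5)))  # → b
```

The scikit-learn equivalent, `KNeighborsClassifier`, adds efficient neighbor search and distance weighting, but the voting logic is the same.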
About the Instructor
Mei Najim teaches Programming for Analytics (R & Python) part time at The University of Chicago. She has 16 years of hands-on analytics experience in claim management, underwriting, pricing, reserving, and catastrophe risk management in the insurance industry, as well as collections analytics in the banking industry. Since 2007, she has led predictive analytics projects at various levels to build analytics capabilities for financial organizations, and she has frequently presented at conferences to share her expertise. Mei holds a BS in Actuarial Science from Hunan University and two MS degrees, one in Applied Mathematics and the other in Statistics, from Washington State University. She is a member of the American Statistical Association and a Certified Specialist in Predictive Analytics (CSPA) of the Casualty Actuarial Society.
Relevance to Conference Goals
The objective is to provide attendees with practical knowledge of using Python to analyze data and carry a predictive analytics project through its full life cycle, applying state-of-the-art statistical methods and machine learning algorithms.
Tue, Feb 1
10:00 AM - 5:30 PM
Virtual
SC03 - Real-World Data and Evidence: An Interdisciplinary Approach and Applications to Precision Medicine and Healthcare
Short Course (full day)
Instructor(s): Jie Chen, Overland Pharma; Tze Leung Lai, Stanford University
Real world data and evidence (RWD&E) have been increasingly used in drug development and regulatory decision-making since the passage of the 21st Century Cures Act in December 2016 and the issuance of the FDA’s RWE framework in December 2018. Whereas pharmaceutical companies use RWD&E to support clinical development activities and to seek evidence to inform health technology assessment (HTA) decisions, the healthcare community uses RWD&E to develop guidelines and decisions to support medical practice and to assess treatment patterns, costs, and outcomes of interventions. Although high-performance computing tools and artificial intelligence and machine learning algorithms have been readily applied to RWD, there are still substantial challenges in deriving RWE from RWD and in using the RWE in drug development and healthcare decision-making. This short course aims to provide the audience with practical interdisciplinary approaches and applications using RWD&E in product development, regulatory decision-making, and healthcare delivery, with case studies given throughout the presentation.
Outline & Objectives
Course learning objectives: The audience will learn the commonly used as well as cutting-edge decision-analytics approaches that are tailored for specific questions in product development, and regulatory and healthcare decision-making. Case studies are given throughout the presentation of the short course to illustrate the applications of the methods.
1. Introduction
2. Real World Data
3. Statistical and Machine Learning Methods for Healthcare Decision Analysis
4. Disease Diagnosis, Patient Heterogeneity and Adherence
5. Health Technology and Health Economic Assessment
6. Risk Models and Outcome Prediction
7. Benefit-Risk Assessment
8. Causal Inference Using Real World Data
9. Analysis of Data Generated from Mobile Devices
10. Public Health Surveillance and Pharmacovigilance
11. Real World Data to Support Clinical Development
12. Pragmatic Trials and Comparative Effectiveness Research (CER) Trials
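To give a flavor of topic 8, causal inference with real-world data often relies on propensity-score methods. The following is a minimal, illustrative Python sketch of inverse-probability weighting on a hypothetical toy dataset (not material from the course): each arm is reweighted by the inverse of its estimated treatment probability given a single binary confounder, and the weighted means are compared.

```python
# Hypothetical observational records: (treated, confounder, outcome)
data = [
    (1, 1, 10.0), (1, 1, 11.0), (1, 0, 8.0),
    (0, 1, 9.0), (0, 0, 6.0), (0, 0, 7.0), (0, 0, 6.5),
]

def p_treat(c):
    """Estimate P(treated | confounder = c) from the data itself."""
    group = [t for t, cc, _ in data if cc == c]
    return sum(group) / len(group)

# Inverse-probability weighting: reweight each arm to the full population
num_t = sum(y / p_treat(c) for t, c, y in data if t == 1)
den_t = sum(1 / p_treat(c) for t, c, y in data if t == 1)
num_c = sum(y / (1 - p_treat(c)) for t, c, y in data if t == 0)
den_c = sum(1 / (1 - p_treat(c)) for t, c, y in data if t == 0)
ate = num_t / den_t - num_c / den_c  # weighted treatment effect estimate
print(round(ate, 2))  # → 1.5
```

In practice the propensity model would be a regression on many confounders, and weights would be stabilized or trimmed; this sketch only shows the core weighting idea.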
About the Instructor
Presenters' background:
1. Tze Leung Lai, PhD: Ray Lyman Wilbur Professor of Statistics and of Biomedical Data Science in the School of Medicine and of the Institute for Computational & Mathematical Engineering (ICME) in the School of Engineering, Stanford University, Director of Financial and Risk Modeling Institute, and Co-Director of Center for Innovative Study Design at the Stanford School of Medicine, IMS and ASA Fellow.
2. Jie Chen, PhD: Senior Vice President and head of Biometrics, Overland Pharmaceuticals and a visiting member of the Center for Innovative Study Design, Stanford University.
3. Richard Baumgartner, PhD: Sr. Principal Scientist with Biometrics Research Department, Biostatistics and Research Decision Sciences (BARDS), Merck and Co.
Relevance to Conference Goals
This short course will present best statistical practices in the areas of real-world data and evidence to support drug development and regulatory decision-making.
Tue, Feb 1
10:00 AM - 1:30 PM
Virtual
SC04 - Skills for Statistical Writing: Tips and Tricks for Improving Written Communication
Short Course (half day)
Instructor(s): Emily Griffith, North Carolina State University; Julia Sharp, Colorado State University; Zachary Weller, Colorado State University
Effective writing is an essential skill for statistical practitioners, yet it is a skill that is often overlooked in coursework due to the need to stay up to date on the latest statistical methodology. This course will provide participants the opportunity to think critically about the writing process and learn about principles and best practices for statistical writing. Participants will improve their writing skills through participation in writing exercises and will be given the opportunity to receive feedback on their writing. The course will address topics such as organizing and streamlining, reducing clutter, best practices for peer review, and statistical aspects of writing such as alternatives to using the term “statistically significant”. The course will engage participants through discussion and short exercises of editing and reviewing writing samples.
Outline & Objectives
Outline:
- Introduction (20 min)
- Introductory Lecture (1 hr): instructors share how they work through the writing process, best practices, and principles of effective writing
- Discussion, Questions, and Conversation (15 min)
- Break (10 min)
- Mini-lectures with exercises (1 hr 45 min, approximately 25 minutes each): [1] organization and streamlining, with exercises in building outlines and telling the story; [2] peer review: what a peer review should look like, with an exercise in reviewing writing samples and a peer-review checklist; [3] reducing clutter, with exercises on a checklist of steps for reducing clutter and paced, productive, and powerful writing; [4] statistical aspects of writing, such as avoiding “statistically significant,” with an exercise in rewriting passages
- Closing Discussion (15 min)
Objectives: [1] Improve participants’ confidence and skills in written communication through examples and discussion. [2] Give participants the opportunity to get feedback on their own writing and learn best practices for giving feedback on the writing of others. [3] Provide participants with tips, tricks, and resources for improving their writing and reviewing skills.
About the Instructor
The three instructors (Dr. Zach Weller, Dr. Julia Sharp, Dr. Emily Griffith) for this course have extensive statistical collaboration expertise and PhDs in Statistics. All three instructors have successfully published and peer-reviewed numerous papers in both statistics and applied science journals. The instructors have also been involved in grant writing as both principal investigators and collaborating statisticians.
Relevance to Conference Goals
This short course will increase participants’ confidence in written communication by providing them resources and feedback on the writing and review process. The course will improve participants' skills through short lectures on writing topics followed by exercises and discussion.
Tue, Feb 1
10:00 AM - 1:30 PM
Virtual
SC05 - Using Design of Experiments (DOE) in Industry
Short Course (half day)
Instructor(s): Theodoro Koulis, Genentech; Tony Pourmohamad, Genentech
Design of experiments (DOE) remains the gold standard for the design and development of industrial applications. DOEs can increase efficiency and provide valuable experimental information that may be used to improve industrial processes. Despite its valuable contributions to various industries, there are many misconceptions about DOE. This course is geared toward applied practitioners who may not be aware of the strengths and benefits of factorial designs. The course includes real datasets and examples from the biotechnology industry. Course participants will be able to apply the lessons learned to design more efficient experiments in their own domains.
Outline & Objectives
Outline: The course covers fundamental design concepts and presents a simple approach to the design and analysis of multi-factor screening designs. Participants will learn how to design, conduct and analyze multi-factor experiments. No prior statistical training is assumed.
Objectives: The course will cover the following topics
- One-at-a-time vs multi-factor experiments
- Feasible space, design space and center points
- Factorial, fractional factorial, Plackett-Burman designs, and projectability
Participants will be able to design their own multi-factor experiments and analyze the data using simple techniques. The course will use the JMP statistical software; participants can use the 30-day free trial version of JMP.
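The factorial designs in the topic list are easy to sketch in code. Although the course uses JMP, the following illustrative Python snippet builds a 2^3 full factorial in coded units, a half fraction from the defining relation I = ABC, and appended center points:

```python
from itertools import product

# Full 2^3 factorial in coded units (-1 = low, +1 = high)
full = list(product([-1, 1], repeat=3))

# Half fraction 2^(3-1) with defining relation I = ABC:
# keep only the runs where A*B*C = +1
half = [run for run in full if run[0] * run[1] * run[2] == 1]

# Center points (0, 0, 0) are commonly appended to check for curvature
design = half + [(0, 0, 0)] * 3
print(len(full), len(half), len(design))  # → 8 4 7
```

The half fraction confounds each main effect with a two-factor interaction (e.g., A with BC), which is the trade-off screening designs make for fewer runs.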
About the Instructor
Theo Koulis obtained his PhD in Statistics from the University of Waterloo in Canada. His professional interests include: computational statistics, design of experiments, and statistical consulting. Theo is a Senior Statistician in Nonclinical Biostatistics at Genentech, Inc. supporting CMC (chemistry manufacturing and control) statistics activities. For over 7 years, Theo has supported manufacturing development at Genentech and has gained practical experience designing and implementing experiments in the biotechnology industry. Once a quarter, Theo teaches a Design of Experiments course that is geared towards specific needs of scientists and engineers working in the biotechnology industry.
Relevance to Conference Goals
The course is designed with the applied statistical practitioner in mind. The course will use real world data and examples in order to showcase the benefits of using DOEs in industry. Although the data generated from DOEs can be analyzed using simple techniques, the designed experiments can be used to generate rich and informative datasets. In addition, the course will showcase the JMP DOE toolset, which facilitates the design and analysis of DOEs.
Tue, Feb 1
2:00 PM - 5:30 PM
Virtual
SC06 - Equity and Bias in Algorithms: A Discussion of the Landscape and Techniques for Practitioners
Short Course (half day)
Instructor(s): Emily Hadley, RTI International Center for Data Science
With the growing use of algorithms in many domains, considerations of algorithmic bias and equity have far-reaching implications for society. A developing body of literature highlights the negative impact that biased algorithms can have on individual lives, while new resources provide opportunities for practicing statisticians and data scientists to better incorporate equity into our own work.
In this course, we review the landscape of equity and bias in algorithms. We take a deep dive into specific decision points related to bias and equity throughout the algorithm process, including problem framing, collecting data, completing analyses, and detecting and mitigating bias, and we discuss specific techniques that statisticians and data scientists can use to address these challenges. Attendees will evaluate tools and approaches relevant to their own work. Group discussion is a key component of this course.
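As a concrete taste of the bias-detection step, one widely used check compares favorable-prediction rates across groups (demographic parity). A minimal Python sketch on hypothetical predictions, not course material:

```python
def positive_rate(preds):
    """Share of predictions that are the favorable outcome (1)."""
    return sum(preds) / len(preds)

# Hypothetical binary predictions (1 = favorable outcome) by group
preds_by_group = {
    "group_a": [1, 1, 0, 1, 0, 1, 1, 0],
    "group_b": [0, 1, 0, 0, 1, 0, 0, 0],
}

rates = {g: positive_rate(p) for g, p in preds_by_group.items()}
# Demographic parity difference: gap between the highest and lowest rates
dp_diff = max(rates.values()) - min(rates.values())
print(round(dp_diff, 3))  # → 0.375
```

Demographic parity is only one of several competing fairness criteria (others condition on the true outcome, as equalized odds does), which is precisely why the decision points this course covers matter.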
Outline & Objectives
About the Instructor
Emily Hadley is a Research Data Scientist with the RTI International Center for Data Science. Her work spans several practice areas including health, education, social policy, and criminal justice. She has experience with machine learning, natural language processing, agent-based modeling, and predictive analytics, with a strong interest in antiracism, bias, and equity in data science. Emily holds a Bachelor of Science in Statistics with a second major in Public Policy Studies from Duke and a Master of Science in Analytics from NC State.
Relevance to Conference Goals
Tue, Feb 1
2:00 PM - 5:30 PM
Virtual
SC07 - Regression-Style Modeling with Variable Selection and Reduction
Short Course (half day)
Instructor(s): Clay Barker, SAS Institute / JMP Division; Ruth Hummel, SAS Institute / JMP Division
Variable Selection is a crucial step in the model building process, whether we are building a predictive model or trying to understand the results of a designed experiment. Generalized Regression modeling provides a single framework for doing interactive variable selection and fitting generalized linear models. This workshop will start with a brief overview of the generalized linear model for modeling responses that are not necessarily normally distributed. We will also introduce variable selection techniques, including stepwise methods like Forward Selection and penalized regression methods like the Lasso. We close the workshop with examples featuring both observational and experimental data and a variety of response types.
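The Lasso mentioned above selects variables by soft-thresholding: shrinking coefficients toward zero and setting small ones exactly to zero. Although the workshop uses JMP's Generalized Regression platform, the mechanism can be sketched in a few lines of Python (illustrative only; in the special case of an orthonormal design, the lasso solution is just the least-squares coefficients passed through the soft-threshold operator):

```python
def soft_threshold(z, t):
    """Soft-thresholding operator, the core update in lasso coordinate descent."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

# Hypothetical least-squares coefficients and penalty level
ols_coefs = [2.5, -0.3, 0.8, -1.7]
lam = 0.5
lasso_coefs = [soft_threshold(b, lam) for b in ols_coefs]
print([round(b, 2) for b in lasso_coefs])  # → [2.0, 0.0, 0.3, -1.2]
```

The second coefficient is set exactly to zero, which is how the lasso drops a variable from the model; ridge regression, by contrast, only shrinks and never zeroes.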
Outline & Objectives
About the Instructor
Dr. Clay Barker is a Senior Research Statistician Developer with JMP (a division of SAS) on a variety of statistical platforms in JMP, including Generalized Regression, Fit Curve and Clustering. He earned his doctorate in statistics from North Carolina State University. He holds several patents, including one for his work on implementing new visualizations for interactive model building in generalized regression.
Dr. Ruth Hummel is an Academic Ambassador with JMP (a division of SAS), supporting the technical needs of professors and instructors who use JMP for teaching and research. Dr. Hummel is a coauthor of Business Statistics and Analytics in Practice, 9th edition (2018), and has been teaching and consulting about statistics and analytics for over a decade, at the University of Florida, at the US Environmental Protection Agency, and now at SAS/JMP. She has a PhD in Statistics from The Pennsylvania State University.
Relevance to Conference Goals
Wednesday, February 2
Wed, Feb 2
10:00 AM - 11:00 AM
Virtual
GS1 - Keynote Address
General Session
Wed, Feb 2
10:00 AM - 5:30 PM
Virtual
Exhibits Open
Exhibits
Wed, Feb 2
11:00 AM - 12:30 PM
Virtual
CS01 - Individual Development
Concurrent Session
Chair(s): Margaret Betz, Purdue University
Wed, Feb 2
11:00 AM - 12:30 PM
Virtual
CS02 - Study Effectiveness
Concurrent Session
Chair(s): C. Christina Mehta, Emory University
Wed, Feb 2
11:00 AM - 12:30 PM
Virtual
CS03 - Mixed Models
Concurrent Session
Chair(s): Mahbubul Hasan, The Learner Data Institute
Wed, Feb 2
11:00 AM - 12:30 PM
Virtual
CS04 - Communication with Nonstatisticians
Concurrent Session
Chair(s): Julia Sharp, Colorado State University
Wed, Feb 2
11:00 AM - 12:30 PM
Virtual
CS05 - WITHDRAWN: Mid-Career Assessment
Concurrent Session
Chair(s): Allison Florance, Novartis
Wed, Feb 2
12:30 PM - 1:30 PM
Virtual
PS1 - Poster Session 1
Poster Session
Wed, Feb 2
1:30 PM - 3:00 PM
Virtual
CS06 - Cost-Efficient Design
Concurrent Session
Chair(s): Rachel S Rogers, GlaxoSmithKline
Wed, Feb 2
1:30 PM - 3:00 PM
Virtual
CS07 - Longitudinal Analysis
Concurrent Session
Chair(s): Li-Hsiang Lin, Louisiana State University
Wed, Feb 2
1:30 PM - 3:00 PM
Virtual
CS08 - Communication in the Workplace
Concurrent Session
Chair(s): Mike Jadoo, BLS
Wed, Feb 2
1:30 PM - 3:00 PM
Virtual
CS09 - Business Leadership
Concurrent Session
Chair(s): Lester Kirchner, Geisinger
Wed, Feb 2
1:30 PM - 3:00 PM
Virtual
CS10 - Study Design with External Information
Concurrent Session
Chair(s): Julian David Chan, Weber State University
Wed, Feb 2
3:00 PM - 4:00 PM
Virtual
PS2 - Poster Session 2
Poster Session
Wed, Feb 2
4:00 PM - 5:30 PM
Virtual
CS12 - Visual Communication
Concurrent Session
Chair(s): Allison Florance, Novartis
Wed, Feb 2
4:00 PM - 5:30 PM
Virtual
CS13 - Professional Skill Set for Statistics
Concurrent Session
Chair(s): Luis Miguel Mestre, Indiana University
Wed, Feb 2
4:00 PM - 5:30 PM
Virtual
CS14 - Design for Forecasting and Projection
Concurrent Session
Chair(s): Mike Jadoo, BLS
Wed, Feb 2
4:00 PM - 5:30 PM
Virtual
CS15 - Generalizability in Biostatistics
Concurrent Session
Chair(s): Seema Sangari, Kennesaw State University
Thursday, February 3
Thu, Feb 3
9:00 AM - 11:00 AM
Virtual
PCD1 - Quantifying Data Disclosure Risk with the R Package SDCNway
Practical Computing Demo
Instructor(s): Tom Krenzke, Westat; Jianzhu Li, Westat
In past decades, federal agencies and other data collectors have devoted enormous effort to protecting data confidentiality, ensuring that the data products they release do not enable the identification of individuals or entities given the information previously released to the public. Evaluating and controlling disclosure risk has become an indispensable and crucial step in disseminating statistical products or results. In this Practical Computing Demonstration, we lay out a general process for data disclosure risk assessment and estimation and review several ways to quantify risk at the record level or the file level. The theories under which these approaches were developed rest on specific models and assumptions; however, those assumptions may not hold when the approaches are applied to survey data, and the approaches also have limitations in their coverage of risk. This demonstration will give participants the tools to conduct a risk assessment. We will discuss practical issues one may encounter in risk estimation and provide guidance and insights on how to set up risk assessments and make decisions in this process. The R package SDCNway will be demonstrated through exercises with survey data. Knowledge of R is not necessary to participate; however, those familiar with R will have the opportunity to use the tool in some exercises.
Outline & Objectives
The objective is to provide the basis for participants to have the ability to use a practical tool for an area of growing importance – data disclosure risk estimation. The outline for the demonstration is as follows:
• Overview and background of confidentiality protection
o General process of risk assessment
o Theories on quantifying risk
o Informed data treatments under statistical confidentiality protection
o Types of disclosure
o Risk estimation focuses on identity and attribute disclosure
o Intruder attacks
o Database reconstruction theorem
• General process of risk assessment
o Example of risk estimation process
• Risk quantification
o Matching to external data
o Use of risk metrics
- File and individual risk using risk metrics
• Log-linear modeling – Skinner and Shlomo (2008)
• Exhaustive tabulations
- Caveats
o Relevance to differential privacy
• Practical Issues from case study examples
o How do we select variables?
o Is only checking indirect identifiers enough?
o Several other issues and topics will be discussed
• Software
o Probabilistic record linkage
o Re-identification risk estimation
- R Package SDCNway
• Demonstration and exercises
• Summary
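The simplest record-level risk notion the outline touches on, before model-based metrics such as the Skinner and Shlomo (2008) log-linear approach, is counting sample uniques on quasi-identifier combinations. An illustrative Python sketch on hypothetical microdata (SDCNway itself implements the model-based metrics):

```python
from collections import Counter

# Hypothetical microdata: (age band, sex, 3-digit ZIP) as quasi-identifiers
records = [
    ("30-39", "F", "208"),
    ("30-39", "F", "208"),
    ("40-49", "M", "208"),
    ("40-49", "M", "191"),
]

# Cell size = number of records sharing a quasi-identifier combination;
# records in cells of size 1 ("sample uniques") carry higher re-id risk
cell_sizes = Counter(records)
sample_uniques = [r for r, n in cell_sizes.items() if n == 1]
print(len(sample_uniques), "of", len(records), "records are sample-unique")  # → 2 of 4
```

Sample uniqueness alone overstates risk (a sample unique may be common in the population), which is why the demonstration's model-based metrics estimate population uniqueness rather than stopping here.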
About the Instructor
Jianzhu (Jane) Li is a senior statistician with over 15 years of experience in survey research and statistical confidentiality. Before joining the Westat staff, she was a research assistant in the Joint Program in Survey Methodology at the University of Maryland and completed internships at NCHS and the National Cancer Institute. Her dissertation research focused on the adaptation of diagnostics for linear models to make them appropriate for survey data. Dr. Li has experience in many aspects of survey research, including sample design, nonresponse adjustment, imputation, variance estimation, and data confidentiality and disclosure protection.
Tom Krenzke is a Vice President and Associate Director in Westat’s Statistics and Evaluation Sciences Unit, with about 30 years of experience in statistical confidentiality, survey sampling, and estimation techniques. He develops new statistical capabilities, including software for statistical disclosure control, nonresponse bias analysis, area sampling, and imputation. Mr. Krenzke is a Fellow of the American Statistical Association (ASA), District 2’s vice chair on ASA’s Council of Chapters, and was the 2020 chair of ASA’s Committee on Privacy and Confidentiality.
Relevance to Conference Goals
This practical computing demonstration will give participants opportunities to learn new statistical methodologies and best practices relating to data disclosure risk assessment. It will help statisticians apply the SDCNway tool in their jobs and learn how to communicate the results along with their caveats and limitations. Quantifying risk provides data-driven evidence rather than relying on judgment alone, positively affecting the data dissemination process and striking a better balance between disclosure risk and data utility.
Thu, Feb 3
9:00 AM - 11:00 AM
Virtual
PCD2 - Survival Analysis Using Stata
Practical Computing Demo
Instructor(s): Chuck Huber, StataCorp LLC
The first half of this workshop introduces the concepts and jargon of survival analysis, including time-to-event data and different kinds of censoring, as well as graphical, nonparametric, semiparametric, and parametric methods for modeling survival data. I then demonstrate how to use Stata's stset command to declare the features of a survival dataset, how to use Stata's st commands to fit models for survival data, and how to use margins, marginsplot, and stcurve to visualize the results of these models. The second half of the workshop covers advanced topics such as Cox regression with categorical and continuous time-varying covariates, Cox regression for interval-censored data, and shared-frailty models.
Outline & Objectives
1. Introduction to Survival Data and Censoring using stset
2. Nonparametric Estimation
a). Incidence Rates using stir
b). Life Tables using ltable
c). Kaplan-Meier Graphs/Tables using sts graph and sts list
d). Log-Rank Test using sts test
3. Semiparametric Estimation
a). Cox Proportional-Hazards Regression using stcox
4. Parametric Estimation
a). Exponential, Weibull, and Gamma Models using streg
5. Cox Regression with Time-Varying Covariates
a). Categorical Time-Varying Covariates using stcox
b). Continuous Time-Varying Covariates using stcox, tvc()
6. Cox Regression for Interval-Censored Data using stintcox
7. Shared-Frailty Models
a). Cox Proportional-Hazards Model using stcox, shared()
b). Parametric Regression Models using streg, shared() frailty()
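The Kaplan-Meier estimator that sts graph and sts list report is simple to compute by hand. For readers without Stata, here is an illustrative pure-Python sketch of the product-limit estimate on hypothetical data:

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate; events[i] = 1 for failure, 0 for censoring."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for tt, e in data if tt == t and e == 1)
        leaving = sum(1 for tt, _ in data if tt == t)  # failures + censored at t
        if deaths:
            surv *= 1 - deaths / n_at_risk  # multiply by conditional survival
            curve.append((t, surv))
        n_at_risk -= leaving
    return curve

# Hypothetical follow-up times; 1 = event observed, 0 = censored
times = [2, 3, 3, 5, 8, 8, 9]
events = [1, 1, 0, 1, 1, 0, 0]
print(kaplan_meier(times, events))
```

Censored subjects leave the risk set without contributing a drop in the curve, which is exactly the distinction between failure and censoring that the workshop's stset discussion formalizes.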
About the Instructor
Chuck Huber is Director of Statistical Outreach at StataCorp and Adjunct Associate Professor of Biostatistics at both the Texas A&M School of Public Health and the New York University
School of Global Public Health. He produces instructional videos for the Stata YouTube channel, writes blog posts, develops online NetCourses and gives talks about Stata at conferences and universities around the world. Most of his current work is focused on statistical methods used by behavioral and health scientists. He has published in the areas of neurology, human and animal genetics, alcohol and drug abuse prevention, nutrition, and birth defects. Dr. Huber currently teaches survey sampling at New York University and introductory biostatistics at Texas A&M where he previously taught categorical data analysis, survey sampling, and statistical genetics.
Relevance to Conference Goals
This workshop will provide an applied introduction to survival analysis for statisticians who are new to the topic. We will introduce the study designs, data management tasks, implementation, and analysis strategies that are unique to survival analysis, and we will show how to produce tables and graphs to communicate the results.
Thu, Feb 3
9:00 AM - 11:00 AM
Virtual
PCD3 - Methods and Applications of Finite Mixture Models, with Computing Demonstrations Using the R Package ‘mixtools’
Practical Computing Demo
Instructor(s): Derek S. Young, University of Kentucky, Department of Statistics
Finite mixture models are used to model data where the observations are sampled from a population consisting of several homogeneous subpopulations, often called the components of the population, but where the subpopulation to which each observation belongs is unknown. Thus, estimation of mixture components is an unsupervised learning task. Practically speaking, we make a soft probabilistic classification of each observation to a component, whereas cluster analysis performs a hard classification to a cluster, so finite mixture models are naturally used as the underlying models for model-based clustering. They can also be used for density estimation and viewed as a kind of kernel method.
Applications that use finite mixture models are found in nearly every field. Of course, the availability of computational tools and resources is crucial to doing analysis with mixture models. The R package ‘mixtools’ is a leading software package in this respect. This highly cited package (1,100+ citations) has been used to analyze diverse research questions involving quasar data, maize production, clustering of patients with leukemia, subpopulations of individuals with autism and schizophrenia, and certain color variants on manta rays.
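To make the soft-classification idea above concrete, here is a minimal sketch of the EM algorithm for a two-component Gaussian mixture in plain Python. It is illustrative only: the simulated data, the initialization scheme, and all function names are our own assumptions, and packages such as ‘mixtools’ implement this far more generally and robustly.

```python
import math
import random

def norm_pdf(x, mu, sigma):
    """Density of a Normal(mu, sigma) distribution at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_two_gaussians(data, n_iter=200):
    """Fit a two-component Gaussian mixture by EM (illustrative sketch)."""
    pi = 0.5                              # mixing proportion of component 1
    mu1, mu2 = min(data), max(data)       # crude initialization from the data range
    s1 = s2 = (max(data) - min(data)) / 4
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to component 1
        resp = []
        for x in data:
            p1 = pi * norm_pdf(x, mu1, s1)
            p2 = (1 - pi) * norm_pdf(x, mu2, s2)
            resp.append(p1 / (p1 + p2))
        # M-step: responsibility-weighted updates of all parameters
        n1 = sum(resp)
        n2 = len(data) - n1
        pi = n1 / len(data)
        mu1 = sum(r * x for r, x in zip(resp, data)) / n1
        mu2 = sum((1 - r) * x for r, x in zip(resp, data)) / n2
        s1 = max(math.sqrt(sum(r * (x - mu1) ** 2 for r, x in zip(resp, data)) / n1), 1e-6)
        s2 = max(math.sqrt(sum((1 - r) * (x - mu2) ** 2 for r, x in zip(resp, data)) / n2), 1e-6)
    return pi, (mu1, s1), (mu2, s2)

# Simulated data: two well-separated normal subpopulations
random.seed(1)
data = [random.gauss(0, 1) for _ in range(200)] + [random.gauss(5, 1) for _ in range(200)]
pi, comp1, comp2 = em_two_gaussians(data)
```

The `resp` values are exactly the soft probabilistic classifications described above; rounding them to 0 or 1 would recover the hard assignments of ordinary cluster analysis.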
Outline & Objectives
This PCD will address topics of finite mixture modeling relevant to practitioners. It will be specifically designed to engage researchers and practitioners from industry and government. Each topic presented will be highlighted by relevant examples, as well as a computing demonstration using the instructor’s R package ‘mixtools.’ Attendees will be encouraged to download the package and come prepared to work through some real data exercises. The outline of the PCD is as follows:
0. Intro to finite mixture models and the ‘mixtools’ package
1. Gaussian mixture models
2. Parametric and semiparametric mixtures of regressions
3. Other parametric mixture models
4. Determining the number of components
5. Visualizing estimated mixture models
6. Open discussion: What is needed?
Relevant examples that will be presented include the following: an analysis of quasar data, modeling biomarkers for Diffuse Large B-Cell Lymphoma (DLBCL), an analysis of the propagation rate of viral infection in potato plants, and identifying subgroups of strategies in a psychological task.
About the Instructor
The instructor received his PhD in Statistics from Penn State University in 2007, where his research focused on computational aspects of novel finite mixture models. He subsequently worked as a Senior Statistician for the Naval Nuclear Propulsion Program (Bettis Lab) for 3.5 years and then as a Research Mathematical Statistician for the US Census Bureau for 3 years. He then joined the faculty of the Department of Statistics at the University of Kentucky in the fall of 2014, where he is currently a tenured Associate Professor.
While at the Bettis Lab, the instructor engaged with engineers and nuclear regulators, often regarding the calculation of tolerance intervals. While at the Census Bureau, the instructor wrote several methodological and computational papers, many as the sole author. Since being at the University of Kentucky, the instructor has further progressed his research agenda in finite mixture modeling, zero-inflated modeling, and tolerance regions. The instructor has also received a highly-competitive grant from the Chan Zuckerberg Initiative for the 2021 calendar year.
The instructor has extensive teaching experience at all levels of education, including continuing education courses (which were taught at the Bettis Lab). He was a Lecturer (remotely) for Penn State University from Spring 2008 to Fall 2013, where he taught a masters-level regression methods course.
Relevance to Conference Goals
This PCD will primarily relate to goals in Theme 3, with some secondary components related to Theme 4. The content of this PCD will inform the attendees as to best practices when using finite mixture models. This will include, but is not limited to, formulating hypotheses about the presence of subpopulations in the application at hand, understanding at a high level when mixture models should be used, gaining insight into the algorithms used for performing maximum likelihood estimation, understanding limitations of estimating mixture models, and gaining proficiency with tools available in the ‘mixtools’ package. Improved data visualizations are currently being developed for and included in the ‘mixtools’ package, so users will be exposed to state-of-the-art visualizations for the package. Interpretations of the estimated mixture models presented in this PCD will be thoroughly emphasized, as the practitioners in attendance will almost certainly have to communicate their estimated models to non-statisticians.
Thu, Feb 3
9:00 AM - 11:00 AM
Virtual
PCD4 - State-of-the-Art with Statistical Tolerance Regions: Methods and Applications, with Computing Demonstrations Using the R Package ‘tolerance’
Practical Computing Demo
Instructor(s): Kedai Cheng, UNC-Asheville
Statistical tolerance intervals of the form (1-α, P) provide bounds that capture at least a specified proportion P of the sampled population with a given confidence level 1-α. The quantity P is called the content of the tolerance interval, and the confidence level 1-α reflects the sampling variability. Statistical tolerance intervals are ubiquitous in regulatory documents, especially regarding design verification and process validation. Examples of such regulations are those published by the Food and Drug Administration (FDA), the Environmental Protection Agency (EPA), the International Atomic Energy Agency (IAEA), and standard 16269-6 of the International Organization for Standardization (ISO). Research and development in the area of statistical tolerance intervals has undoubtedly been guided by the needs and demands of industry experts.
Some of the most germane biopharmaceutical applications of tolerance intervals include their use in quality control of drug products, setting process validation acceptance criteria, establishing sample sizes for process validation, and assessing biosimilarity. Tolerance intervals are available for numerous parametric distributions, and procedures also exist for regression models, mixed-effects models, and multivariate settings (i.e., tolerance regions). Nonparametric procedures are also available and commonly employed.
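To make the (1-α, P) definition concrete, here is a small Python sketch of the classical distribution-free result that underlies nonparametric tolerance intervals: the coverage of the sample extremes (X_(1), X_(n)) follows a Beta(n-1, 2) distribution for any continuous population, which gives a closed form for the confidence that (min, max) captures at least a proportion P. The function names are our own, not those of the ‘tolerance’ package.

```python
def twosided_np_confidence(n, P):
    """Confidence that the sample extremes (min, max) of an i.i.d. sample of
    size n from any continuous distribution capture at least a proportion P
    of the population.  The coverage of (X_(1), X_(n)) follows a
    Beta(n - 1, 2) distribution, whose CDF at P is n*P**(n-1) - (n-1)*P**n."""
    return 1 - n * P ** (n - 1) + (n - 1) * P ** n

def min_sample_size(P, conf):
    """Smallest n such that (min, max) is a two-sided (conf, P)
    nonparametric tolerance interval."""
    n = 2
    while twosided_np_confidence(n, P) < conf:
        n += 1
    return n
```

For example, `min_sample_size(0.95, 0.95)` returns 93: at least 93 observations are needed before the sample range can serve as a two-sided 95%/95% nonparametric tolerance interval.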
Outline & Objectives
This PCD will address topics of statistical tolerance intervals relevant to industry regulators and occasional users. It will be specifically designed to engage researchers and practitioners from industry and government. Each topic presented will be highlighted by relevant examples, as well as a computing demonstration using the instructor’s R package ‘tolerance.’ Attendees will be encouraged to download the package and come prepared to work through some real data exercises. The outline of the PCD is as follows:
0. Intro to tolerance intervals and the ‘tolerance’ package
1. Normal tolerance intervals
2. Nonparametric tolerance intervals
3. Some non-normal tolerance intervals
4. Regression tolerance intervals
5. Multivariate tolerance regions
6. Open discussion: What is needed?
Relevant examples that will be presented include the following: quality control, medical device validation, assessing biosimilarity, hospital capacity planning, establishing reference regions in laboratory medicine, and cancer data.
About the Instructor
The authors of this PCD are Dr. Derek Young (University of Kentucky) and Dr. Kedai Cheng (University of North Carolina – Asheville). Dr. Young received his PhD in Statistics from Penn State University in 2007, where his research focused on computational aspects of novel finite mixture models. He subsequently worked as a Senior Statistician for the Naval Nuclear Propulsion Program (Bettis Lab) for 3.5 years, where he engaged with engineers and nuclear regulators, often regarding the calculation of tolerance intervals. He then worked as a Research Mathematical Statistician for the US Census Bureau for 3 years. Most recently, he joined the faculty of the Department of Statistics at the University of Kentucky in the fall of 2014, where he is currently a tenured Associate Professor.
Dr. Kedai Cheng will be the instructor for this training. Dr. Cheng received his PhD in Statistics from the University of Kentucky in the summer of 2020. He is currently an Assistant Professor of Statistics at the University of North Carolina – Asheville. His main research interests lie at the intersection of tolerance regions and time series data. He has also been instrumental in advancing the graphics capabilities of the ‘tolerance’ package.
Relevance to Conference Goals
This PCD will primarily relate to goals in Theme 3, with some secondary components related to Theme 4. The content of this PCD will inform the attendees as to best practices when using statistical tolerance regions. This will include, but is not limited to, understanding when tolerance regions should be used, how to interpret tolerance regions, what methods are available for various data settings, and what computational tools are available in the ‘tolerance’ package. Improved data visualizations have also been recently included in the ‘tolerance’ package, so users will be exposed to state-of-the-art visualizations that include estimated tolerance regions. Practitioners stand to gain the most from this PCD, as they frequently need to use statistical tolerance regions to show compliance of their process with certain standards or regulatory specifications. The interpretations of the tolerance regions, both quantitatively and visually, will be thoroughly emphasized, as the practitioners in attendance will almost certainly have to communicate these results to non-statisticians.
Thu, Feb 3
9:00 AM - 11:00 AM
Virtual
T01 - Network Analysis to Solve Business Problems
Tutorial
Instructor(s): Carlos Pinheiro, SAS Institute
Network analysis includes graph theory algorithms that can augment data mining and machine learning. In many practical applications, pairwise interaction between the entities of interest in the model often plays an important role. Network analysis goes beyond traditional clustering and predictive models to identify patterns in business data, including entities’ behavior based on their relationships. Network analysis can be employed to avoid churn, diffuse products and services, detect fraud and abuse, identify anomalies, and many other applications, in a wide range of industries such as communications and media, banking, insurance, retail, utilities, and travel and transportation.
Outline & Objectives
Section 1: Fundamental Concepts in Network Analysis
Introduction
Concepts about network analysis
The type of data for network building and network analysis
Section 2: Sub-Networks
Connected components
Bi-connected components
Community detection
Reach
Core
Section 3: Centrality Measures
Degree
Influence
Clustering coefficient
Closeness
Betweenness
Hub
Authority
Eigenvector
PageRank
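As one concrete example of the centrality measures listed above, here is a minimal power-iteration sketch of PageRank in plain Python. It is an illustrative implementation under standard assumptions (damping factor 0.85, dangling nodes redistributing their rank uniformly) and is not tied to any particular software covered in the tutorial.

```python
def pagerank(adj, d=0.85, n_iter=100):
    """Power-iteration PageRank.  adj maps each node to its list of
    out-neighbors; d is the damping factor."""
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}          # start from a uniform score
    for _ in range(n_iter):
        new = {v: (1 - d) / n for v in nodes}   # teleportation mass
        for v in nodes:
            out = adj[v]
            if out:
                share = d * rank[v] / len(out)  # split rank among out-links
                for w in out:
                    new[w] += share
            else:
                # dangling node: spread its rank uniformly over all nodes
                for w in nodes:
                    new[w] += d * rank[v] / n
        rank = new
    return rank

# A directed 3-cycle: by symmetry every node ends up with rank 1/3
ranks = pagerank({"a": ["b"], "b": ["c"], "c": ["a"]})
```

Degree, closeness, and the other measures in the outline follow the same pattern of scoring each entity from the structure of its relationships rather than from its attributes alone.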
About the Instructor
Carlos Pinheiro is a Principal Data Scientist at SAS, US, and a Visiting Professor at Data ScienceTech Institute, France. He led analytical teams at Embratel, Brasil Telecom, and Oi; worked as a Senior Data Scientist for EMC on network analytics, optimization, and text analytics projects; and worked as a Lead Data Scientist for Teradata on machine learning projects. Dr. Pinheiro has a BSc in Mathematics and Computer Science, an MSc in Computing, and a DSc in Engineering from the Federal University of Rio de Janeiro. He has completed a series of postdoctoral research terms in different fields, including Dynamic Systems at IMPA, Brazil; Social Network Analysis at Dublin City University, Ireland; Transportation Systems at Université de Savoie, France; Dynamic Social Networks and Human Mobility at KU Leuven, Belgium; and Urban Mobility and Multi-modal Traffic at Fundação Getúlio Vargas, Brazil. He is the author of Social Network Analysis in Telecommunications and Heuristics in Analytics: A Practical Perspective of What Influences Our Analytical World, both published by John Wiley & Sons, Inc., and of Introduction to Statistical and Machine Learning Methods for Data Science, published by SAS Press.
Relevance to Conference Goals
This tutorial provides a practical perspective on how to use network analysis to solve real-world problems, including hands-on demonstrations of the algorithms and case studies. The course focuses on the data science approach to solving business problems, combining different techniques to evaluate the problem and propose optimal solutions.
Thu, Feb 3
9:00 AM - 11:00 AM
Virtual
T02 - Fundamentals of Study Design and Analysis Plans for Biomarker Research
Tutorial
Instructor(s): Douglas Landsittel, Indiana University, Bloomington
Biomarker studies are ubiquitous and critically significant across almost every discipline and stage of research. Biomarkers are often necessary for diagnosis of disease status, prognosis for future outcomes, and prediction of treatment response. They are also critical for understanding mechanisms of disease, monitoring disease progression, and identifying high risk populations most likely to benefit from interventions and medical treatments. In addition, surrogate biomarkers are often necessary when the true clinical endpoint is either impractical to measure or develops too slowly for effective intervention. However, despite the critical role of biomarker studies, many key aspects of their study designs and statistical analysis plans (SAPs) are poorly understood, thus leading to suboptimal funding proposals and poorly designed study protocols. This workshop describes the necessary concepts and key steps in study design and SAPs for biomarker research. Those concepts and specific steps are illustrated through description of challenges and approaches for two ongoing multi-site studies in polycystic kidney disease and severe acute respiratory infections (including COVID-19).
Outline & Objectives
The objectives of this half-day workshop are to describe, and illustrate examples of, best practices in biomarker study design and statistical analysis planning. Participants will gain skills to effectively design and write analysis plans for proposals and study protocols.
The content assumes only a basic knowledge of regression and study design (e.g. introductory biostatistics or epidemiology course).
Workshop Topics:
Part I: Introduction:
1) Definitions and applications
2) Biomarker panels, signatures and other high dimensional data
3) Case studies in polycystic kidney disease and COVID-19
Part II: Overview of Regression for Biomarker Analysis:
1) Goals of regression
2) Models for classification and prediction
3) Evaluation of accuracy
Part III: Types of Biomarkers and Associated Statistics:
1) Why it matters
2) Differential expression, correlation, diagnosis, prognosis, and response prediction
3) Surrogate markers and clinical endpoints
Part IV: Study Designs and Phases of Biomarker Research
1) Subject selection, timing of measurements, and randomization
2) Classifying phases of biomarker development
3) The need for multiple studies.
Conclusions
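As a small illustration of the "evaluation of accuracy" topic in Part II above, the area under the ROC curve (AUC) for a classification biomarker can be computed directly from its rank-sum (Mann-Whitney) interpretation. This is a generic sketch with illustrative data and function names, not material from the workshop itself.

```python
def auc(labels, scores):
    """Area under the ROC curve via its rank-sum (Mann-Whitney)
    interpretation: the probability that a randomly chosen positive case
    receives a higher biomarker score than a randomly chosen negative case
    (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A hypothetical biomarker that perfectly separates cases (1) from controls (0)
perfect = auc([1, 1, 0, 0], [0.92, 0.81, 0.33, 0.20])
```

An AUC of 0.5 corresponds to a biomarker with no discriminative value, while 1.0 corresponds to perfect separation of cases from controls.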
About the Instructor
Dr. Landsittel is Professor and Chair of Epidemiology and Biostatistics in the School of Public Health at Indiana University Bloomington. He has published nearly 150 peer-reviewed papers, many of which focus on biomarker studies. Previously, he served as Associate Director of the Biostatistics Facility for the Hillman Cancer Center, Associate Director of the Center for Research on Healthcare Data Center, and Director of Biostatistics (for Research) for the Starzl Transplant Institute. He has also been appointed to study sections, chairs the Safety and Occupational Health Study Section, and has served on numerous other biomarker-related expert panels.
Relevance to Conference Goals
The Conference on Statistical Practice seeks to engage “statistical practitioners and data scientists” in real-world problems, including in the area of study design. This proposal addresses study design challenges in one of the most critical and common areas across nearly all disciplines: biomarker research.
Thu, Feb 3
9:30 AM - 3:30 PM
Virtual
Exhibits Open
Exhibits
Thu, Feb 3
11:00 AM - 12:30 PM
Virtual
CS16 - Teaching Effective Communication
Concurrent Session
Chair(s): Svetlana Ekisheva, NERC
Thu, Feb 3
11:00 AM - 12:30 PM
Virtual
CS17 - Ethics Panel
Concurrent Session
Chair(s): David J Corliss, Peace-Work
Thu, Feb 3
11:00 AM - 12:30 PM
Virtual
CS18 - At the Intersection of Statistics and Software
Concurrent Session
Chair(s): Margaret Betz, Purdue University
Thu, Feb 3
11:00 AM - 12:30 PM
Virtual
CS19 - Maximizing Model Learnings
Concurrent Session
Chair(s): Laura Kahn, Booz Allen Hamilton
Thu, Feb 3
11:00 AM - 12:30 PM
Virtual
CS20 - Power Skills Panel
Concurrent Session
Chair(s): Emily Griffith, North Carolina State University
Thu, Feb 3
12:30 PM - 1:30 PM
Virtual
PS3 - Poster Session 3
Poster Session
Thu, Feb 3
1:30 PM - 3:30 PM
Virtual
CS21 - Professional Practices
Concurrent Session
Chair(s): Margaret Betz, Purdue University
Thu, Feb 3
1:30 PM - 3:30 PM
Virtual
CS22 - Dependent Proportions and Causal Inference
Concurrent Session
Chair(s): Allison Florance, Novartis
Thu, Feb 3
1:30 PM - 3:30 PM
Virtual
CS24 - Communication to Empower the Audience
Concurrent Session
Chair(s): Heather Kitada Smalley, Willamette University
Thu, Feb 3
3:30 PM - 4:45 PM
Virtual
GS2 - Closing General Session
General Session
The closing session is an opportunity for you to interact with the CSP Steering Committee in an open discussion about how the conference went and how it could be improved in future years. CSP Steering Committee vice chair, Mac Turner, will lead a panel of committee members as they summarize their conference experience. The audience will then be invited to ask questions and provide feedback. The committee highly values suggestions for improvements gathered during this time.