All Times ET
Thu, Feb 3
9:00 AM - 11:00 AM
Virtual
PCD1 - Quantifying Data Disclosure Risk with the R Package SDCNway
Practical Computing Demo
Instructor(s): Tom Krenzke, Westat; Jianzhu Li, Westat
In the past decades, federal agencies and other data collectors have devoted enormous effort to protecting data confidentiality, to ensure that the data products they release do not enable the identification of individuals or entities given the information already available to the public. Evaluating and controlling disclosure risk has therefore become an indispensable step in disseminating statistical products and results. In this Practical Computing Demonstration, we lay out a general process for data disclosure risk assessment and estimation, and review several ways to quantify risk at the record level and at the file level. The theories under which these approaches were developed rest on specific models and assumptions; those assumptions may not hold when the approaches are applied to survey data, and the approaches also have limitations in their coverage of risk. This demonstration will give participants the tools to conduct a risk assessment. We will discuss practical issues one may encounter in risk estimation and provide guidance and insights on how to set up risk assessments and make decisions in this process. The R package SDCNway will be demonstrated through exercises with survey data. Knowledge of R is not necessary to participate; however, those familiar with R will have the opportunity to use the tool in some exercises.
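To give a flavor of the exhaustive-tabulation idea behind the record- and file-level risk metrics discussed in the demonstration, a minimal base-R sketch follows. It is illustrative only and does not use the SDCNway API; the data and variable names are hypothetical.

# Count how many records share each combination of quasi-identifiers and flag
# sample uniques; SDCNway provides its own functions for these and for
# model-based risk metrics.
set.seed(1)
survey <- data.frame(
  age_group = sample(c("18-34", "35-54", "55+"), 200, replace = TRUE),
  sex       = sample(c("F", "M"), 200, replace = TRUE),
  region    = sample(paste0("R", 1:6), 200, replace = TRUE)
)
key  <- interaction(survey$age_group, survey$sex, survey$region, drop = TRUE)
cell <- ave(rep(1, nrow(survey)), key, FUN = sum)   # cell size for each record
survey$sample_unique <- cell == 1                   # record-level risk flag
mean(survey$sample_unique)                          # file-level share of sample uniques
table(cell)                                         # distribution of cell sizes

Records that are unique, or nearly unique, on the quasi-identifiers are the ones most exposed to identity disclosure, which is the intuition the demonstrated risk metrics build on.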
Outline & Objectives
The objective is to give participants the ability to use a practical tool for an area of growing importance: data disclosure risk estimation. The outline for the demonstration is as follows:
• Overview and background of confidentiality protection
o General process of risk assessment
o Theories on quantifying risk
o Informed data treatments under statistical confidentiality protection
o Types of disclosure
o Risk estimation focuses on identity and attribute disclosure
o Intruder attacks
o Database reconstruction theorem
• General process of risk assessment
o Example of risk estimation process
• Risk quantification
o Matching to external data
o Use of risk metrics
▪ File and individual risk using risk metrics
• Log-linear modeling – Skinner and Shlomo (2008)
• Exhaustive tabulations
▪ Caveats
o Relevance to differential privacy
• Practical Issues from case study examples
o How do we select variables?
o Is only checking indirect identifiers enough?
o Several other issues and topics will be discussed
• Software
o Probabilistic record linkage
o Re-identification risk estimation
▪ R Package SDCNway
• Demonstration and exercises
• Summary
About the Instructor
Jianzhu (Jane) Li is a senior statistician with over 15 years of experience in survey research and statistical confidentiality. Before joining the Westat staff, she was a research assistant in the Joint Program in Survey Methodology at the University of Maryland and completed internships at NCHS and the National Cancer Institute. Her dissertation research focused on the adaptation of diagnostics for linear models to make them appropriate for survey data. Dr. Li has experience in many aspects of survey research, including sample design, nonresponse adjustment, imputation, variance estimation, and data confidentiality and disclosure protection.
Tom Krenzke is a Vice President and Associate Director in Westat’s Statistics and Evaluation Sciences Unit, with about 30 years of experience in statistical confidentiality, survey sampling, and estimation techniques. Mr. Krenzke adds new statistical capabilities by developing software for statistical disclosure control, nonresponse bias analysis, area sampling, and imputation. He is a Fellow of the American Statistical Association (ASA), vice chair of District 2 on ASA’s Council of Chapters, and was the 2020 chair of ASA’s Committee on Privacy and Confidentiality.
Relevance to Conference Goals
This practical computing demonstration will provide participants with opportunities to learn new statistical methodologies and best practices relating to data disclosure risk assessment. The demonstration will help statisticians apply the SDCNway tool in their jobs and learn how to communicate the results, along with their caveats and limitations. Quantifying risk provides data-driven evidence rather than reliance on judgment alone, thereby improving the data dissemination process and striking a better balance between disclosure risk and data utility.
Thu, Feb 3
9:00 AM - 11:00 AM
Virtual
PCD2 - Survival Analysis Using Stata
Practical Computing Demo
Instructor(s): Chuck Huber, StataCorp LLC
The first half of this workshop introduces the concepts and jargon of survival analysis including time-to-event data, different kinds of censoring, as well as graphical, nonparametric, semi-parametric, and parametric methods for modeling survival data. I then demonstrate how to use Stata's stset command to tell Stata about the features of a survival dataset, how to use Stata's st commands to fit models for survival data, and how to use margins, marginsplot, and stcurve to visualize the results of these models. The second half of this workshop includes advanced topics such as Cox regression with categorical and continuous time-varying covariates, Cox regression for interval-censored data, and shared-frailty models.
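The workshop itself is conducted entirely in Stata. Purely for orientation, the same basic workflow (declare the survival data, estimate Kaplan-Meier curves, run a log-rank test, fit a Cox model) can be sketched in R with the survival package and its built-in lung data; this R analogue is illustrative only and is not part of the demo.

library(survival)   # Surv() plays roughly the role of Stata's stset

# Kaplan-Meier estimate by sex (analogue of sts graph / sts list)
km <- survfit(Surv(time, status) ~ sex, data = lung)
plot(km, xlab = "Days", ylab = "Survival probability")

# Log-rank test (analogue of sts test)
survdiff(Surv(time, status) ~ sex, data = lung)

# Cox proportional-hazards model (analogue of stcox)
cox <- coxph(Surv(time, status) ~ age + sex, data = lung)
summary(cox)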
Outline & Objectives
3.1. Introduction to Survival Data and Censoring using stset
3.2. Nonparametric Estimation
3.2.1. Incidence Rates using stir
3.2.2. Life Tables using ltable
3.2.3. Kaplan-Meier Graphs/Tables using sts graph and sts list
3.2.4. Log-Rank Test using sts test
3.3. Semi-parametric Estimation
3.3.1. Cox Proportional-Hazards Regression using stcox
3.4. Parametric Estimation
3.4.1. Exponential, Weibull, and Gamma Models using streg
3.5. Cox Regression with Time-Varying Covariates
3.5.1. Categorical Time-Varying Covariates using stcox
3.5.2. Continuous Time-Varying Covariates using stcox, tvc()
3.6. Cox Regression for Interval-Censored Data using stintcox
3.7. Shared Frailty Models
3.7.1. Cox Proportional-Hazards Model using stcox, shared()
3.7.2. Parametric Regression Models using streg, shared() frailty()
About the Instructor
Chuck Huber is Director of Statistical Outreach at StataCorp and Adjunct Associate Professor of Biostatistics at both the Texas A&M School of Public Health and the New York University School of Global Public Health. He produces instructional videos for the Stata YouTube channel, writes blog posts, develops online NetCourses, and gives talks about Stata at conferences and universities around the world. Most of his current work is focused on statistical methods used by behavioral and health scientists. He has published in the areas of neurology, human and animal genetics, alcohol and drug abuse prevention, nutrition, and birth defects. Dr. Huber currently teaches survey sampling at New York University and introductory biostatistics at Texas A&M, where he previously taught categorical data analysis, survey sampling, and statistical genetics.
Relevance to Conference Goals
This workshop will provide an applied introduction to survival analysis for statisticians who are new to the topic. We will introduce the study designs, data management tasks, implementation, and analysis strategies that are unique to survival analysis. And we will show how to produce tables and graphs to communicate the results.
Thu, Feb 3
9:00 AM - 11:00 AM
Virtual
PCD3 - Methods and Applications of Finite Mixture Models, with Computing Demonstrations Using the R Package ‘mixtools’
Practical Computing Demo
Instructor(s): Derek S. Young, University of Kentucky, Department of Statistics
Finite mixture models are used to model data in which the observations are sampled from a population consisting of several homogeneous subpopulations, often called the components of the population, but where the subpopulation to which each observation belongs is unknown. Estimation of mixture components is therefore considered an unsupervised learning task. Practically speaking, we make a soft, probabilistic classification of each observation to a component, whereas cluster analysis performs a hard classification to a cluster, so finite mixture models are naturally used as the underlying models for model-based clustering. They can also be used for density estimation and viewed as a kind of kernel method.
Applications that use finite mixture models are found in nearly every field. Of course, the availability of computational tools and resources is crucial to doing analysis with mixture models. The R package ‘mixtools’ is a leading software package in this respect. This highly cited package (1,100+ citations) has been used to analyze diverse research questions involving quasar data, maize production, clustering of patients with leukemia, subpopulations of individuals with autism and schizophrenia, and certain color variants on manta rays.
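As a minimal illustration of what the computing demonstrations will look like (illustrative only; the session will work through its own exercises), a two-component Gaussian mixture can be fit to the Old Faithful waiting times with ‘mixtools’:

library(mixtools)
data(faithful)

set.seed(100)
fit <- normalmixEM(faithful$waiting, k = 2)   # EM fit of a 2-component normal mixture

fit$lambda            # estimated mixing proportions
fit$mu                # component means
fit$sigma             # component standard deviations
head(fit$posterior)   # soft (probabilistic) component memberships

plot(fit, density = TRUE)   # fitted mixture density over the histogram

The posterior probabilities are the "soft" classifications mentioned above, in contrast to the hard assignments produced by cluster analysis.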
Outline & Objectives
This PCD will address topics of finite mixture modeling relevant to practitioners. It will be specifically designed to engage researchers and practitioners from industry and government. Each topic presented will be highlighted by relevant examples, as well as a computing demonstration using the instructor’s R package ‘mixtools.’ Attendees will be encouraged to download the package and come prepared to work through some real data exercises. The outline of the PCD is as follows:
0. Intro to finite mixture models and the ‘mixtools’ package
1. Gaussian mixture models
2. Parametric and semiparametric mixtures of regressions
3. Other parametric mixture models
4. Determining the number of components
5. Visualizing estimated mixture models
6. Open discussion: What is needed?
Relevant examples that will be presented include the following: analysis of quasar data, modeling biomarkers for Diffuse Large B-Cell Lymphoma (DLBCL), an analysis of the propagation rate of viral infection in potato plants, and identifying subgroups of strategies in a psychological task.
About the Instructor
The instructor received his PhD in Statistics from Penn State University in 2007, where his research focused on computational aspects of novel finite mixture models. He subsequently worked as a Senior Statistician for the Naval Nuclear Propulsion Program (Bettis Lab) for 3.5 years and then as a Research Mathematical Statistician for the US Census Bureau for 3 years. He joined the faculty of the Department of Statistics at the University of Kentucky in the fall of 2014, where he is currently a tenured Associate Professor.
While at the Bettis Lab, the instructor engaged with engineers and nuclear regulators, often regarding the calculation of tolerance intervals. While at the Census Bureau, the instructor wrote several methodological and computational papers, many as the sole author. Since joining the University of Kentucky, the instructor has further advanced his research agenda in finite mixture modeling, zero-inflated modeling, and tolerance regions. The instructor also received a highly competitive grant from the Chan Zuckerberg Initiative for the 2021 calendar year.
The instructor has extensive teaching experience at all levels of education, including continuing education courses taught at the Bettis Lab. He was a Lecturer (remotely) for Penn State University from Spring 2008 to Fall 2013, where he taught a master's-level regression methods course.
Relevance to Conference Goals
This PCD will primarily relate to goals in Theme 3, with some secondary components related to Theme 4. The content of this PCD will inform attendees about best practices when using finite mixture models. This will include, but is not limited to, formulating hypotheses about the presence of subpopulations in the application at hand, understanding at a high level when mixture models should be used, gaining insight into the algorithms used for maximum likelihood estimation, understanding limitations of estimating mixture models, and gaining proficiency with tools available in the ‘mixtools’ package. Improved data visualizations are currently being developed for and included in the ‘mixtools’ package, so users will be exposed to state-of-the-art visualizations for the package. Interpretations of the estimated mixture models presented in this PCD will be thoroughly emphasized, as the practitioners in attendance will almost certainly have to communicate their estimated models to non-statisticians.
Thu, Feb 3
9:00 AM - 11:00 AM
Virtual
PCD4 - State-of-the-Art with Statistical Tolerance Regions: Methods and Applications, with Computing Demonstrations Using the R Package ‘tolerance’
Practical Computing Demo
Instructor(s): Kedai Cheng, UNC-Asheville
Statistical tolerance intervals of the form (1−α, P) provide bounds to capture at least a specified proportion P of the sampled population with a given confidence level 1−α. The quantity P is called the content of the tolerance interval, and the confidence level 1−α reflects the sampling variability. Statistical tolerance intervals are ubiquitous in regulatory documents, especially regarding design verification and process validation. Examples of such regulations are those published by the Food and Drug Administration (FDA), the Environmental Protection Agency (EPA), the International Atomic Energy Agency (IAEA), and the standard 16269-6 of the International Organization for Standardization (ISO). Research and development in the area of statistical tolerance intervals has undoubtedly been guided by the needs and demands of industry experts.
Some of the most germane biopharmaceutical applications of tolerance intervals include their use in quality control of drug products, setting process validation acceptance criteria, establishing sample sizes for process validation, and assessing biosimilarity. Tolerance intervals are available for numerous parametric distributions, and procedures exist for regression models, mixed-effects models, and multivariate settings (i.e., tolerance regions). Nonparametric procedures are also available and commonly employed.
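As a minimal sketch of the kind of computation the demonstration covers (simulated data; illustrative only), a two-sided (95%, 99%) normal tolerance interval and its distribution-free counterpart can be obtained with the ‘tolerance’ package:

library(tolerance)

set.seed(7)
x <- rnorm(50, mean = 100, sd = 4)   # e.g., a measured quality characteristic

# Bounds intended to capture at least 99% of the population with 95% confidence
normtol.int(x, alpha = 0.05, P = 0.99, side = 2)

# Nonparametric (order-statistic) interval for the same content and confidence
nptol.int(x, alpha = 0.05, P = 0.99, side = 2)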
Outline & Objectives
This PCD will address topics of statistical tolerance intervals relevant to industry regulators and occasional users. It will be specifically designed to engage researchers and practitioners from industry and government. Each topic presented will be highlighted by relevant examples, as well as a computing demonstration using the instructor’s R package ‘tolerance.’ Attendees will be encouraged to download the package and come prepared to work through some real data exercises. The outline of the PCD is as follows:
0. Intro to tolerance intervals and the ‘tolerance’ package
1. Normal tolerance intervals
2. Nonparametric tolerance intervals
3. Some non-normal tolerance intervals
4. Regression tolerance intervals
5. Multivariate tolerance regions
6. Open discussion: What is needed?
Relevant examples that will be presented include the following: quality control, medical device validation, assessing biosimilarity, hospital capacity planning, establishing reference regions in laboratory medicine, and cancer data.
About the Instructor
The authors of this PCD are Dr. Derek Young (University of Kentucky) and Dr. Kedai Cheng (University of North Carolina – Asheville). Dr. Young received his PhD in Statistics from Penn State University in 2007, where his research focused on computational aspects of novel finite mixture models. He subsequently worked as a Senior Statistician for the Naval Nuclear Propulsion Program (Bettis Lab) for 3.5 years, where he engaged with engineers and nuclear regulators, often regarding the calculation of tolerance intervals. He then worked as a Research Mathematical Statistician for the US Census Bureau for 3 years. He joined the faculty of the Department of Statistics at the University of Kentucky in the fall of 2014, where he is currently a tenured Associate Professor.
Dr. Kedai Cheng will be the instructor for this training. Dr. Cheng received his PhD in Statistics from the University of Kentucky in the summer of 2020. He is currently an Assistant Professor of Statistics at the University of North Carolina – Asheville. His main research interests lie at the intersection of tolerance regions and time series data. He has also been instrumental in advancing the graphics capabilities of the ‘tolerance’ package.
Relevance to Conference Goals
This PCD will primarily relate to goals in Theme 3, with some secondary components related to Theme 4. The content of this PCD will inform attendees about best practices when using statistical tolerance regions. This will include, but is not limited to, understanding when tolerance regions should be used, how to interpret tolerance regions, what methods are available for various data settings, and what computational tools are available in the ‘tolerance’ package. Improved data visualizations have also recently been included in the ‘tolerance’ package, so users will be exposed to state-of-the-art visualizations that include estimated tolerance regions. Practitioners stand to gain the most from this PCD, as they frequently need to use statistical tolerance regions to show compliance of their processes with standards or regulatory specifications. The interpretations of the tolerance regions, both quantitative and visual, will be thoroughly emphasized, as the practitioners in attendance will almost certainly have to communicate these results to non-statisticians.