Online Program



Thursday, February 15
Thu, Feb 15, 7:00 AM - 6:30 PM

SC1 Introduction to Big Data Analysis
Thu, Feb 15, 8:00 AM - 5:30 PM
Instructor(s): Fulya Gokalp Yavuz, Yildiz Technical University; Mark Daniel Ward, Purdue University
This one-day introductory workshop is geared toward CSP participants who want to revitalize or improve their data analysis skills, with an emphasis on big data. Ward and Gokalp Yavuz will present tools and techniques for the most fundamental, low-level aspects of data analysis. Both instructors are well-versed in teaching such techniques to students who have no background in data analysis or programming, and the workshop will bring participants up to speed with powerful techniques for data analysis. This one-day course has no prerequisites. The workshop will be hands-on and driven by examples using large data sets. The intended participants are people who work in a data-driven environment and have an increasing need to perform aspects of large data analysis. Before data can be analyzed, a great deal of data manipulation is necessary, especially when working with big data sets: data often must be scraped from remote sources and parsed into more natural forms, a process that involves munging and cleaning. The ability to reproduce and reliably verify all of the methods used for the data wrangling is more important than ever.

Outline & Objectives

R will be the main tool utilized in the workshop. The workshop is geared toward practitioners with (perhaps) only a limited knowledge of R, or even no knowledge of R at all. For instance, someone who has previously used (only) Excel, SAS, or Tableau for data analysis is a perfect candidate for this all-day immersive workshop. We endeavor to use R and its XML scraping and parsing libraries for pulling raw data from disparate sources on the internet and wrangling them into forms amenable to data analysis.

The entire workshop will be example-driven. Participants should bring a laptop computer (Mac, Windows, and UNIX are all welcome). We will work in RStudio. Instructions for installing the necessary software will be sent to the participants before the workshop starts. We will use R Markdown for creating reproducible documents.

By the end of the one-day workshop, participants will have learned how to scrape data sets from the web, parse the desired portions of the data, wrangle them into a form suited for data analysis, and clean and verify the results. Reproducible paradigms and reliability will be emphasized throughout the workshop.
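The scrape-parse-wrangle cycle described above is taught in the workshop using R and its XML libraries. As a language-neutral illustration only, here is a minimal Python sketch (with made-up data) of parsing raw XML into analysis-ready records:

```python
import xml.etree.ElementTree as ET

# A tiny stand-in for raw XML scraped from a remote source (hypothetical data).
raw = """
<sales>
  <store town="Ames"><aisle id="3" revenue="120.5"/><aisle id="4" revenue="98.0"/></store>
  <store town="Boone"><aisle id="3" revenue="87.25"/></store>
</sales>
"""

def parse_sales(xml_text):
    """Parse raw XML into a list of plain records (the 'wrangling' step)."""
    root = ET.fromstring(xml_text)
    records = []
    for store in root.findall("store"):
        for aisle in store.findall("aisle"):
            records.append({
                "town": store.get("town"),
                "aisle": int(aisle.get("id")),
                "revenue": float(aisle.get("revenue")),
            })
    return records

records = parse_sales(raw)
print(records[0])  # {'town': 'Ames', 'aisle': 3, 'revenue': 120.5}
```

The same idea in R would use an XML-parsing package to walk the document tree and build a data frame; the point is that raw scraped markup becomes flat, typed records ready for analysis.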

About the Instructor

Dr. Mark Daniel Ward is an Associate Professor of Statistics at Purdue University. Ward has years of experience teaching fundamental data analysis techniques to students who often have no previous experience with such tools. He emphasizes new computational tools, including R, data visualization, UNIX, bash shell scripting, regular expressions, SQL, XML, etc. Ward firmly believes in team-oriented environments for learning data analysis. He coordinates the Statistics Living Learning Community at Purdue, a $1.5 million NSF grant in which students are immersed in a year-long data analysis environment that blends the undergraduate Statistics coursework with research opportunities, professional development, extracurricular data analysis activities, etc.

Dr. Fulya Gokalp Yavuz is currently a post-doc in the Department of Statistics at Purdue University. She has been working with Dr. Ward since July 2016 on training in data science, and she is enthusiastic about teaching new data science topics with new methods. She has teaching experience in statistical subjects such as multivariate statistics and in statistical software such as R.

Relevance to Conference Goals

This workshop fits squarely within Theme 3 of the CSP workshop, namely, "Big Data and Data Science". The workshop should, according to CSP's description, "help practitioners working in these fields stay current with state-of-the-art methods". The workshop should be especially appealing to people who yearn to move into more data-oriented tasks in the workplace, but who have not (yet) moved beyond traditional spreadsheet or database tools for data analysis.

The workshop will have a learning-by-doing methodology, in which the participants will be actively learning, rather than listening to lectures.

By understanding the fundamental tools for data analysis, the participants will be better enabled to move onwards to statistical methods after having learned a great deal about the computational resources that are needed for reproducible data wrangling at the earliest stages of the data analysis cycle.

Software Packages

R, RStudio, XML libraries, R Markdown. Ward and Gokalp Yavuz will provide computational resources for the participants to use. If participants bring a laptop computer, they can utilize our computational environment within a web browser. There is no need to install any software before the workshop, and no previous background is required.

SC2 An Introduction to d3.js: From Scattered to Scatterplot
Thu, Feb 15, 8:00 AM - 5:30 PM
Instructor(s): Scott Murray, O’Reilly Media
Interested in coding data visualizations on the web, but don't know where to start? This workshop will have you transforming data into visual images in no time at all, starting from scratch and building an interactive scatterplot by the end of the session. We'll use d3.js, the web's most powerful library for data visualization, to load data and translate values into SVG elements — drawing lines, points, and scaled axes to label our data. We’ll learn how to use motion and visual transitions, and introduce simple interactivity to make our charts more explorable.

All methods and examples will be up-to-date for the current version of D3 (4.x as of this writing).

Outline & Objectives

Audience and Prerequisites:

Intended for absolute beginners new to D3, yet with some prior programming experience (though not necessarily JavaScript), and some prior web experience (HTML, CSS). Participants should also be comfortable working with basic data formats (such as CSV files).


- Intro to D3 as a tool
- Set up with empty page template
- Selecting elements
- Creating elements
- SVG images and elements
- Data in JavaScript (arrays)
- Binding data to create elements
- Using transitions between states
- Using scales to position elements
- Adding axes
- Transitions and motion
- Interactivity

About three-quarters of the day will be devoted to the core concepts listed above. The remainder of our time will be spent on topics most relevant to participants. This could include additional topics (different data types, visual or interaction design concerns, geographic maps) or small group exercises and consultation for participants sharing similar concerns.


Participants will leave comfortable using D3 to load data into the browser and map that data to visual elements.
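Scales are the conceptual hinge of the outline above: a linear scale maps a data domain onto a pixel range. The arithmetic behind d3.scaleLinear() can be sketched in a few lines (shown here in plain Python rather than JavaScript, purely as a language-neutral illustration, not as d3 code):

```python
def linear_scale(domain, rng):
    """Return a function mapping values in `domain` onto `rng`,
    mirroring what a d3 linear scale does for chart coordinates."""
    d0, d1 = domain
    r0, r1 = rng
    def scale(x):
        return r0 + (x - d0) * (r1 - r0) / (d1 - d0)
    return scale

# Map data values in [0, 100] onto an SVG x-axis 500 pixels wide.
x = linear_scale((0, 100), (0, 500))
print(x(0), x(50), x(100))  # 0.0 250.0 500.0
```

In D3 4.x itself this is written as d3.scaleLinear().domain([0, 100]).range([0, 500]), and the resulting function is used to position SVG elements and feed axis generators.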

About the Instructor

Scott Murray is a designer, creative coder, and artist who writes software to create data visualizations and other interactive phenomena. His work incorporates elements of interaction design, systems design, and generative art. Scott is in the Learning Group at O’Reilly Media, is author of the O’Reilly title “Interactive Data Visualization for the Web” (the second edition of which will be published in 2017), and has presented two video courses on D3. Scott is also affiliated with the Visualization and Graphics Lab at the University of San Francisco, where he has taught data visualization and interaction design. He is also a Senior Developer for Processing, and is writing a new book with O’Reilly, “Creative Coding and Data Visualization with p5.js: Drawing on the Web with JavaScript.” Scott earned an A.B. from Vassar College and an M.F.A. from the Dynamic Media Institute at the Massachusetts College of Art and Design. His work can be seen at

Relevance to Conference Goals

By the end of this course, participants will be familiar with the most powerful tool for web-based data visualization, and therefore in a good position to better communicate their findings to a global audience. D3 familiarity is in demand with employers, and the skills learned in this session can be applied immediately, for a wide range of projects.

That said, please note that D3 is intended for custom visualization—it has no “templates” or preset “views” or chart types. Exploratory tools like Tableau already serve this purpose. This course is about learning the core concepts of D3, so you can use it to design and develop your own highly customized, interactive data visualizations.

Software Packages

This course will rely heavily on:

- Web standard technologies built into every browser (HTML, CSS, SVG, JavaScript)
- D3 (free, and will be provided; also see

Please bring to the workshop a laptop with the following installed:

- Chrome
- A code editor (I recommend Atom, which is free)

You will also need the ability to run a local web server. You can accomplish this either by:

- Installing a web server application (such as MAMP or WAMP). This is the friendliest, GUI-based approach, but it requires you to download and install everything in advance of the course.
- Using Python or another tool to run a simple server via terminal commands. This requires no additional installation on macOS.

Code examples will be distributed at the event. All code examples will be updated and tested with the current version of the software (4.x at the time of this writing) as of February 2018.

SC3 Collaboration Essentials for Practicing Statisticians and Data Scientists
Thu, Feb 15, 8:00 AM - 12:00 PM
Instructor(s): Heather Smith, Cal Poly; Eric Vance, LISA--University of Colorado Boulder
Statisticians and data scientists positively impact many people, organizations, and governments through the careful collection, analysis, and interpretation of data to solve problems and make decisions. To maximize their impact, statisticians and data scientists must effectively collaborate with a variety of domain experts who originate the data or the problems to be solved. In this short course, participants will learn and practice essential skills to improve their professional communication and collaboration to increase their effectiveness on the job. Specifically, participants will learn how to establish foundational collaborative relationships with domain experts; structure effective meetings; and effectively communicate with non-statisticians. Participants will also practice their newly acquired skills and learn how to improve their proficiency in these essential collaboration skills by using role-plays and video coaching and feedback reviews outside of this short course. In sum, participants will learn and practice how to leverage their technical skills to more effectively collaborate for maximal impact inside and outside of their organizations.

Outline & Objectives

Our goal is to unlock the collaborative potential of participants (from beginning to advanced) so they can return to key roles within their organizations and achieve greater impact. Prerequisites are a desire to improve one’s personal effectiveness and openness to try new methods and ways of thinking in the practice of statistics and data science.

Outline and Objectives:
1. Learn how to build foundational collaborative relationships with clients, colleagues, and other domain experts by applying the Fundamental Law of Statistical Collaboration and the QQQ process.

2. Learn how to structure and conduct effective meetings using the POWER structure (Prepare-Open-Work-End-Reflect).

3. Analyze the opening and ending structures of a real meeting (on video) and/or a live role-play using rubrics.

4. Practice applying the POWER structure via focused role-plays with subsequent coaching and feedback.

5. Learn tips for effectively communicating with non-statisticians.

6. Practice listening, summarizing, and paraphrasing statistical and subject matter content; asking good questions; and explaining statistics to non-statisticians using the ADEPT method.

About the Instructor

For the past 9 years, Dr. Eric Vance, an Associate Professor at the University of Colorado, has been the director of LISA (Laboratory for Interdisciplinary Statistical Analysis) where he has trained 245 statisticians to move between theory and practice to collaborate with 9000+ domain experts to apply statistics and data science to answer their research or business questions. He has taught workshops and webinars on collaboration around the world, including three workshops on this topic at JSM from 2014-2016 with Heather Smith.

Heather Smith has 27 years of experience consulting with academic, industrial, service, and government clients in the United States, Europe, and Asia. She began this work as a statistical consultant at Westat, Inc. For 20 years she has been a faculty member in the Statistics Department at Cal Poly San Luis Obispo where she consults with academic and private sector researchers and teaches a wide variety of applied statistics courses including two courses she developed for undergraduate statistics majors, one in Statistical Communication and one in Statistical Consulting. She has offered over a dozen workshops, short courses, and webinars on these topics.

Relevance to Conference Goals

This short course is immediately relevant for two of the three main conference goals. If selected, this short course will teach participants how to better communicate and collaborate with their clients and customers, will provide them with skills and practice on how to have a positive impact on their organization, and will enhance their professional development.

Participants will learn best practices in statistical consulting and collaboration that will enhance their organizational impact and lead to career development and advancement. Participants will return to their jobs with new ideas, techniques, and strategies to improve their ability to communicate and collaborate effectively, resulting in a greater impact on their organizations and increasing the overall impact of statistics and data science in the world at large.

Note: this short course will not be offered at JSM in 2017 because organizers believed that it was a better fit for CSP.

Software Packages

We will not be using any software in this short course.

SC4 A Variety of Mixed Models: Linear, Generalized Linear, and Nonlinear
Thu, Feb 15, 8:00 AM - 12:00 PM
Instructor(s): David A. Dickey, NC State University
The MIXED procedure in SAS, for example, correctly handles linear models that have multiple sources of random effects, such as random town-to-town, store-to-store, and aisle-to-aisle variation in sales. Associated fixed effects might be product price, color of packaging, and amount spent on advertising. The course begins with a checklist for deciding when to treat effects as random versus fixed and follows with a series of examples. When the response variable is not normal, for example with a binary or Poisson response, additional complexities arise. Models with such non-normal responses are often analyzed by assuming that some transformation, or link function, of the expected value of Y results in a linear model with fixed and random effects; we are then in the generalized linear mixed model setting. It may be that a model cannot be linearized by a transformation, making it a nonlinear model; if random effects are involved, the model is referred to as a nonlinear mixed model. With a minimal amount of theory and an emphasis on examples, these types of models will be explained and illustrated. SAS will be used, but the ideas and interpretation are software independent.
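In symbols (the standard formulation, not notation taken from the course materials): with link function g, fixed-effect design matrix X, and random-effect design matrix Z, the generalized linear mixed model described above is

```latex
g\bigl(\mathrm{E}[\,Y \mid u\,]\bigr) = X\beta + Zu, \qquad u \sim N(0,\, G)
```

where g is the identity for the ordinary linear mixed model, the logit for a binary response, or the log for a Poisson response. When no such g linearizes the mean function, the model is a nonlinear mixed model.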

Outline & Objectives

The presence of random effects in modelling can easily go unnoticed, and yet it has profound effects on inference. For that reason, this course shows, through a series of descriptions and examples, how to recognize this situation and deal with it. A surprising variety of models, such as split plots, unbalanced block designs, and repeated measures, to name a few, fall into the linear mixed models category. The impact of correctly incorporating random effects will be illustrated with a simple example. The slightly more complex cases of non-normal responses and nonlinear associations are also profoundly affected by the presence of random effects, and examples of these will be included.

About the Instructor

David A. Dickey is William Neal Reynolds Distinguished Professor of Statistics at North Carolina State University. He is the co-inventor of the "Dickey-Fuller test" that is commonly discussed in time series texts and is present in many time series software packages. A Fellow of ASA, Dave has presented at all but one of the past CSP conferences and has been program chair of the Business and Economic Statistics section for JSM. At NCSU he was a founding faculty of the Institute for Advanced Analytics, is a member of the Integrated Manufacturing and Systems Engineering Institute, and the Financial Math program. He has an associate appointment in Economics at NCSU and is a member of the Academy of Outstanding Teachers. As a contract instructor for SAS Institute he has taught and helped develop many training courses, including those on time series and mixed models. He is a frequent presenter at SAS Global Forum and is an author in their Books by Users series. He has coauthored several books and written many research articles as well as advising over a dozen PhD students.

Relevance to Conference Goals

The attendees will leave with a new appreciation of modelling and insights into not only how to recognize random effects but how to deal with them. They will have concrete tools to deal with this phenomenon, which is very common in practical data analysis. Underlying ideas will be explained in understandable terms and illustrated with interesting and informative examples. Users of SAS will be able to immediately apply the content to their own work, and users of other software, with a quick review of syntax, should also be able to "hit the ground running" upon returning to work. I anticipate that this course will raise the level of analysis and insight for all attending.

Software Packages

SAS will be used exclusively but the emphasis on ideas and interpretation should be software independent.

SC5 Cleaning Up the Data Cleaning Process: Challenges and Solutions in R
Thu, Feb 15, 8:00 AM - 12:00 PM
Instructor(s): Claus Thorn Ekstrøm, Biostatistics, University of Copenhagen; Anne Helby Petersen, Biostatistics, University of Copenhagen
Data cleaning and validation are the first steps in any data analysis, as the validity of the conclusions from the analysis hinges on the quality of the input data. Mistakes in the data can arise for any number of reasons, including erroneous codings, malfunctioning measurement equipment, and inconsistent data generation manuals. We present a systematic, analytical approach to data cleaning that ensures the data cleaning process is just as structured and well-documented as the rest of the data analysis. The primary software tool is the dataMaid R package, which implements an extensive and customisable suite of quality assessment tools that can be used to identify potential problems in a dataset. The results are summarised in an auto-generated, non-technical, stand-alone document readable by statisticians and non-statisticians alike. Thus, the course teaches practical skills that aid the dialogue between data analysts and field experts, while also providing easy documentation of reproducible data cleaning steps and data quality control.

Outline & Objectives

The course will alternate between instruction and hands-on interactive sessions, where the participants work with messy data in R, mostly using the dataMaid R package. In this way, we establish common ground and a common vocabulary for understanding and describing the process of data cleaning as an analytical practice, rather than a number of ad hoc steps. Moreover, the participants will be introduced to the possibilities of the dataMaid R package and will learn how to use the software to produce documentable data overview reports relevant to their specific data cleaning needs.

If necessary, the course will split into two parallel sessions where experienced R developers are introduced to the semantics of writing dataMaid extensions, while less trained R users will focus on how dataMaid can be used interactively in the R console, so attendees of all skill levels are encouraged to join.

Participants are assumed to be R-users, but not necessarily familiar with writing R extensions.
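To give a feel for the kind of per-variable quality checks such an auto-generated report gathers, here is an illustrative sketch (plain Python with made-up thresholds, not dataMaid's actual API or rules):

```python
def check_column(name, values, missing_codes=(-99, 999)):
    """Flag common data-quality problems in one variable: suspected
    missing-value codes and values far from the bulk of the data.
    (A toy version of the per-variable checks such tools automate.)"""
    problems = []
    suspicious = [v for v in values if v in missing_codes]
    if suspicious:
        problems.append(f"{name}: possible missing-value codes {sorted(set(suspicious))}")
    clean = [v for v in values if v not in missing_codes]
    if clean:
        mean = sum(clean) / len(clean)
        sd = (sum((v - mean) ** 2 for v in clean) / len(clean)) ** 0.5
        outliers = [v for v in clean if sd > 0 and abs(v - mean) > 3 * sd]
        if outliers:
            problems.append(f"{name}: possible outliers {outliers}")
    return problems

print(check_column("age", [34, 29, 41, -99, 37]))
# ['age: possible missing-value codes [-99]']
```

In the course itself, checks like these are configured and extended within dataMaid and collected into a stand-alone report rather than printed one variable at a time.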

About the Instructor

Claus Thorn Ekstrøm is professor at the Section of Biostatistics, University of Copenhagen, and has taught statistics courses at the bachelor, master, and graduate levels for more than 15 years. He is the creator of and a contributor to a number of R packages (dataMaid, MESS, MethComp, SuperRanker) and is the author of "The R Primer" book. He has previously given tutorials on dynamic graphics in R and the role of interactive graphics in teaching, and won the C. Oswald George prize for his article "Teaching 'Instant Experience' with Graphical Model Validation Techniques" in 2014.

Anne Helby Petersen holds an MSc in statistics and is the main author of the dataMaid R package and the companion scientific manuscript. She has worked as a TA on several courses in mathematics and statistics.

Relevance to Conference Goals

The short course teaches the participants new practical tools that will aid their daily work with data cleaning and data quality assessment. More specifically, the participants will be able to use the standard dataMaid solution and to make simple, customised extensions, thereby targeting a wide variety of data cleaning challenges. As the dataMaid software focuses on auto-generated reports that are readable by non-R users, these tools will also help the participants in their communication with collaborators, clients and field experts, who might not be familiar with R or statistics in general. Moreover, by discussing data cleaning, not as a nuisance, but as a real, scientific practice, the participants will find themselves to be better equipped for planning and time-framing data cleaning in the future.

Software Packages

In the course, we will only use open-source software within the domains of the statistical programming language R.
More specifically, we will use the R-package dataMaid, which is available through CRAN. The packages validate, editrules, and deducorrect will also be discussed.

SC6 Effective Presentation for Statisticians and Data Scientists: Success=(PD)^2
Thu, Feb 15, 1:30 PM - 5:30 PM
Instructor(s): Jennifer H Van Mullekom, Virginia Tech
Statisticians must be able to effectively convey their ideas to clients, collaborators, and decision-makers. Presenting in the modern world is even more daunting when speakers have the opportunity to employ slideware, videos, and live demos. Unfortunately, university coursework and professional development programs are often not targeted toward sharpening these skills. This short course, developed and taught by statisticians, will provide an opportunity to learn how to employ different methods and tools in each phase of the framework taught. The material covered in the course is geared toward data-based presentations and is based on the works of Garr Reynolds and Michael Alley, among others. The course will emphasize the importance of stepping away from the computer to Prepare an effective message aimed at your core point, guided by a series of questions and tips. The Design phase emphasizes the importance of structure, streamlining, and good graphic design, accompanied by a series of checklists. Of course, "practice makes perfect," so we cannot skip this step. Finally, engaging the audience and effectively using the room and equipment is covered in the Deliver phase.

Outline & Objectives

At the end of this course, participants will have an arsenal of techniques, methods, tips, and tricks to prepare, design, practice and deliver effective presentations to decision makers and research audiences.
I. Prepare
a. Questions you must answer before your presentation
b. Steps for creating the story of your facts
c. Tips and tricks
d. Deep dive into analogies, diagrams and examples for statistics
II. Design
a. Simplicity
b. Structure
c. Sight
d. Streamline
e. Data/Statistics Slide Makeover Exercises
III. Practice
a. How to practice
b. How to use practice to improve your delivery
IV. Deliver
a. How you look, sound, and move
b. Overcoming nerves
c. Give a 5 minute presentation using the techniques you have learned during the day
V. Special Topics
a. Webinars & Teleconferences
b. Global Audiences
c. Non-native English Speakers
d. Dealing with Difficult People
e. Casual Meeting Updates and Report Outs

About the Instructor

Jennifer Van Mullekom is currently an Associate Professor of Statistical Practice at Virginia Tech, where she leads the Laboratory for Interdisciplinary Statistical Analysis (LISA). There she provides statistical collaboration for on-campus research, and her duties include securing funding, setting direction, mentoring and teaching students, and providing technical statistical support to LISA. Prior to this role, she served as a Senior Consulting Statistician with DuPont. She has been actively involved in the American Statistical Association's Section on Physical and Engineering Sciences (SPES) since 1998 and has held various positions in the organization. Jen has participated in numerous conference committees with ASA, including the FTC and the CSP. She has also co-developed the American Statistical Association's "Effective Presentations for Statisticians" course. Her statistical areas of interest include equivalence testing, regression modeling, response surface designs, mixed models, and statistical engineering.

Relevance to Conference Goals

This short course embodies the topic of communicating complicated analyses in simple ways for non-statisticians/decision makers. Effective communication then encourages collaboration and consequently leads to career advances.

Note: This course was developed in conjunction with ASA's career success factors task force several years ago. It is set up to be a full 8 hours. Portions of it could be set up as either a half-day course or a tutorial, but the full content could not be condensed to two or four hours. It also works very well as two half-day sessions, which allows participants time to work on their presentations for the final session.

Software Packages

PowerPoint, Keynote, Prezi

SC7 Statistical Learning Methods in R
Thu, Feb 15, 1:30 PM - 5:30 PM
Instructor(s): Kelly Sue McConville, Swarthmore College
Applied statisticians are often confronted with difficult modeling problems where standard regression approaches are not appropriate. For example, it may be that the number of possible predictors is large relative to the sample size or that the relationship between the variables is non-linear. This course will cover several statistical learning techniques which are designed to handle these difficult modeling problems. In particular, we will study penalized regression techniques (lasso, ridge, elasticnet), non-parametric regression (regression and smoothing splines), and classification methods (support vector machines, trees). Using data from the Bureau of Labor Statistics, participants will learn how to fit these models in R. R Markdown files with the relevant code will be provided so that participants can actively follow along with the demonstrations.

Outline & Objectives

The three main topics of the course are:

1. Penalized parametric regression with the lasso, ridge and elasticnet.

2. Penalized nonparametric regression with regression and smoothing splines.

3. Classification with logistic regression and support vector machines.

By the end of the course, participants should

• Have a basic understanding of several statistical learning methods and their applicability.

• Be able to build the models in R.

• Be able to compute measures that allow for comparisons between methods.
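As a taste of topic 1: for a single centered predictor, the ridge estimator simply shrinks the least-squares slope toward zero as the penalty grows. A minimal sketch of that arithmetic (illustrative only; in the course itself these models are fitted with R packages):

```python
def ridge_slope(x, y, lam):
    """Ridge estimate of the slope for a single predictor after centering:
    sum(x*y) / (sum(x^2) + lam). Larger lam shrinks the OLS slope toward 0."""
    xm = sum(x) / len(x)
    ym = sum(y) / len(y)
    xc = [xi - xm for xi in x]
    yc = [yi - ym for yi in y]
    sxy = sum(a * b for a, b in zip(xc, yc))
    sxx = sum(a * a for a in xc)
    return sxy / (sxx + lam)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]  # true slope 2
print(ridge_slope(x, y, 0.0))   # 2.0  (lam = 0 recovers OLS)
print(ridge_slope(x, y, 10.0))  # 1.0  (slope shrunk toward zero)
```

The lasso and elasticnet replace or mix this quadratic penalty with an absolute-value penalty, which is what allows them to set some coefficients exactly to zero.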

About the Instructor

Dr. McConville is an Assistant Professor of Statistics at Swarthmore College. She has a PhD in Statistics from Colorado State University. Her research focuses on the adaptation of statistical learning techniques to data from a complex sample design. She collaborates with the US Forest Service Forest Inventory and Analysis Program and the US Bureau of Labor Statistics. She teaches statistical learning and R in many of her courses at Swarthmore.

Relevance to Conference Goals

Through practical examples, the course will expose participants to popular statistical learning methods. Learning these powerful predictive techniques will expand their modeling toolbox.

Software Packages

R and RStudio will be used throughout the course. Participants are strongly encouraged to bring computers with R and RStudio installed beforehand.

SC8 NISS Shortcourse: A Survey of Modern Data Science
Thu, Feb 15, 1:30 PM - 5:30 PM
Instructor(s): David Banks, Dept. of Statistical Science, Duke University
Modern data science is driven by applications, and these often entail Big Data and machine learning perspectives. This short course reviews key ideas and methods in nonparametric regression, starting with cross-validation and light bootstrap asymptotics, then moving on to the additive model, the generalized additive model, and neural networks. It also covers variable selection, with the Lasso and the Median Model, and describes the p >> n problem in the context of contributions by Candes and Tao, Donoho and Tanner, and Wainwright. The course then treats classification, with emphasis upon Random Forests, boosting, and ensemble strategies such as bagging and stacking.
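Cross-validation, the first idea on the list, estimates out-of-sample error by repeatedly holding out one fold of the data. The index bookkeeping is simple to sketch (illustrative Python, not tied to any package covered in the course):

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k roughly equal, disjoint validation folds,
    as used in k-fold cross-validation to estimate out-of-sample error."""
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)
    return folds

for fold in kfold_indices(10, 3):
    train = [i for i in range(10) if i not in fold]
    # In practice: fit on `train`, evaluate on `fold`, then average the k errors.
    print(fold)
```

Each observation appears in exactly one validation fold, so averaging the k held-out errors gives an honest estimate of how the fitted model would perform on new data.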

Outline & Objectives

The course intends to convey the intuition and heuristics that underlie the evolution of data mining, machine learning, and data science from the 1990s to the present day. The target audience is MS-level practitioners who have some comfort with regression analysis.

About the Instructor

David Banks is a professor at Duke University who has taught this material in a graduate course on machine learning on multiple occasions. In 2017, he taught this short course at Kansas State University's Agricultural Statistics conference.

Relevance to Conference Goals

This short course aligns with the CSP's Theme 3: Data Science and Big Data. It will introduce people to a toolkit of methodologies, with instruction and guidance on when and why to use these tools, and what issues may arise. Attendees will learn statistical methods that should help them to advance in their analytical careers.

Software Packages

No specific software will be taught. Most of the methods discussed have implementations in R, Matlab, and (sometimes) SAS.

PS1 Poster Session 1 and Opening Mixer
Thu, Feb 15, 5:30 PM - 7:00 PM

1 Some Dimension Reduction Strategies for the Analysis of Survey Data
Jiaying Weng, University of Kentucky
2 Perl-compatible regular expressions as a tool to abstract semi-structured electronic health records
Samantha Emily Montag, Northwestern University
3 Collaborative Process to Efficiently Produce Publications in Multicenter Research
Cody S Olsen, University of Utah, Department of Pediatrics
4 Developing a Comprehensive Personal Plan for Teleworking (Working Remotely)
Julia Lull, Janssen Research & Development, LLC
5 A Decision Tool for Causal Inference and Observational Data Analysis Methods in Comparative Effectiveness Research (DECODE CER)
Douglas Landsittel, University of Pittsburgh
6 Thank You, Come Again: Modeling Repeat Purchase Behavior for Business Travelers
Diag D Davenport, Georgetown University
7 Wavelet Based Methods for Data-Driven Monitoring
Achraf Cohen, University of West Florida
8 A Simulation Study of Violations of the Local Independence Assumption in Latent Class Analyses
Michael P Chen, U.S. Centers for Disease Control and Prevention
9 Impact of linear regression predictor omission on estimation and inference
Julia L Sharp, Colorado State University
10 Improving predictive models when using imperfect data: the use of multiple imputation to correct for the effect of a missing at random missing data mechanism on predictive model performance measures.
Taron Dick, West Virginia University
11 A Comparison of Standard Logistic Regression, Multilevel Modeling, Robust Error Estimation, and Exposure Simulation for Data Containing Quasi-Berkson Error
Angelique Liddell Zeringue, Mercy Healthcare
12 Statistical Modeling for Repeated Measures in Rubber Research
Wenzhao Yang, The Dow Chemical Company
13 Combining Historical Data and Propensity Score Methods in Observational Studies to Improve Internal Validity
Miguel Marino, Oregon Health & Science University
14 Marketing Communication Channel Preference Optimization using a two-stage statistical modeling
Hongying Yang, Statistical consultant
15 Using accessible patient data to individualize sample timing for pharmacokinetic studies
Matthew Stephen Shotwell, Vanderbilt University Medical Center
16 Limitations of propensity score methods: demonstration using a real-world example
Gregory B Tallman, Oregon State University/Oregon Health & Science University
17 Effect size measures for nonlinear count regression models
Stefany Coxe, Florida International University
18 Appropriate dimension reduction for sparse, high-dimensional data using Intensity plots and other visualizations
Eugenie Jackson, West Virginia University
19 Navigating Large-Scale Forest Plots Using R and Shiny
Steele Valenzuela, Oregon Health & Science University
20 Ranked-Choice Voting R Package
Jay Lee, Reed College
Exhibits Open
Thu, Feb 15, 5:30 PM - 7:00 PM

Friday, February 16
Fri, Feb 16, 7:30 AM - 5:30 PM

Continental Breakfast
Fri, Feb 16, 7:30 AM - 8:30 AM

Exhibits Open
Fri, Feb 16, 7:30 AM - 6:30 PM

GS1 Keynote Address
Fri, Feb 16, 8:00 AM - 9:00 AM

8:05 AM Reflections on Career Opportunities and Leadership in Statistics
Lisa LaVange, Center for Drug Evaluation and Research, US Food and Drug Administration
CS01 #LeadWithStatistics
Fri, Feb 16, 9:15 AM - 10:45 AM

9:20 AM Elegant Influence: Powerful Persuasion without the Push
Colleen Mangeot, Cincinnati Children's Hospital
10:45 AM Developing and Delegating: Two Key Strategies to Master as a Technical Leader
Diahanna L Post, Nielsen, Columbia University
CS02 Practical Considerations for Modeling
Fri, Feb 16, 9:15 AM - 10:45 AM

9:20 AM Evaluating Model Fit for Predictive Validity
Katherine M. Wright, Northwestern University
10:05 AM Flexible Modelling and Experimental Design Strategies
Timothy E. O'Brien, Loyola University Chicago
CS03 Text Analytics Applications
Fri, Feb 16, 9:15 AM - 10:45 AM

9:20 AM Approachable, interpretable tools for mining and summarizing large text corpora in R
Luke W. Miratrix, Harvard University
10:05 AM Latent Dirichlet Allocation Topic Models Applied to the Center for Disease Control and Prevention’s Grant
Matthew Keith Eblen, Centers for Disease Control and Prevention
CS04 Working with Messy Data
Fri, Feb 16, 9:15 AM - 10:45 AM

9:20 AM Practical Time-Series Clustering for Messy Data in R
Jonathan Robert Page, University of Hawaii Economic Research Organization (UHERO)
10:05 AM Doing Data Linkage: A Behind-the-Scenes Look
Clinton J Thompson, National Center for Health Statistics, CDC
CS05 Collaboration Essentials
Fri, Feb 16, 11:00 AM - 12:30 PM

11:05 AM Asking Great Questions
Eric Vance, LISA--University of Colorado Boulder
11:50 AM Listening, Summarizing, and Paraphrasing
Heather Smith, Cal Poly
CS06 Bayesian Applications
Fri, Feb 16, 11:00 AM - 12:30 PM

11:05 AM Bayesian Inference for Stochastic Processes
Lyle David Broemeling, University of Texas MD Anderson Cancer Center
11:50 AM Forecasting Periodic Accumulating Processes with Semiparametric Distributional Regression Models and Bayesian Updates
Harlan D. Harris, WayUp
CS07 Exploring Big Data
Fri, Feb 16, 11:00 AM - 12:30 PM

11:05 AM Exploratory data structure comparisons by use of Principal Component Analysis
Anne Helby Petersen, Biostatistics, University of Copenhagen
11:50 AM Tools for Exploratory Data Analysis
Wendy L Martinez, U.S. Bureau of Labor Statistics
CS08 Streamlining Your Work using Apps
Fri, Feb 16, 11:00 AM - 12:30 PM

11:05 AM Mechanizing Clinical Review Processes with R Shiny for Efficiency and Standardization
Jimmy Wong, Food and Drug Administration
11:50 AM Building Shiny Apps: With Great Power Comes Great Responsibility
Jessica Minnier, Oregon Health & Science University
Lunch (on own)
Fri, Feb 16, 12:30 PM - 2:00 PM

CS09 Presenting & Storytelling
Fri, Feb 16, 2:00 PM - 3:30 PM

2:05 PM How to Give a Really Awful Presentation
Paul Teetor, William Blair & Co
2:50 PM Telling the Story of Your Stats
Jennifer H Van Mullekom, Virginia Tech
CS10 Propensity Scores & Resampling Methods
Fri, Feb 16, 2:00 PM - 3:30 PM

2:05 PM A Streamlined Process for Conducting a Propensity Score-based Analysis
John A Craycroft, University of Louisville
2:50 PM Resampling methods for statistical inference on multi-rater kappas
Chia-Ling Kuo, University of Connecticut Health
CS11 Data Mining Algorithms
Fri, Feb 16, 2:00 PM - 3:30 PM

2:05 PM Stochastic gradient boosting on distributed data
Roxy Cramer, Rogue Wave Software
2:50 PM Deep Neural Networks for Scalable Prediction
Lynd Bacon, Loma Buena Assoc./Notre Dame Univ./Northwestern Univ.
CS12 Visualizing Data to Help Manage & Communicate
Fri, Feb 16, 2:00 PM - 3:30 PM

2:05 PM Powering Crisis Communications Using Data Visualization -- A Real Life Example
Eric C Newburger, Subject Matter
2:50 PM The Life-cycle of a Project: Visualizing Data from Start to Finish
Nola du Toit, NORC at the University of Chicago
CS13 Managing Up
Fri, Feb 16, 3:45 PM - 5:15 PM

3:50 PM What does it take for an organization to make difficult information-based decisions? Using the Oregon Department of Forestry’s RipStream project as a case study
Jeremy Groom, Groom Analytics
4:35 PM Statistics for Management of an Organization
Joyce Nilsson Orsini, Fordham University Graduate School of Business
CS14 Working With Healthcare Data
Fri, Feb 16, 3:45 PM - 5:15 PM

3:50 PM Application of Support Vector Machine Modeling and Graph Theory Metrics for Disease Classification
Jessica Michelle Rudd, Kennesaw State University
4:35 PM ASSESSING CORRESPONDENCE BETWEEN TWO DATA SOURCESAssessing Correspondence Between Two Data Sources Across Categorical Covariates With Missing Data – Application To Electronic Health Records
Emile Latour, Oregon Health & Science University
CS15 Statisticians Teaching
Fri, Feb 16, 3:45 PM - 5:15 PM

3:50 PM Should I bring a basket of fish or some fishing poles?
Kathy Hall, Hewlett Packard
4:35 PM Engaging Undergraduates in Statistical Consulting
Christina Phan Knudson, University of St. Thomas
CS16 Novel Applications of Data Visualization
Fri, Feb 16, 3:45 PM - 5:15 PM

3:50 PM Warranty/Performance Text Exploration for Modern Reliability
Scott Lee Wise, SAS Institute, Inc.
4:35 PM Improving the Data Customer’s Ability to Visualize Historical Agricultural Data at the National Agricultural Statistics Service
Irwin Anolik, USDA-NASS
PS2 Poster Session 2 and Refreshments
Fri, Feb 16, 5:15 PM - 6:30 PM

1 Data, data everywhere …, but mind the disclaimers: benefits and costs of matching large cohorts to individual US mortality case data in the NDI, SSA Death Master File (DMF/SSDI), and more
Sigurd Wilson Hermansen, Westat
2 Curating and visualizing big data from wearable activity trackers
Meike Niederhausen, OHSU-PSU School of Public Health
3 Consensus Strategy for Variable Selection in Clinical Prediction Rule Development
Miriam R Elman, OHSU/OSU College of Pharmacy
4 Reproducible research implemented through version control systems
Lillian S Lin, Montana State University
5 The Boeing Applied Statistics ToolKit: Best Practices and Tools for Collaboration and Reproducibility in High Throughput Consulting
Robert Michael Lawton, Boeing Research & Technology
6 Empirical comparisons of differential expression analysis pipelines for RNA-sequencing data
Lina Gao, Biostatistics Shared Resource (OHSU BSR); Biostatistics and Bioinformatics Unit (ONPRC BBU)
7 A practical guide for modeling length of stay with focus on right skewness and zero inflation
Lizhou Nie, Stony Brook University
8 Nonparametric estimation of time-variant quantiles and statistical models
Jessica Michelle Rudd, Kennesaw State University
9 Estimating the Relative Excess Risk Due to Interaction in Clustered Data Settings
Katharine Fischer Berry Correia, Harvard T.H. Chan School of Public Health
10 Bayesian Inference for Dependent ? Statistics
Pin Li, Department of Biostatistics, University of Michigan
11 Spatial Analysis of Fukushima Thyroid Ultrasound Examination Survey Data
Emerson H Webb, Reed College
12 A Growth Reference for mid Upper Arm Circumference for Age among School Age Children and Adolescents, with Validation for Mortality in Two Cohorts
Lazarus K Mramba, University of Florida
13 Machine Learning Methods for Predicting Zygosity
Ally Rochelle Avery, Washington State University
14 Simulating real-world data with time-varying variables
Maria Emilia de Oliveira Montez-Rath, Stanford University
15 Evaluating the Effectiveness of the Flipped Classroom Model Using Structural Equation Modeling
Shan Wang, Assistant Professor
16 Predictive Modeling to Reduce Hospital Readmissions
Suzanne Ryan, Capital District Physicians' Health Plan
17 Software for Covariate Specification in Linear, Logistic, and Survival Regression
Sai Liu, Stanford University
18 Exploratory Analyses from Different Forms of Interactive Visualizations
Lata Kodali, Virginia Tech
19 Using SAS Programming to Create Complex Paneled Graphs from Electronic Health Records
Carrie Tillotson, OCHIN, Inc.
20 An algorithm to identify family linkages using electronic health record data
Megan Hoopes, OCHIN, Inc.
Saturday, February 17
Sat, Feb 17, 7:30 AM - 2:30 PM

Exhibits Open
Sat, Feb 17, 7:30 AM - 1:00 PM

PS3 Poster Session 3 and Continental Breakfast
Sat, Feb 17, 8:00 AM - 9:15 AM

1 Thematic Feature Selection for Research Support
Thealexa Becker, Federal Reserve Bank of Kansas City
2 The Case For Nearliers - A New Method For Sampling at a Significantly Lower Cost
Jeffry N. Savitz, SavitzConsulting, LLC
3 Systematizing your Statistical Consulting Practice
Terrie Vasilopoulos, University of Florida, College of Medicine
4 16 Personalities at Work
Katherine Eleanor Tranbarger Freier, Intel Corporation
5 Re-examining sick quitter hypothesis on association of alcohol consumption with coronary heart disease
Amy Z. Fan, National Institutes of Health
6 Comparisons of propensity score analysis for analyzing rare binary outcome
Jihye Park, Stony Brook University
7 Understanding Graduate School Speed-dating with Generalized Linear Mixed Models
Christina Phan Knudson, University of St. Thomas
8 Data modelling to mitigate the impact of missing data in a longitudinal study of injecting drug users.
Tania Amanda Patrao, University of Queensland, Australia
9 Application of Bayesian Spatial Statistics in Archaeology: Implementation of CAR Prior to Analyzing Archaeological Statistical Records of the Inca Empire
Anastasiya Travina, The University of Texas at Austin
10 Multivariate Statistical Analysis in Plastic Foam Research
Wenyu Su, The Dow Chemical Company
11 Win Ratio Application for a Composite Outcome in a Randomized Cardiovascular Trial
Rose A Hamershock, TIMI Study Group
12 Treatment Decision in Ischemic Cardiomyopathy: Causal Inference Using Random Survival Forests.
Min Lu, University of Miami
13 Statistical Analysis of Network Change
Teresa Danielle Schmidt, Portland State University
14 Exploring Data Quality and Time Series Event Detection in 2016 US Presidential Election Polls
Kaelyn M. Rosenberg, Reed College
15 Understanding and Using Ordinal Factor Analysis
Nivedita Bhaktha, The Ohio State University
16 CovTest: An R Package for Covariance Matrix Testing with Applications to High-dimensional Data
Ben Joseph Barnard, Baylor University
17 Developing tidy tools for cross-functional teams
Emily Riederer, Capital One
18 An Easy-to-use SAS® Macro for a Descriptive Statistics Table with P-values
Yuanchao Zheng, Stanford University
19 Animated Data Visualization with Plotly: Useful Tool for Healthcare Quality Improvement
Eric A. Tesdahl, SpecialtyCare, Inc.
CS17 Passion for Statistics
Sat, Feb 17, 9:15 AM - 10:45 AM

9:20 AM Am I supposed to enjoy my job? Career observations from a biostatistician
Daniel Thomas Cotton, Boehringer Ingelheim Pharmaceuticals
10:05 AM Statistics in the Wild: Practicing statistics in nontraditional places from a tiny island in the Pacific to the Federal Cabinet
Heather Krause, Datassist
CS18 Survival Analysis v. "Survival" Analysis
Sat, Feb 17, 9:15 AM - 10:45 AM

9:20 AM “How long would you wait?” -- Using Time-to-Event (Survival) Analysis to Explore Waiting Times
Ruth Hummel, SAS Institute
10:05 AM Statistical Methods for National Security Risk Quantification and Optimal Resource Allocation
Robert Brigantic, Pacific Northwest National Laboratory
CS19 Business Intelligence Applications
Sat, Feb 17, 9:15 AM - 10:45 AM

9:20 AM Business Intelligence (BI) Reporting Solution: From Source to Nuts
Lisa Wood, University of Michigan
10:05 AM Location Analytics: An application of GIS
Yue Fang, CEIBS
CS20 Understanding Populations
Sat, Feb 17, 9:15 AM - 10:45 AM

9:20 AM Quantifying Populations in Proximity to Oil and Gas Development: A National Spatial Analysis and Review
Tanja Srebotnjak, Harvey Mudd College
10:05 AM Approaches and techniques for estimating the total number of species in a population: with emphasis on application to mineral species
Grethe Hystad, Purdue University Northwest
CS21 Developing Communication Skills
Sat, Feb 17, 11:00 AM - 12:30 PM

11:05 AM How to communicate statistics and how statisticians should communicate?
Achim Guettner, Novartis Pharma
11:50 AM Communication Skills for Statisticians: What's Next?
Janice Derr, Retired
CS22 Small Sample Sizes & Non-Probability Sampling
Sat, Feb 17, 11:00 AM - 12:30 PM

11:05 AM Quantifying and incorporating sources of variability and uncertainty in statistical analyses with very small sample sizes
Annette M Bachand, Ramboll Environ
11:50 AM Non-Probability Sampling - Wave of the Future in Survey Research?
Karol Krotki, RTI
CS23 Data Science Applications
Sat, Feb 17, 11:00 AM - 12:30 PM

11:05 AM Recent Advances on the Analysis and Detection of Communities in a Network
Frederick Kin Hing Phoa, Institute of Statistical Science, Academia Sinica
11:50 AM Firehose Data Science: Real-Time Analytics of Twitter Feeds
David Corliss, Ford Motor Company
CS24 Causal Inference
Sat, Feb 17, 11:00 AM - 12:30 PM

11:05 AM Causal Inference with Multilevel Data Structures
Luke Keele, Georgetown
11:50 AM Personalized Treatment, Uplift Modeling and Counterfactuals: Birds of a Feather
Herbert Ira Weisberg, Causalytics LLC
Lunch (on own)
Sat, Feb 17, 12:30 PM - 2:00 PM

PCD1 Deploying Quantitative Models as "Visuals" in Popular Data Visualization Platforms
Sat, Feb 17, 2:00 PM - 4:00 PM
Instructor(s): Daniel Fylstra, Frontline Systems Inc.
Data visualization and business intelligence tools such as Tableau and Power BI have become extremely popular in recent years. Tableau reports that over 90% of Fortune 500 companies are now customers, while Microsoft reports that over 200,000 organizations of all sizes are using Power BI. These tools currently offer easy-to-use access to many data sources, powerful facilities for "slicing and dicing" data, and rich, flexible data visualization, but only limited built-in analytics methods.

A new avenue has emerged in the past year for extending analytics methods in both Tableau and Power BI, and it provides a new way for an analyst to develop quantitative models outside these platforms, then deploy them as 'visuals' inside Tableau and Power BI, in 'dashboards' which are often published for use by thousands of users in an organization. Though originally conceived as a way to extend the range of visualization styles, these components can perform arbitrary computations on data before it is rendered in visual form.

In this session, Excel Solver developer Frontline Systems, one of the first to explore this new avenue, will demonstrate use of its tools to automatically convert existing quantitative models into 'visuals' for both Tableau and Power BI. Among other options, this enables an analyst to convert a predictive (data mining, machine learning) or prescriptive (optimization, simulation) model from Microsoft Excel into an easily deployed 'visual' with just two mouse clicks. No programming is required, but the ability to extend models using high-level RASON modeling language code or programming language code is available. These 'visuals' are full-fledged models that easily connect to any Tableau or Power BI data source, and re-solve the underlying problem whenever the data sources are refreshed.

PCD2 Handling Missing Data Using Multiple Imputation
Sat, Feb 17, 2:00 PM - 4:00 PM
Instructor(s): Yulia Marchenko, StataCorp LLC
This workshop will cover the use of Stata to perform multiple-imputation analysis. Multiple imputation (MI) is a simulation-based technique for handling missing data. The course will provide a brief introduction to multiple imputation and will demonstrate how to perform multiple imputation in Stata. The three stages of MI (imputation, completed-data analysis, and pooling) will be discussed with accompanying Stata examples. Imputation using multivariate normal (MVN) and using chained equations (MICE, FCS) will be discussed. A number of examples demonstrating how to efficiently manage multiply imputed data within Stata will also be provided. Linear and logistic regression analysis of multiply imputed data as well as several postestimation features will be presented. No prior knowledge of Stata is required, but basic familiarity with multiple imputation will prove useful.
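The pooling stage mentioned above combines the per-imputation estimates using Rubin's rules: the pooled estimate is the mean of the completed-data estimates, and the total variance adds the within-imputation variance to an inflated between-imputation variance. As a minimal sketch (in plain Python rather than Stata, for a scalar estimand; `pool_rubin` is an invented name, not a Stata command):

```python
def pool_rubin(estimates, variances):
    """Combine m completed-data results via Rubin's rules.

    estimates: point estimates from each of the m imputed data sets
    variances: the corresponding squared standard errors
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    qbar = sum(estimates) / m                  # pooled point estimate
    w = sum(variances) / m                     # within-imputation variance
    b = sum((q - qbar) ** 2 for q in estimates) / (m - 1)  # between-imputation
    t = w + (1 + 1 / m) * b                    # total variance
    return qbar, t

# Toy numbers: three imputations of one regression coefficient
est, var = pool_rubin([1.0, 1.2, 0.8], [0.04, 0.05, 0.045])
```

The `(1 + 1/m)` factor inflates the between-imputation component to account for using a finite number of imputations; in Stata this pooling is what `mi estimate` performs automatically.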

T1 Engage the Room: Mastering Your Personal Presentation Style
Sat, Feb 17, 2:00 PM - 4:00 PM
Instructor(s): Duncan Burl Gilles, Art of Problem Solving
As confident as we may be in the quality of our work, presentation can make or break the impact it has. Engaging the room and communicating clearly can make the difference between an unimpressed, bored audience and a thrilled audience eager to learn more. This course will focus on presentation techniques that help you communicate your ideas effectively and in an engaging manner. You’ll be trained on ways to draw your audience into your talk, engage them in active listening and thinking, and use your voice and the space of the room to command attention and convey your message. These skills are applicable in many areas – presenting your work to clients, teaching in the classroom, one-on-one interviews or discussions, and even CSP talks! After the talk, participants will have the chance to send a short video of a talk to the presenter for review and feedback.

Outline & Objectives

The primary objective of this course is to help participants understand some of the qualities of captivating speakers, and to know how they can develop these qualities in themselves. There are many ways to engage an audience, so we’ll also discuss how to utilize your personal qualities and strengths to engage a room. Participants will also receive some direct feedback, either by volunteering to give a short talk in front of the group or sending a video to the presenter afterwards.

This course is applicable to anyone who wishes to improve their presentation and public speaking skills, both in front of groups and one-on-one. Specifically, the course will be discussing:
1) Knowing your strengths
2) Knowing your audience
3) Engaging your audience
4) Taking advantage of space
5) Using presentation aids wisely
6) (Time Permitting) Presentations from the group

About the Instructor

Duncan Gilles (MS) has been a teacher, faculty manager and teacher trainer for over a decade. For 5 years he managed and trained Kaplan Test Prep’s SAT, GRE, GMAT, LSAT and MCAT faculty in the New England Region and currently manages the teacher pool at the Art of Problem Solving – an online math school. He has experience training and giving feedback to presenters in multiple environments – physical classes, online video-based classes, online text-based classes and one-on-one meetings. He’s been recognized as an Elite Teacher for Kaplan, and has trained or provided feedback for over 200 presenters/teachers in the course of his work.

Relevance to Conference Goals

While we hope that our work will stand on its own, how we present our work when engaging with clients and customers can have a big effect on how well it is received. This course will help participants develop their communication skills, giving them immediately actionable tips on how to better engage with their audience. In addition, participants will have a chance to get personalized feedback on their own work, either by giving a short talk at the conference itself, or sending in a video to the presenter afterwards. In addition to affecting their personal presentations, this course will enable participants to be better representatives of their organizations.

Software Packages


T2 Applying Propensity Score Methods to Observational Studies Using R and SAS
Sat, Feb 17, 2:00 PM - 4:00 PM
Instructor(s): Haiyan Bai, University of Central Florida; Wei Pan, Duke University
Observational studies are common in applied settings but pose threats to the validity of causal inference due to selection bias in the data. Propensity score methods have been increasingly used as a means of reducing selection bias to enhance the causal claims. A training course on the application of propensity score methods to observational studies using commonly used statistical software would be beneficial for applied statisticians and researchers to improve the quality of their observational studies. With this objective, the proposed course will introduce basic concepts and practical issues of propensity score methods, including matching, stratification, and weighting; the instructors will facilitate hands-on activities of applying propensity score methods to observational studies with real-world examples using R and SAS. No prior knowledge of propensity score methods or computer programming is required. Participants are encouraged to bring their own laptop computers for hands-on activities.

Outline & Objectives


(1) Mini-lecture on:
• Basic concepts of propensity score methods.
• Various strategies of outcome analysis after matching, stratification, and weighting.
• Issues and developments in propensity score methods.

(2) Demonstration and hands-on activities:
• Demonstrating propensity score matching, stratification, and weighting using R and SAS with real-world data.
• Evaluating covariate balance for propensity score matching.
• Estimating treatment effects after matching, stratification, and weighting.
• Interpreting the results from statistical software for propensity score methods.


(a) Understand the basic concepts and practical issues of propensity score methods.
(b) Discuss why, when, and how to apply propensity score methods in applied settings.
(c) Understand the limitations of propensity score methods.
(d) Learn how to use R and SAS for propensity score methods on observational data.
(e) Interpret the results of propensity score methods.
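One of the matching strategies listed above is nearest-neighbor matching on the estimated propensity score, typically with a caliper to discard poor matches. The course uses R and SAS; the following is only a toy Python sketch of the greedy variant, with all names (`greedy_match`, the unit ids) invented for illustration:

```python
def greedy_match(treated, control, caliper):
    """Greedy 1:1 nearest-neighbor matching on the propensity score.

    treated, control: dicts mapping unit id -> estimated propensity score.
    Each treated unit takes the closest still-unused control unit;
    pairs farther apart than the caliper are discarded.
    Returns a list of (treated_id, control_id) pairs.
    """
    available = dict(control)
    pairs = []
    # Match treated units in descending score order (a common heuristic,
    # since high-score units have the fewest comparable controls)
    for t_id, t_ps in sorted(treated.items(), key=lambda kv: -kv[1]):
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_ps))
        if abs(available[c_id] - t_ps) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]          # matching without replacement
    return pairs

pairs = greedy_match({"t1": 0.80, "t2": 0.30},
                     {"c1": 0.78, "c2": 0.32, "c3": 0.55},
                     caliper=0.1)
```

Here "t1" pairs with "c1" and "t2" with "c2", while "c3" goes unused; after matching, covariate balance between the paired groups should still be checked, as the outline's demonstration segment emphasizes.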

About the Instructor

Dr. Wei Pan is Associate Professor and Director of the Research Design and Statistics Core at Duke University School of Nursing. Propensity score methods are one of his major research interests. He has published and presented numerous articles on propensity score methods in the past 10 years. Dr. Haiyan Bai is Associate Professor of Quantitative Research Methodology at the University of Central Florida. Her research areas include propensity score methods, resampling methods, research design, and measurement and evaluation. She has published many journal articles on propensity score methods in the past 10 years. Both Drs. Pan and Bai have provided more than 10 professional workshops and training courses on propensity score methods at national conventions, such as annual meetings of the American Statistical Association, the American Public Health Association, the American Evaluation Association, and the American Educational Research Association. They also recently published a book entitled, "Propensity Score Analysis: Fundamentals and Developments."

Relevance to Conference Goals

In this course, the participants will be able to understand why, when, and how to apply propensity score methods to observational studies in applied settings and implement propensity score methods in R and SAS. Through step-by-step hands-on activities on real-world examples or their own research data, participants will be able to produce actual analysis results with graphical and statistical presentations and learn how to interpret them. This course will benefit participants who apply propensity score methods using statistical software as a best practice in statistical analysis, design, and consulting. This course will also increase participants’ overall analytical capacities to improve the quality of observational studies in applied settings. This course is appropriate for applied statisticians, researchers, and scientists. It provides opportunities for them to enhance their career development.

Software Packages

R and SAS (or SAS University Edition, which along with R is free to the public).

T3 A Workshop on Validation of Discrete Response Statistical Models
Sat, Feb 17, 2:00 PM - 4:00 PM
Instructor(s): Raul Eduardo Avelar Moran, Texas A&M Transportation Institute
Count models are widely used to analyze discrete data in various fields. When the intent of the analysis is prediction, model validation is an important step before the model can be offered with confidence to final users. This tutorial will discuss when and why to validate, and will demonstrate model validation techniques specific to discrete response models, such as Poisson and Negative Binomial Generalized Linear Regression Models.

Outline & Objectives

The objectives of the tutorial are:
1. Participants will acquire working knowledge of validation of discrete response models for various applications.
2. Participants will learn to apply three different validation techniques appropriate for discrete response models.
The tutorial will provide general background and motivation for model validation. A brief description of three validation techniques will be provided. The techniques will be demonstrated using a sample data set. Finally, the tutorial will offer time for discussion and questions.
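Which three techniques the tutorial demonstrates is not stated above, but a common building block when validating count models is comparing candidate models by their deviance on held-out data. As an illustrative sketch only (not taken from the tutorial; `poisson_deviance` and the toy numbers are invented):

```python
import math

def poisson_deviance(y, mu):
    """Total Poisson deviance of observed counts y against predicted
    means mu: D = 2 * sum(y*log(y/mu) - (y - mu)), with the convention
    that the log term is 0 when y == 0. Smaller is better.
    """
    d = 0.0
    for yi, mi in zip(y, mu):
        term = yi * math.log(yi / mi) if yi > 0 else 0.0
        d += 2.0 * (term - (yi - mi))
    return d

# Validation sketch: score two candidate models on held-out counts
y_hold = [3, 0, 5, 2]
mu_a = [2.8, 0.4, 4.6, 2.1]   # candidate model A (well calibrated)
mu_b = [1.0, 2.0, 1.0, 4.0]   # candidate model B (poorly calibrated)
```

On these toy numbers model A has the smaller held-out deviance; the same comparison applies to Negative Binomial models with the appropriate deviance formula.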

About the Instructor

Dr. Avelar is an Associate Research Engineer at the Texas A&M Transportation Institute (TTI). His areas of expertise include: transportation safety, roadway and pedestrian operations, data management and processing, and statistical modeling.
Since initially joining TTI in 2012 as a Post-Doctoral Research Scientist, Dr. Avelar has been involved in more than 40 transportation-related research projects for state and federal sponsors in the United States. His role in the vast majority of these projects has been as a transportation statistician, with main responsibilities for data management and statistical analyses.
Dr. Avelar is the recipient of the 2016 Patricia F. Waller Paper Award from the Transportation Research Board (TRB) for best paper in the area of safety and system users; the 2016 Outstanding Paper Award from the TRB Committee on Safety Data, Analysis and Evaluation ANB20; the 2015 D. Grant Mickle Award from TRB for best paper in operations and maintenance; and the 2014 Outstanding Paper Award from the TRB Committee on Pedestrians ANF10.

Relevance to Conference Goals

Data Science and big data rely on discrete response models to develop predictions. Although there is generally good understanding of why and how to use discrete response predictive models, validation of these models is often underrated or skipped altogether. This tutorial will make the case for why we validate and will provide tools to perform validation for discrete response models.

Software Packages

R, R Studio.

T4 Tools for Connecting R, SAS, and Stata to Word: A Practical Approach to Reproducibility
Sat, Feb 17, 2:00 PM - 4:00 PM
Instructor(s): Abigail S Baldridge, Northwestern University; Leah J Welty, Northwestern University
Reproducibility, wherein data analysis and documentation are sufficient so that results can be recomputed or verified, is an increasingly important component of statistical practice. “Weaving” tools such as R Markdown facilitate reproducibility by combining narrative text and analysis code in one plain-text document, but are of limited use when manuscripts or reports must be generated in MS Word (e.g. due to journal requirements or client preference). This course will: (1) summarize how weaving tools create Word documents, and the ensuing limitations; and (2) introduce an alternate approach using recently released StatTag software. StatTag is a free, open-source program that embeds results (values, tables, figures, or verbatim output) from R, SAS, or Stata directly in Word such that they can be automatically updated if code or data changes. This course is intended for a broad audience; prerequisites are experience preparing documents in Word and conducting analysis in any one of R, SAS, or Stata. The workshop will provide practical, hands-on examples drawn from R, SAS, and Stata, and will include an overview of weaving approaches as well as an introduction to StatTag.

Outline & Objectives

1. Introduction to reproducibility
2. Overview of “weaving” tools, worked examples
3. Practical limitations of “weaving” tools with Word
4. An alternate approach using StatTag
5. StatTag instructions and features
6. Hands-on exercises using StatTag to connect Word with R, SAS, and Stata
Provide an overview of reproducibility, focusing on the practical challenges of preparing manuscripts or reports in Word.
Demonstrate how “weaving” tools may be used to generate a Word document. Illustrate the limitations of this approach when documents are subsequently edited in Word.
Present an alternate approach using StatTag software. Provide participants the knowledge and skills to use StatTag to embed values, tables, figures, or raw output from R, SAS, or Stata in Word documents so that: (1) statistical results may be automatically updated if data or models change; (2) the Word document may be edited and formatted as usual without losing the connection to the statistical code.
Provide sample StatTag files for R, SAS, and Stata.
Support a more robust research process by eliminating the need for statistical output to be copied and pasted into Word documents or reports.

About the Instructor

Leah J. Welty, PhD, Associate Professor in the Department of Preventive Medicine-Biostatistics at Northwestern University, directs the Biostatistics Core Resources within the Northwestern University Clinical and Translational Sciences Institute. She is also the president of the Association of Clinical and Translational Statisticians. She has led the development of StatTag, and in addition has delivered 8 invited talks and published one manuscript on reproducible research.

Abigail S. Baldridge, MS, Biostatistician in the Department of Preventive Medicine at Northwestern University, is a statistician and project manager for several studies in cardiovascular epidemiology in addition to helping develop and test StatTag. She teaches an MPH course entitled Programming for Statistical Analysis.

Both instructors have extensive, first-hand experience trying to conduct reproducible research while preparing manuscripts in Word: not only do all of their collaborators prefer Word’s familiarity and editing features, but many of the medical journals they publish in require or strongly recommend that manuscripts be submitted in Word.

Relevance to Conference Goals

This workshop was designed to address an important challenge for applied statisticians: how to ensure reproducibility when circumstances require using Word to prepare and edit reports or manuscripts.
Participants will enhance their professional skills by learning how to integrate document preparation in Word with statistical analysis: no longer will a correction to a dataset or a change in model parameters entail re-copying results into a Word document. The workshop will present two approaches, "weaving" and StatTag, with particular focus on the latter, which was recently released and works with R, SAS, or Stata.
The tools presented encourage collaboration and open communication in addition to reproducibility. For example, StatTag provides a link between the statistical results presented in a Word document and the statistical code and data that generated them: double-clicking on any “tagged” result in the Word document pulls up a dialog box displaying the statistical code that created the result. Statisticians and their collaborators may work separately on statistical code and Word documents, but use StatTag to maintain connections between the two.

Software Packages

Participants will need a computer running Windows or macOS with Microsoft Word (2010 or higher for Windows, 2016 for macOS) and at least one of R 3.2 or higher, Stata 14 or higher, or SAS 9.4 or higher. They will need to be able to download and install software (on Windows, this requires administrator rights).
The workshop will use StatTag software, which is free, open source, and publicly available.

GS2 Closing General Session
Sat, Feb 17, 4:15 PM - 5:30 PM
The Closing Session is an opportunity for you to interact with the CSP Steering Committee in an open discussion about how the conference went and how it could be improved in future years. CSPSC vice chair, Eric Vance, will lead a panel of committee members as they summarize their conference experience. The audience will then be invited to ask questions and provide feedback. The committee highly values suggestions for improvements gathered during this time. The best student poster will also be awarded during the Closing Session, and each attendee will have an opportunity to win a door prize.