Thursday, February 14
Thu, Feb 14
7:00 AM - 6:30 PM
3rd Floor Registration Counter S
Registration
Registration
Thu, Feb 14
8:00 AM - 5:30 PM
Commerce
Instructor(s): Frank Harrell, Vanderbilt University
All standard regression models have assumptions that must be verified for the model to have power to test hypotheses and for it to be able to predict accurately. Of the principal assumptions (linearity, additivity, distributional), this course will emphasize methods for assessing and satisfying the first two. Practical but powerful tools are presented for validating model assumptions and presenting model results. This course provides methods for estimating the shape of the relationship between predictors and response using the widely applicable method of augmenting the design matrix using restricted cubic splines. Even when assumptions are satisfied, overfitting can ruin a model's predictive ability for future observations. Methods for data reduction will be introduced to deal with the common case where the number of potential predictors is large in comparison with the number of observations. Methods of model validation (bootstrap and cross-validation) will be covered, as will auxiliary topics such as modeling interaction surfaces, variable selection, overly influential observations, collinearity, and shrinkage, and a brief introduction to the R rms package for handling these problems. The methods covered will apply to almost any regression model, including ordinary least squares, logistic regression models, ordinal regression, quantile regression, longitudinal data analysis, and survival models.
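The restricted cubic spline construction described above is simple enough to sketch directly. The following Python function is a minimal illustration of Harrell's (unnormalized) restricted cubic spline basis — it is not the course's R rms materials, where rcs() does this work. With k knots it produces k−2 columns that, together with x itself, give a fit that is cubic between knots and constrained to be linear beyond the outer knots.

```python
def rcs_basis(x_values, knots):
    """Restricted cubic spline basis (Harrell's unnormalized form).

    For k knots t_1 < ... < t_k this returns, for each x, the k-2
    nonlinear basis terms; together with x itself they describe a
    curve that is cubic between knots but linear beyond the outer
    knots (and exactly zero below the first knot).
    """
    t = sorted(knots)
    k = len(t)

    def pos3(u):                      # truncated cubic (u)_+^3
        return max(u, 0.0) ** 3

    rows = []
    for x in x_values:
        row = []
        for j in range(k - 2):
            term = (pos3(x - t[j])
                    - pos3(x - t[k - 2]) * (t[k - 1] - t[j]) / (t[k - 1] - t[k - 2])
                    + pos3(x - t[k - 1]) * (t[k - 2] - t[j]) / (t[k - 1] - t[k - 2]))
            row.append(term)
        rows.append(row)
    return rows
```

Augmenting a design matrix with these columns and fitting by ordinary least squares (or any other regression routine) estimates the shape of the predictor–response relationship without assuming linearity.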
Outline & Objectives
1. Introduction; Advantages of prediction over classification
2. Hypothesis Testing vs. Estimation vs. Prediction vs. Classification
3. How Many Degrees of Freedom does a Data Mining Procedure Actually Have?
4. Regression Model Notation
5. Model Formulations
6. Interpreting Model Parameters
(a) Nominal Predictors
(b) Interactions
7. Relaxing Linearity Assumption for Continuous Predictors
(a) Categorization is not an alternative
(b) Simple Nonlinear Terms
(c) Splines for Estimating Shape of Regression Function and Determining Predictor Transformations
(d) Cubic Spline Functions
(e) Restricted Cubic Splines
(f) Choosing Number and Position of Knots
(g) Nonparametric smoothers and regression trees
(h) Advantages of Splines over Other Methods
8. Multiple Degree of Freedom Tests of Association
9. Assessment of Model Fit
(a) Regression Assumptions
(b) Modeling and Testing Interactions
10. Multivariable Modeling Strategy
(a) Why and How To Pre-specify Model Complexity
(b) Problems Caused by Ordinary Stepwise Variable Selection
(c) Collinearity
(d) Shrinkage
(e) Data Reduction
(f) Overly Influential Observations
(g) Some Useful Modeling Strategies for
i. Prediction
ii. Estimation
iii. Hypothesis Testing
11. Overview of the Bootstrap
12. Model Validation
(a) Cross-validation
(b) Bootstrap
13. Graphical Methods for Interpreting Complex Regression Fits
14. Detailed Case Studies
(a) Generalized Least Squares for Serial Data
(b) Ordinal Regression for Continuous Y: Predicting glycohemoglobin (and pre-diabetes) from body size characteristics using NHANES data
(c) Binary Logistic Regression: Survival Patterns of Passengers on the Titanic
(d) Survival Modeling
A more detailed outline is available at biostat.mc.vanderbilt.edu/rms.
About the Instructor
Dr. Harrell is Professor of Biostatistics, Founding Chair of the Department of Biostatistics of Vanderbilt University School of Medicine, and Expert Statistical Advisor, Office of Biostatistics, Center for Drug Evaluation and Research, US FDA. Prior to starting the new department in 2003 he was Chief of the Division of Biostatistics and Epidemiology in the Department of Health Evaluation Sciences, University of Virginia School of Medicine. Prior to coming to the University of Virginia in 1996 he was in the Division of Biometry at Duke University Medical Center for 17 years. He received his Ph.D. in biostatistics from the University of North Carolina, Chapel Hill in 1979, where he studied under P.K. Sen. Dr. Harrell's interests include statistical modeling and model validation, statistical computing and graphics, reproducible research, survival analysis, clinical trials, health services and outcomes research, medical diagnostic and prognostic models, bootstrapping, missing data, and Bayesian modeling. He is an associate editor of Statistics in Medicine, a member of the editorial board for American Heart Journal, a member of Faculty of 1000 Medicine, on the editorial policy board for the Journal of Clinical Epidemiology, and a member of the Scientific Advisory Board for Science Translational Medicine. For many years he has been a consultant to FDA and the pharmaceutical industry. He is author of the book Regression Modeling Strategies, Second Edition (Springer, 2015) and teaches courses in biostatistical modeling. He was the recipient of the American Statistical Association's W.J. Dixon award for excellence in statistical consulting in 2014.
Relevance to Conference Goals
This is an applied statistics course that teaches regression analysis and predictive modeling tools that have wide applicability, and should be of great value to almost all practicing statisticians.
Thu, Feb 14
8:00 AM - 5:30 PM
Royal
SC2 - Big Data, Data Science, and Deep Learning for Statisticians
Short Course (full day)
Instructor(s): Ming Li, Amazon; Hui Lin, Netlify
With the recent big data, data science, and deep learning revolution, enterprises ranging from FORTUNE 100 companies to startups across the world are hungry for data scientists and machine learning scientists who can bring actionable insight from the vast amounts of data collected. In the past couple of years, deep learning has gained traction in many application areas and has become an essential tool in the data scientist’s toolbox. In this course, students will develop a clear understanding of big data cloud platforms, technical skills in data science and machine learning, and especially the motivation and use cases of deep learning through hands-on exercises. We will also cover the “art” of data science and machine learning, guiding participants through a typical agile data science project flow, common pitfalls in data science and machine learning, and the soft skills needed to communicate effectively with business stakeholders. This course will prepare statisticians to be successful data scientists and deep learning scientists in various industries and business sectors.
Outline & Objectives
The big data platform, data science, and deep learning overviews are specifically designed for an audience with a statistics education background. The data science workflow, pitfalls, and soft skills are highlighted through real-world data science and machine learning problems. The Databricks community edition cloud platform will be used throughout the training course for hands-on sessions, including: (1) the big data platform using Spark through the R sparklyr package; (2) an introduction to Deep Neural Networks, Convolutional Neural Networks, and Recurrent Neural Networks and their applications; (3) deep learning examples using TensorFlow through the R keras package. The primary audiences for this course are: (1) statisticians in traditional industry sectors such as manufacturing, pharmaceuticals, and banking; (2) statisticians in government agencies; (3) statistical researchers in universities; (4) graduate students in statistics departments. The prerequisites are MS-level education in statistics and entry-level knowledge of R. No software installation is needed on students’ laptops; the cloud platform is easily accessed through a browser such as Chrome or Firefox with an internet connection.
About the Instructor
Both instructors have Ph.D.s in Statistics from Iowa State University and have worked in data science and machine learning for a number of years. Dr. Li is a Sr. Data Scientist at Amazon and Dr. Lin is a Data Scientist at Netlify. Before Amazon, Dr. Li was at Walmart, SAS, and GE, and he was the 2017 Chair of the Quality and Productivity Section of ASA. Dr. Lin was a leader at DuPont in applying advanced data science to enhance marketing and sales effectiveness, and she is the co-founder of the Central Iowa R User Group and a blogger at scientistcafe.com. With deep statistics backgrounds and several years of industrial experience in data science, they have trained and mentored numerous junior data scientists with diverse backgrounds. They taught a similar continuing education course, without the deep learning part, at the 2017 JSM, and they will teach similar courses at the Joint Research Conference, the ICSA Applied Statistics Symposium, and the Fall Technical Conference in 2018. Dr. Li organized and will present the Introductory Overview Lecture “Leading Data Science: Talent, Strategy, and Impact” at the 2018 JSM. Dr. Li is also an instructor at Amazon’s internal Machine Learning University.
Relevance to Conference Goals
This short course fits the conference goals very well. It focuses on big data and data science applications to real-world problems, including the new developments in deep learning. With its focus on the cloud platform, students can learn about the current data science software and big data infrastructure used by tech companies, expanding their programming scope to cover more applications in data science and machine learning. The short course also includes the soft-skill discussions needed to give students a better understanding of the data science project flow, pitfalls in machine learning, and communication skills. The course keeps statisticians' backgrounds in mind to bridge the gap between a traditional statistician and a successful data scientist. After taking the course, students will be confident to positively impact their organizations by transforming their traditional statistics teams into data science or machine learning teams, or to explore data scientist or machine learning scientist opportunities in their future career development.
Thu, Feb 14
8:00 AM - 5:30 PM
Canal
Instructor(s): Richard D. De Veaux, Williams College
This seminar is a practical introduction to and an overview of the techniques and strategies of data mining. While I will discuss the models in detail, the course will be application- rather than theory-oriented. Many of the standard techniques of data mining will be presented, including modern model selection strategies for multiple regression such as the lasso and the elastic net. In addition, we'll cover classification and regression trees, neural networks, principal components, Naïve Bayes, bagging, and boosting. The course will be problem-solving based, using real case studies from science and industry to illustrate which methods work well, when, and why. We will emphasize problem formulation, the challenges of the process, and the communication back to decision makers that is necessary to effect maximum impact in the organization. No prerequisites other than a knowledge of the basics of regression are assumed. The applications will come from a wide variety of industries and include some from my personal experience as a consultant for companies in financial services, chemical processing, pharmaceuticals, and insurance.
Outline & Objectives
Outline:
1. Introduction to data mining
a. What is data mining?
b. What are the applications?
c. How does it differ from statistics?
2. Formulating the problem
a. Data considerations
b. How to evaluate the methods
c. Testing and training
3. The methods: overview of the most commonly used algorithms
4. Case studies
a. In-depth comparisons of the methods and how they helped solve the problem
b. Challenges to communication
5. Summary
Learning Objectives
(a) Learning outcomes (performance objectives): In the process of analyzing the data sets, attendees will learn how to:
• Identify appropriate problems for data mining
• Explore and prepare data for mining
• Use a variety of techniques, including decision trees and neural nets, to build accurate predictive models
• Evaluate the quality of models
• Select the appropriate data mining tools for applications
(b) Content and instructional methods: The presentation engages participants with the material through audience participation in the data analyses.
About the Instructor
Richard De Veaux (Dick), Ph.D., is the C. Carlisle and Margaret Tippit Professor of Statistics at Williams College. He holds degrees in Civil Engineering (B.S.E., Princeton), Mathematics (A.B., Princeton), Dance Education (M.A., Stanford), and Statistics (Ph.D., Stanford), where he studied statistics with Persi Diaconis and dance with Inga Weiss. Dick has taught at the Wharton School and Princeton University and has been a visiting researcher at INRA in Montpellier and a visiting professor at Paris V. De Veaux has won numerous teaching awards from the Engineering Council at Princeton. He has won both the Wilcoxon and Shewell (twice) awards from the ASQ, is a Fellow of the ASA, and is an elected member of the ISI. In 2006-2007 he was the William R. Kenan Jr. Visiting Professor for Distinguished Teaching at Princeton. In 2008 he was named Statistician of the Year by the Boston Chapter of the ASA. He has served on the Board of Directors of the ASA, is past chair of the Section on Statistical Learning and Data Science, and is the ASA's 2019-2021 Vice President. Dick has been a consultant for over 30 years for Fortune 500 companies such as Hewlett-Packard, Alcoa, American Express, Bank One, and GlaxoSmithKline.
Relevance to Conference Goals
Directly relevant to themes of practical issues in big data.
Thu, Feb 14
8:00 AM - 12:00 PM
Magazine
Instructor(s): Tim C. Hesterberg, Google
We begin with a graphical approach to bootstrapping and permutation testing, illuminating basic statistical concepts of standard errors, confidence intervals, p-values and significance tests.
We consider a variety of statistics (mean, trimmed mean, regression, etc.), and a number of sampling situations (one-sample, two-sample, stratified, finite-population), stressing the common techniques that apply in these situations. We'll look at applications from a variety of fields, including telecommunications, finance, and biopharm.
These methods let us compute confidence intervals and run hypothesis tests when formulas are not available, which lets us do better statistics, e.g., use robust methods such as a median or trimmed mean instead of a mean. They can help clients understand statistical variability. And some of the methods are more accurate than standard methods.
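As a concrete illustration of the kind of computation this course covers: the plain nonparametric bootstrap needs only a loop that resamples the data with replacement and recomputes the statistic. The sketch below (plain Python on made-up data, not the instructor's R materials) estimates a standard error and a 95% percentile interval for a 25% trimmed mean, a statistic with no convenient textbook formula.

```python
import random
import statistics

def trimmed_mean(xs, prop=0.25):
    """Mean after dropping the lowest and highest `prop` fraction of values."""
    xs = sorted(xs)
    g = int(prop * len(xs))
    return statistics.mean(xs[g:len(xs) - g])

def bootstrap(data, stat, n_boot=2000, seed=42):
    """Bootstrap distribution of `stat`: the statistic recomputed on
    n_boot samples drawn from `data` with replacement."""
    rng = random.Random(seed)
    n = len(data)
    return [stat([data[rng.randrange(n)] for _ in range(n)])
            for _ in range(n_boot)]

# Hypothetical data: 20 right-skewed observations.
data = [1.2, 0.4, 3.1, 2.2, 0.9, 5.7, 1.8, 0.3, 2.9, 1.1,
        4.4, 0.7, 2.0, 1.5, 6.3, 0.8, 2.6, 1.9, 3.8, 1.0]
dist = sorted(bootstrap(data, trimmed_mean))
se = statistics.stdev(dist)                      # bootstrap standard error
lo = dist[int(0.025 * len(dist))]                # 95% percentile interval
hi = dist[int(0.975 * len(dist))]
```

The same two functions work unchanged for a median, a regression coefficient, or any other statistic — exactly the "common techniques across situations" point made above.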
Outline & Objectives
Introduction to Bootstrapping
General procedure
Why does bootstrapping work?
Sampling distribution and bootstrap distribution
Bootstrap Distributions and Standard Errors
Distribution of the sample mean
Bootstrap distributions of other statistics
Simple confidence intervals
Two-sample applications
How Accurate Is a Bootstrap Distribution?
Bootstrap Confidence Intervals
Bootstrap percentiles as a check for standard intervals
More accurate bootstrap confidence intervals
Significance Testing Using Permutation Tests
Two-sample applications
Other settings
Wider variety of statistics
Variety of applications
Examples where things go wrong, and what to look for
Wider variety of sampling methods
Stratified sampling, hierarchical sampling
Finite population
Regression
Time series
Participants will learn how to use resampling methods:
* to compute standard errors,
* to check the accuracy of the usual Gaussian-based methods,
* to compute both quick and more accurate confidence intervals,
* for a variety of statistics and
* for a variety of sampling methods, and
* to perform significance tests in some settings.
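The permutation-test portion of the outline above can likewise be sketched in a few lines (again a plain-Python illustration, not the course materials): shuffle the pooled observations many times and count how often the shuffled difference in group means is at least as extreme as the observed one.

```python
import random

def perm_test(x, y, n_perm=5000, seed=7):
    """Two-sample permutation test for a difference in means.
    Returns (observed difference, two-sided p-value)."""
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    nx = len(x)
    observed = sum(x) / len(x) - sum(y) / len(y)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = sum(pooled[:nx]) / nx - sum(pooled[nx:]) / (len(pooled) - nx)
        if abs(diff) >= abs(observed):
            extreme += 1
    # +1 in numerator and denominator so the p-value is never exactly zero
    return observed, (extreme + 1) / (n_perm + 1)
```

With clearly separated groups the p-value is small; with identical groups every shuffled difference is as extreme as the observed zero, so the p-value is 1.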
About the Instructor
Dr. Tim Hesterberg is a Senior Data Scientist at Google. He previously worked at Insightful (S-PLUS), Franklin & Marshall College, and Pacific Gas & Electric Co. He received his Ph.D. in Statistics from Stanford University, under Brad Efron.
Hesterberg is author of the "Resample" package for R and primary author of the "S+Resample" package for bootstrapping, permutation tests, the jackknife, and other resampling procedures; co-author of Chihara and Hesterberg, "Mathematical Statistics with Resampling and R" (2011); and lead author of "Bootstrap Methods and Permutation Tests" (W. H. Freeman, 2010, ISBN 0-7167-5726-5) as well as technical articles on resampling. See http://www.timhesterberg.net/bootstrap.
Hesterberg is on the executive boards of the National Institute of Statistical Sciences and the Interface Foundation of North America (Interface between Computing Science and Statistics).
He teaches kids to make water bottle rockets, leads groups of high school students to set up computer labs abroad, and actively fights climate chaos.
Relevance to Conference Goals
Resampling methods are important in statistical practice, but are omitted or poorly covered in many old-style statistics courses. These methods are an important part of the toolbox of any practicing statistician.
It is important when using these methods to have some understanding of the ideas behind these methods, to understand when they should or should not be used.
They are not a panacea. People tend to think of bootstrapping for small samples, when they don't trust the central limit theorem. However, the common combination of the nonparametric bootstrap and percentile intervals is actually less accurate than t procedures in small samples. We discuss why, remedies, and better procedures that are only slightly more complicated.
These tools also show how poor common rules of thumb are -- in particular, n >= 30 is woefully inadequate for judging whether t procedures should be OK.
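The n >= 30 point can be checked directly by simulation. The sketch below is an illustrative experiment with made-up parameters, not taken from the course: it draws many samples of size 10 from a skewed exponential distribution and records how often the nominal 95% t interval actually covers the true mean. The realized coverage lands well below 95%.

```python
import math
import random
import statistics

def t_interval(sample, t_crit=2.262):
    """Nominal 95% t confidence interval for the mean.
    t_crit = 2.262 is the 0.975 quantile of t with df = 9 (for n = 10)."""
    n = len(sample)
    m = statistics.mean(sample)
    half = t_crit * statistics.stdev(sample) / math.sqrt(n)
    return m - half, m + half

rng = random.Random(0)
true_mean = 1.0          # Exponential(rate=1): mean 1, strong right skew
n, n_sims = 10, 4000
covered = 0
for _ in range(n_sims):
    sample = [rng.expovariate(1.0) for _ in range(n)]
    lo, hi = t_interval(sample)
    if lo <= true_mean <= hi:
        covered += 1
coverage = covered / n_sims   # noticeably below the nominal 0.95
```

The miscoverage is also asymmetric (the interval misses far more often on one side), which is part of what the more accurate bootstrap intervals in the course are designed to fix.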
Thu, Feb 14
8:00 AM - 12:00 PM
Jackson
Communicating Data Clearly describes how to draw clear, concise, accurate graphs that are easier to understand than many of the graphs one sees today. The course emphasizes how to avoid common mistakes that produce confusing or even misleading graphs. Graphs for one, two, three, and many variables are covered as well as general principles for creating effective graphs.
Outline & Objectives
This course begins by reviewing human perception and our ability to decode graphical information. It continues by:
• Ranking elementary graphical perception tasks to identify those that we do the best.
• Showing the limitations of many common graphical constructions.
• Demonstrating newer, more effective graphical forms developed on the basis of the ranking.
• Providing general principles for creating effective graphs.
• Commenting on software packages that produce graphs.
• Comparing the same data using different graph forms so the audience can see how understanding depends on the graphical construction used.
• Discussing Trellis Display (a framework for the visualization of multivariate data) and other innovative methods for presenting more than two variables.
• Presenting some graphical methods for categorical data.
Since scales (the rulers along which we graph the data) have a profound effect on our interpretation of graphs, the section on general principles contains a detailed discussion of scales.
The course concludes with before and after examples that reinforce the topics covered.
About the Instructor
Naomi B. Robbins is a consultant and seminar leader who specializes in the graphical display of data. She offers keynotes, short courses and workshops to train employees of corporations and organizations on the effective presentation of data. She also reviews documents and presentations for clients, suggesting improvements or alternative presentations as appropriate. She is the author of Creating More Effective Graphs, published by Chart House (2013). Dr. Robbins has been the keynote speaker at international conventions and has spoken on graphs to universities, professional societies, corporations, and non-profits. She received her Ph.D. in mathematical statistics from Columbia University, M.A. from Cornell University, and A.B. from Bryn Mawr College. She had a long career at Bell Laboratories before forming NBR, her consulting practice. Naomi was chair of the Statistical Graphics Section of the American Statistical Association and is the organizer of the Data Visualization New York Meetup.
Relevance to Conference Goals
Attendees will be exposed to graphical techniques, some of which may be new to them. Ideas covered are immediately applicable.
The entire emphasis of the course is to use best graphical practices to communicate quantitative information better.
Effective charts and graphs, together with a better understanding of the data, lead to better decisions, which have a positive impact on the company. Communicating data better also saves time at meetings.
Better communication of data enhances one’s career and avoids the loss of credibility that comes with using confusing, misleading or deceptive figures.
Thu, Feb 14
1:30 PM - 5:30 PM
Magazine
SC6 - Structural Equation and Multilevel Modeling Approaches to Examining Change Over Time
Short Course (half day)
Instructor(s): Kevin John Grimm, Arizona State University
This half day workshop discusses growth models from the multilevel and structural equation modeling perspectives. Growth models have become a mainstay of longitudinal data analysis in the social and behavioral sciences to examine how individuals change over time and how individuals differ in their change process. The workshop covers several introductory topics that range from linear and nonlinear growth models to the inclusion of time-invariant and time-varying covariates. For analysis, we will discuss and use the structural equation modeling and multilevel modeling frameworks available through R and Mplus. The training is intended for faculty, postdocs and advanced graduate students who are familiar with structural equation modeling and multilevel modeling.
Outline & Objectives
The objectives of this half-day workshop are to (1) understand the uniqueness of longitudinal data and the challenges of modeling individual change over time, (2) estimate linear and nonlinear growth models using R and Mplus, (3) interpret model parameters and their importance, and (4) estimate models with time-invariant and time-varying covariates while distinguishing between within- and between-person effects.
About the Instructor
Kevin J. Grimm, Ph.D., is a Professor in the Department of Psychology at Arizona State University, where he teaches classes on the analysis of variance, longitudinal growth modeling, machine learning, and structural equation modeling. He received his B.A. in Mathematics and Psychology with a concentration in Education from Gettysburg College in 2000, and his M.A. and Ph.D. in Psychology from the University of Virginia (2001-2006). His research interests include longitudinal methodology, exploratory data analysis, and data integration, especially the integration of longitudinal studies. His recent research has focused on nonlinearity in growth models, growth mixture models, extensions of latent change score models, and approaches for analyzing change with limited dependent variables. Dr. Grimm directs the American Psychological Association's Advanced Training Institutes on Structural Equation Modeling in Longitudinal Research and Big Data: Exploratory Data Mining.
Relevance to Conference Goals
This workshop is in line with the conference goals for Data Modeling and Analysis as well as Communication. It will engage the audience to consider the various possibilities for modeling longitudinal data and to communicate their findings to wider audiences.
Thu, Feb 14
1:30 PM - 5:30 PM
Jackson
SC7 - How to Best Use Analytical Skills as a Statistician to Influence Quantitative Decision-Making
Short Course (half day)
Instructor(s): Achim Guettner, Novartis Pharma AG; Peter Grant Mesenbrink, Novartis Pharmaceuticals Corporation
Decisions on projects are often not made solely from algorithms or based on the recommendation of statisticians. A well-balanced set of technical and non-technical skills is essential for a statistician to be successful in collaborating on and leading multi-disciplinary projects. With a well-balanced skill set, statisticians have the opportunity to use their analytical abilities to the fullest extent. Maximizing the use of these skills, however, requires statisticians to move outside their comfort zone in order to excel in the leadership of cross-functional teams, to demonstrate strong communication and collaboration skills, and to manage the conflicts that may occur when facing challenges outside the realm of statistics.
Outline & Objectives
This short course will provide statisticians with guidance on the non-technical skills they need to develop to be successful in quantitative decision making when working with non-statisticians. The first half of the course will cover best practices for oral and written communication, using case studies of real-world scenarios from work as a statistician in cross-functional teams. Topics include active listening, asking the right questions, using the right vocal tone for the situation, networking, emotional intelligence, awareness of the surrounding environment, and receiving and providing feedback. References to further reading and online material will be given. The second half of the short course will focus on how statisticians can make the best use of their analytical skills and become successful cross-functional leaders. Time will also be spent on how to best handle conflicts and how to use analytical skills to win the right battles that statisticians face on a daily basis.
About the Instructor
The two speakers have more than 40 years of combined experience as statisticians in the pharmaceutical industry. Dr. Mesenbrink has been an active spokesperson for the Leadership Initiative within the American Statistical Association, while Dr. Guettner is leading an initiative within Novartis on leadership and soft skill development for statisticians. In addition to external publications and presentations on the subject matter, Dr. Mesenbrink is currently finishing the book How to be a Successful Biostatistician in Industry for CRC Press, which is projected to be published by the end of 2018.
Relevance to Conference Goals
As a meeting intended to help statisticians obtain the practical knowledge needed to grow in their careers, this short course is aligned with the conference goals: it provides statisticians with practical knowledge to help them continue to grow and expand their potential career paths.
Thu, Feb 14
5:30 PM - 7:00 PM
St. James Ballroom
Exhibits Open
Exhibits
Thu, Feb 14
5:30 PM - 7:00 PM
St. James Ballroom
PS1 - Poster Session 1 and Opening Mixer
Poster Session
Chair(s): Cate Knockenhauer, Conagra
Friday, February 15
Fri, Feb 15
7:30 AM - 5:30 PM
3rd Floor Registration Counter S
Registration
Registration
Fri, Feb 15
7:30 AM - 6:30 PM
St. James Ballroom
Exhibits Open
Exhibits
Fri, Feb 15
7:30 AM - 8:30 AM
St. James Ballroom
Continental Breakfast
Other
Fri, Feb 15
8:00 AM - 9:00 AM
St. Charles
Chair(s): Eric A. Vance, LISA-University of Colorado Boulder
Fri, Feb 15
9:15 AM - 10:45 AM
St. Charles
Chair(s): Emily M. Slade, University of Kentucky
Fri, Feb 15
9:15 AM - 10:45 AM
Canal
Chair(s): Jay Mandrekar, Mayo Clinic
Fri, Feb 15
9:15 AM - 10:45 AM
Jackson
Chair(s): Jana Anderson, Colorado State University
Fri, Feb 15
9:15 AM - 10:45 AM
Magazine
CS04 - Extending Existing Tools with Applications and Languages
Concurrent Session
Chair(s): Eric Tesdahl, SpecialtyCare
Fri, Feb 15
11:00 AM - 12:30 PM
St. Charles
Chair(s): Julia L. Sharp, Colorado State University
Fri, Feb 15
11:00 AM - 12:30 PM
Canal
Chair(s): Sana N. Charania, CDC
Fri, Feb 15
11:00 AM - 12:30 PM
Jackson
Chair(s): Billy Bridges, Clemson
Fri, Feb 15
11:00 AM - 12:30 PM
Magazine
Chair(s): Roxy Cramer, Rogue Wave Software, Inc.
Fri, Feb 15
12:30 PM - 2:00 PM
Lunch (On Own)
Other
Fri, Feb 15
2:00 PM - 3:30 PM
St. Charles
Chair(s): Christine Luketic, Virginia Tech
Fri, Feb 15
2:00 PM - 3:30 PM
Canal
Chair(s): Xinling (Claire) Xu, Beth Israel Deaconess Medical Center
Fri, Feb 15
2:00 PM - 3:30 PM
Jackson
Chair(s): Wendy Martinez, Bureau of Labor Statistics
Fri, Feb 15
2:00 PM - 3:30 PM
Magazine
Chair(s): Duke Butterfield, Mayo Clinic
Fri, Feb 15
3:45 PM - 5:15 PM
St. Charles
Chair(s): Layla Guyot, Texas State University
Fri, Feb 15
3:45 PM - 5:15 PM
Canal
Chair(s): Cynthia S. Crowson, Mayo Clinic
Fri, Feb 15
3:45 PM - 5:15 PM
Jackson
Chair(s): Qingyang Zhang, University of Arkansas
Fri, Feb 15
3:45 PM - 5:15 PM
Magazine
Chair(s): Natasha Hurwitz, National Institutes of Health
Fri, Feb 15
5:15 PM - 6:30 PM
St. James Ballroom
PS2 - Poster Session 2 and Refreshments
Poster Session
Chair(s): Fabio D'Ottaviano, The Dow Chemical Company
Saturday, February 16
Sat, Feb 16
7:30 AM - 2:30 PM
3rd Floor Registration Counter S
Registration
Registration
Sat, Feb 16
7:30 AM - 1:00 PM
St. James Ballroom
Exhibits Open
Exhibits
Sat, Feb 16
8:00 AM - 9:15 AM
St. James Ballroom
PS3 - Poster Session 3 and Continental Breakfast
Poster Session
Chair(s): Charles Minard, Baylor College of Medicine
Sat, Feb 16
9:15 AM - 10:45 AM
Camp
Chair(s): Caitlin Mary Cunningham, Le Moyne College
Sat, Feb 16
9:15 AM - 10:45 AM
Canal
Chair(s): Ella Revzin, Precima
Sat, Feb 16
9:15 AM - 10:45 AM
Jackson
Chair(s): Birol Emir, Pfizer Inc.
Sat, Feb 16
9:15 AM - 10:45 AM
Magazine
Chair(s): Naomi B. Robbins, NBR
Sat, Feb 16
11:00 AM - 12:30 PM
Camp
Chair(s): Kayéromi Gomez, University of Illinois College of Medicine
Sat, Feb 16
11:00 AM - 12:30 PM
Canal
CS22 - Behind the Model: Modeling Approaches and Strategies
Concurrent Session
Chair(s): Steven B. Cohen, RTI International
Sat, Feb 16
11:00 AM - 12:30 PM
Jackson
Chair(s): Raja Velu, Syracuse University
Sat, Feb 16
11:00 AM - 12:30 PM
Magazine
Chair(s): Sejong Bae, University of Alabama at Birmingham
Sat, Feb 16
12:30 PM - 2:00 PM
Lunch (On Own)
Other
Sat, Feb 16
2:00 PM - 4:00 PM
Camp
PCD1 - Introduction to Structural Equation Modeling Using Stata
Practical Computing Demo
Instructor(s): Chuck Huber, StataCorp
This workshop introduces the concepts and jargon of structural equation modeling (SEM), including path diagrams, latent variables, endogenous and exogenous variables, and goodness of fit. I will describe the similarities and differences between Stata's -sem- and -gsem- commands. Then I demonstrate how to fit many familiar models such as linear regression, multivariate regression, logistic regression, confirmatory factor analysis, and multilevel models using -sem- and -gsem-. I conclude by demonstrating how to fit structural equation models that contain both structural and measurement components.
Outline & Objectives
Participants will learn about the following concepts and tools:
Observed and latent variables
Exogenous and endogenous variables
Recursive and nonrecursive models
Model assumptions
Checking the fit of a structural equation model
How to draw a path diagram using Stata’s SEM Builder
How to use Stata’s -sem- command syntax
How to use Stata’s -gsem- command syntax
Differences and similarities between -sem- and -gsem-
How to fit structural equation models by group
How to constrain model parameters
How to fit a mediation model using SEM
How to estimate descriptive statistics such as sample means, variances, and correlations with SEM
How to fit familiar models such as linear and logistic regression using SEM
How to fit confirmatory factor analysis (CFA) models using SEM
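The course itself uses Stata's -sem- and -gsem- commands, which are not reproduced here. As a rough R analogue of the CFA workflow listed above, a one-factor confirmatory model might look like the following sketch (the lavaan package and its bundled HolzingerSwineford1939 data are illustrative assumptions, not course materials):

```r
# Rough R analogue of a CFA fit; the course demonstrates this in Stata's -sem-.
library(lavaan)

# One latent "visual" factor measured by three observed indicators
model <- 'visual =~ x1 + x2 + x3'

fit <- cfa(model, data = HolzingerSwineford1939)
summary(fit, fit.measures = TRUE)  # loadings plus fit indices such as CFI and RMSEA
```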
About the Instructor
Chuck Huber is a Senior Statistician at StataCorp and Adjunct Associate Professor of Biostatistics at the Texas A&M School of Public Health. In addition to working with Stata's team of software developers, he produces instructional videos for the Stata YouTube channel, writes blog entries, develops online NetCourses and gives talks about Stata at conferences and universities. Most of his current work is focused on statistical methods used by psychologists and other behavioral scientists. He has published in the areas of neurology, human and animal genetics, alcohol and drug abuse prevention, nutrition and birth defects. Dr. Huber currently teaches introductory biostatistics at Texas A&M where he previously taught categorical data analysis, survey data analysis, and statistical genetics.
Relevance to Conference Goals
Structural equation modeling has become increasingly popular for modeling the interrelationships among a group of variables. Many researchers use SEM to understand causal relationships in complex systems. This talk introduces this powerful tool using the popular statistical package Stata.
Sat, Feb 16
2:00 PM - 4:00 PM
Jackson
PCD2 - Interfacing R with Excel in Two Different Ways
Practical Computing Demo
Thanks to its popularity and user-friendly environment, Microsoft Excel is widely used to gain data insights and make better decisions. However, compared to mainstream statistical software such as R, Excel lacks advanced statistical tools, whether used on their own or integrated into procedures. R, on the other hand, is code-driven software with a steep learning curve. To interface the unlimited statistical possibilities of R with the user-friendly environment of Excel, two features have recently been developed within the XLSTAT software: 1) XLSTAT-R helps programmers develop user-friendly dialog boxes in Excel that let users launch customized R procedures directly on data selected in Excel with their mouse. 2) The XLSTAT-RNotebook allows writing R code in Excel cells, with the possibility of capturing data in the form of Excel cell ranges. The outputs are also displayed in Excel, making it possible to create complex dashboards or reports in Excel built from R code. The procedures created this way can then be used by colleagues, students or clients who don’t necessarily know how to code. This tutorial shows how developers can build customized R procedures in an Excel dialog box or directly in Excel cells using XLSTAT.
Basic coding skills are required (preferably R).
Outline & Objectives
Outline:
1. Introduction to XLSTAT-R and the XLSTAT-RNotebook.
2. Application: Making the pam{cluster} R function available in an Excel dialog box and adding the possibility to customize several options and charts from within the dialog box.
3. Application: Developing a customized R-based dashboard in an Excel sheet using the XLSTAT-RNotebook.
Objectives:
At the end of this tutorial, participants will understand the basics of XLSTAT-R or the XLSTAT-RNotebook, used to develop R-based statistical applications or dashboards in Excel.
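The XLSTAT-R integration itself is point-and-click, but the application in step 2 of the outline ultimately wraps a plain R call. A minimal sketch of that underlying call (the iris data and k = 3 are illustrative choices, not tutorial materials):

```r
# The plain R function that an XLSTAT-R dialog box for pam{cluster} would wrap.
library(cluster)

# Partition the four numeric iris measurements into k = 3 medoid-based clusters
fit <- pam(iris[, 1:4], k = 3)

table(fit$clustering, iris$Species)  # cross-tabulate clusters against known species
```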
About the Instructor
Jean Paul Maalouf (PhD) is an independent statistical consultant with 10 years of experience. He worked for four years at Addinsoft as the brand manager of the XLSTAT software, a leader in statistical software for Excel. He contributed substantially to the development of the XLSTAT-R engine and has created many of the default XLSTAT-R procedures included in XLSTAT solutions.
Relevance to Conference Goals
The open-source R software is known for its steep learning curve. Data-inspired decision makers often prefer relying on dashboards or user-friendly environments such as Microsoft Excel. This tutorial shows how data science, data analysis and modeling procedures built in R can be made available to any Excel user thanks to XLSTAT-R and the XLSTAT-RNotebook. These developments are possible under different collaboration scenarios. Chief programming statisticians are able to customize applications for decision makers. Consultants are able to set up Excel applications tailored to the specific needs of their customers. Professors are able to develop customized statistical programs in Excel to illustrate their courses.
Sat, Feb 16
2:00 PM - 4:00 PM
Royal
Increasingly complex observational studies are commonplace in numerous data science settings, including biomedical, health services, pharmaceutical, insurance and online advertising. To adequately estimate causal effect sizes, proper control of known potential confounders is critical. Having gained enormous popularity in recent years, propensity score methods are powerful and elegant tools for estimating causal effects. Without assuming prior knowledge of propensity score methods, this short course will use simulated and real data examples to introduce and illustrate important techniques involving propensity scores, such as weighting, matching and sub-classification. Relevant R and SAS software packages for implementing data analyses will be discussed in detail. Specific topics to be covered include guidelines on how to construct a propensity score model, create matched pairs for binary group comparisons, assess baseline covariate balance after matching and use inverse propensity score weighting techniques. Illustrative examples will accompany each topic, and a brief review of recent relevant developments and their implementation will also be discussed.
Outline & Objectives
Outline:
- Observational Studies: definition, examples, causal effects, confounding control.
- Propensity Scores: definition, properties, modeling techniques.
- Propensity Score Approaches in Observational Studies: weighting, matching, sub-classification; graphical methods to assess covariate balance after matching;
- Illustration of these techniques using R packages MatchIt, Matching and optmatch, as well as SAS PROCs CAUSALTRT and PSMATCH.
- Guidelines on how to best describe the methodology utilized and the results obtained when presenting to a non-technical audience.
- Brief review of most recent methods developments and discussion of their potential for immediate use in practice.
Objectives: The first objective is to provide an example-centered overview of the most commonly used propensity score-based methods in observational studies. The second objective is to present the practical implementation of these methods and highlight the newly developed SAS PROCs CAUSALTRT and PSMATCH. The third objective is to discuss the advantages and disadvantages associated with these methods.
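As a flavor of the MatchIt material in the outline, nearest-neighbor propensity score matching reduces to a few lines of R. This is a hedged sketch only: the covariates chosen below are illustrative, and the course's own examples may differ.

```r
# Minimal nearest-neighbor propensity score matching sketch with MatchIt,
# using its bundled lalonde data; covariate choice is for illustration only.
library(MatchIt)
data("lalonde", package = "MatchIt")

m.out <- matchit(treat ~ age + educ + married,
                 data = lalonde, method = "nearest")

summary(m.out)                 # covariate balance before and after matching
plot(m.out, type = "jitter")   # graphical check of propensity score overlap
```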
About the Instructor
Dr. Andrei received a Ph.D. degree in Biostatistics from the University of Michigan in 2005. He is currently an Associate Professor in the Department of Preventive Medicine at Northwestern University, where he enjoys successful collaborations in cardiovascular outcomes research. He has developed expertise in MSMs and published relevant studies in adult cardiac surgery. He has developed practice-inspired and -oriented statistical methods in survival analysis, recurrent events, group sequential monitoring methods, hierarchical methods, and predictive modeling. In the last 15 years, Dr. Andrei has collaborated with medical researchers in fields such as pulmonary/critical care, organ transplantation, nursing, prostate and breast cancer, anesthesiology and thoracic surgery. Currently, he serves as Statistical Co-Editor of the Journal of the American College of Surgeons and deputy Statistical Editor of the Journal of Thoracic and Cardiovascular Surgery.
Relevance to Conference Goals
Upon attending this short course, participants will gain familiarity with propensity score-based methods for estimating causal effects in observational studies. Implementation in R and SAS software will be covered in detail, which will permit participants to integrate these useful data science techniques into their professional activities and projects. Learning how to produce simple yet powerful graphics to assess the propensity score model adequacy, check covariate balance and display the results will undoubtedly benefit every participant. By leveraging their enhanced set of skills, individuals across industries will be adequately positioned to become more effective communicators in their interactions with customers and clients. Continued professional development is key to one’s career growth and can enhance the overall analytical capabilities within their respective organizations and institutions.
Sat, Feb 16
2:00 PM - 4:00 PM
Commerce
Instructor(s): Jim Harner, West Virginia University
This tutorial covers the data science process using R as a programming language and Spark as a big-data platform. Powerful workflows are developed for data extraction, data tidying and transformation, data modeling, and data visualization.
During the course, R-based examples show how data is transported from data sources into the Hadoop Distributed File System (HDFS), into relational databases, and directly into Spark's real-time compute engine. Workflows using `dplyr' verbs are used for data manipulation within R, within relational databases (PostgreSQL), and within Spark using `sparklyr'. These data-based workflows extend into machine learning algorithms, model evaluation, and data visualization.
The machine learning algorithms include supervised techniques such as linear regression, logistic regression, gradient-boosted trees, and random forests. Feature selection is done primarily by regularization and models are evaluated using various metrics. Unsupervised techniques include k-means clustering and dimension reduction.
Big-data architectures are discussed including the Docker containers used for building the short-course infrastructure called RSpark.
Outline & Objectives
Modules:
1. Fundamentals: Linux; RSpark; RStudio; Git; Data Science Process [20 min]
2. Data Sources: Text; JSON; PostgreSQL; Web [20 min]
3. Data Transformation: Data Cleaning; `tidyr'; `dplyr' [20 min]
4. Hadoop: HDFS as a Persistent Data Store for Spark [30 min]
5. `sparklyr': Spark DataFrames; `dplyr' Interface [30 min]
6. Supervised Learning: Regression and Classification Workflows with Spark [60 min]
7. Unsupervised Learning: Dimension Reduction and Clustering with Spark [30 min]
The first three modules will not be covered in detail since the focus is on the last four. However, the content in modules 1–3 contains critical information for understanding the later modules.
The objectives of this course are to:
• extract static and streaming data from data sources,
• transform data into structured form,
• load data into relational and persistent, distributed data stores,
• build models using machine learning algorithms,
• validate and test models based on evaluation metrics,
• visualize big data and model metrics.
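The `sparklyr'/`dplyr' pairing at the heart of modules 5–6 can be sketched in a few lines. This is a minimal local-mode example, not the course's RSpark infrastructure, and it assumes a local Spark installation (e.g., via sparklyr::spark_install()):

```r
# Minimal sparklyr workflow: dplyr verbs are translated to Spark SQL and
# executed inside Spark; only the small summary is pulled back into R.
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars)   # copy an R data frame into Spark

mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg)) %>%
  collect()                         # bring the aggregated result back to R

spark_disconnect(sc)
```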
About the Instructor
E. James Harner is Professor Emeritus of Statistics and Adjunct Professor of Business Data Analytics at West Virginia University. He was the Chair of the Department of Statistics for 17 years and the Director of the Cancer Center Bioinformatics Core for 15 years. Currently, he is the Chairman of the Interface Foundation of North America which has partnered with the American Statistical Association to organize the annual Symposium on Data Science and Statistics (SDSS). The areas of his technical and research expertise include: bioinformatics, high-dimensional modeling, high-performance computing, streaming and big data modeling, and statistical machine learning.
This course is based on a two-day workshop developed for the National Institute of Statistical Sciences (NISS): https://www.niss.org. The two-day version has been successfully taught three times (at ASA headquarters and at UC Riverside in September, 2017 and at the U. of Toronto in April, 2018). A one-day version of this course will be taught at the Symposium on Data Science and Statistics in May, 2018 and at the Joint Statistical Meeting in August/September, 2018.
Relevance to Conference Goals
Unlike many data science short courses, RSpark provides full big-data platforms (R, Hadoop, and Spark, together with their ecosystems). Most instructors cannot offer this because the infrastructure is difficult to build. Thus, attendees will get a realistic taste of what data science really is.
The full data science process is taught, but the focus is on machine learning and the underlying R code. What is taught is a realistic representation of what is done in practice.
Communication of results is done through reproducible reports and data visualizations, which are often the endpoints of pipelines in R and Spark. Collaboration is primarily done using Git and GitHub, although code sharing within RStudio is also discussed. Data science in practice is almost always a team effort, and parts of this collaboration are taught.
This course offers a unique opportunity for professional development since a real data science platform is used. It is possible to scale RSpark using container orchestration, but the containers used within this course are essentially indistinguishable from a production environment.
Sat, Feb 16
2:00 PM - 4:00 PM
Canal
Instructor(s): Amy Yang, Uptake
It is time to take the next step and start wrapping all the utility functions scattered across numerous .R files into R packages, to help with code organization, distribution, and consistent documentation.
In this hands-on tutorial, I will introduce step-by-step how to build your very own R package. If you've used R, you've almost certainly used a package - but did you know that building your own package is actually not hard at all? If you have written bits of useful code you want to keep and return to, you might want a package.
After this session, participants will have the skills to start a package and document their functions, and resources to use for next steps like vignettes and unit testing. During the tutorial, participants can follow along using provided scripts.
Outline & Objectives
This hands-on tutorial includes the following sections:
1. Setup R and install required packages
2. Create the framework for your package
3. Add functions to the package
4. External dependencies
5. Documentation
6. Install and use your package
7. (Bonus) Distribute your package on GitHub
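The numbered steps above map onto a handful of R commands. One possible modern workflow is sketched below; the choice of usethis/devtools, and the package and file names, are assumptions for illustration, and the tutorial's provided scripts may differ:

```r
# One possible workflow for the outline's steps, using usethis and devtools.
usethis::create_package("mypkg")   # 2. create the package framework
usethis::use_r("greet")            # 3. open R/greet.R to hold a function
usethis::use_package("stringr")    # 4. declare an external dependency in DESCRIPTION
devtools::document()               # 5. build help files from roxygen2 comments
devtools::install()                # 6. install the package so library(mypkg) works
usethis::use_github()              # 7. (bonus) create and push a GitHub repo
```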
About the Instructor
Amy Yang is a Sr. Data Scientist at Uptake, where she conducts industrial analytics and builds prediction models for major industries, helping them increase productivity, security, safety and reliability. She began using R for simulation and statistical analysis during her studies at the University of Pennsylvania, where she received her MS degree in Biostatistics. She also teaches R programming and statistical courses for graduate students. You can find her on Twitter at @ayanalytics.
Outside of work, Amy co-organizes the Chicago RLadies meetup group, where she helps promote R by inviting women speakers from different data science fields to give talks. Her goal is to create a friendly network among women who use R!
Amy also mentors PhD and master students on their quantitative dissertations. She enjoys the teaching aspect of doing Data Science.
Relevance to Conference Goals
The tutorial is relevant to the conference theme in the following areas.
1. Communication and Collaboration
No more emailing .R scripts! An R package gives you an easy way to distribute code to others, especially if you put it on GitHub.
2. Consistent documentation
I can barely remember what half of my functions do, let alone their inputs and outputs. An R package provides a great, consistent documentation structure and actually encourages you to document your functions.
3. Code Organization and reproducibility
Are you trying to figure out where that “function” you wrote months, weeks, or even days ago ended up? Oftentimes, people in statistics end up just rewriting it because that is faster than searching all the .R files. An R package helps organize where your functions go.
Sat, Feb 16
2:00 PM - 4:00 PM
Magazine
T4 - Simulation Design and Reporting with Applications to Drug Development
Tutorial
Instructor(s): Greg Cicconetti, AbbVie; Inna Perevozskaya, GlaxoSmithKline
Simulation methods have become an increasingly important tool in the search for more efficient clinical trial designs and/or statistical analysis procedures. During our short course we will provide a road map to developing and executing a successful simulation plan and communicating these results with a broader team. We will begin with a survey of problems one might encounter during the design, monitoring and analysis stages of a clinical trial for which a simulation study may provide some insight. We continue with an introduction to standard methods for generating random data. This discussion will include methods to mimic real-world data that do not adhere to standard statistical distributions, methods to introduce correlation among endpoints, parametric and non-parametric bootstrapping techniques, and the use of historic data to simulate future data. Having established this foundation, we return to some of our motivating problems and discuss their simulation-based solutions in greater depth. Though extensive R code will be provided to supplement this tutorial, our emphasis will be on the important concepts and principles of good simulation design and reporting.
Outline & Objectives
Tentative Course Outline: a subset of topics may be replaced with more contemporary materials
• Welcome and introduction
• Some motivation for simulation
• Modeling randomness
• Enrollment modeling
• Simulating correlated data
• An application using simulated correlated endpoints
• Leveraging historic data to aid in simulation
• Case study: Robustness of efficacy to early withdrawers in an outcomes study
• Case Study: Recurrent events
• Simulation Size – How large is large?
• Closing remarks
Course Objectives:
• Provide an introduction to statistical simulation
• Contrast theory and iterative problem solving
• Demonstrate simulation concepts via examples
• Simulation planning
• Communicating & drawing inferences from simulation
• Focus is not on coding and syntax or deep theory
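The course promises extensive R code; as a small taste of the "simulating correlated data" topic above, two correlated normal endpoints can be drawn from a multivariate normal. This is a sketch only, with an arbitrary target correlation of 0.5, not course material:

```r
# Simulate two correlated normal endpoints via MASS::mvrnorm.
library(MASS)

set.seed(42)
Sigma <- matrix(c(1.0, 0.5,
                  0.5, 1.0), nrow = 2)          # target covariance/correlation matrix
x <- mvrnorm(n = 10000, mu = c(0, 0), Sigma = Sigma)

cor(x)   # empirical correlation should be close to the target 0.5
```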
About the Instructor
Greg Cicconetti, Ph.D., Statistical Innovations, Data and Statistical Sciences, AbbVie. Greg began his career as an assistant professor of statistics at Muhlenberg College before joining the pharmaceutical industry in 2005. In his roles at GlaxoSmithKline and AbbVie, Greg has gained extensive experience in survival and longitudinal trials, Bayesian methodology, and statistical learning. He has used simulation to guide teams regarding trial design, monitoring, and sensitivity analyses. In his current position Greg assists study teams in determining decision criteria to be used at interim analyses, effectively marrying simulation and visualization to build team consensus. Portions of the planned course material were delivered at the 2014 Deming Conference and also used in the graduate level Advanced Statistical Computing course at Drexel University taught by Greg in 2015. Greg is also a member of the DIA Scientific Working Group on Adaptive Designs and has participated in the development of a manuscript, along with other industry experts, advocating best practices in simulation reporting.
Relevance to Conference Goals
While this course is intended to be an introduction to simulation design and reporting, the attendee will be exposed to new statistical methodologies currently being employed to support on-going trials. Our discussion on simulation reporting will emphasize the importance of clearly articulating one's simulation design and summarizing pertinent simulation output in a way that facilitates collaboration with multiple stakeholders. Although we will use drug development and clinical trial design as a backdrop for explaining important simulation concepts, the core ideas presented should readily translate to those in other fields.
Sat, Feb 16
4:15 PM - 5:30 PM
Jackson
GS2 - Closing General Session
General Session
Chair(s): Kim Love, K. R. Love Quantitative Consulting and Collaboration
The Closing Session is an opportunity for you to interact with the CSP Steering Committee in an open discussion about how the conference went and how it could be improved in future years. CSPSC vice chair, Kim Love, will lead a panel of committee members as they summarize their conference experience. The audience will then be invited to ask questions and provide feedback. The committee highly values suggestions for improvements gathered during this time. The best student poster will also be awarded during the Closing Session, and each attendee will have an opportunity to win a door prize.