Thursday, February 14
Thu, Feb 14
7:00 AM - 6:30 PM
3rd Floor Registration Counter S
Registration
Registration
Thu, Feb 14
8:00 AM - 5:30 PM
Commerce
Instructor(s): Frank Harrell, Vanderbilt University
All standard regression models have assumptions that must be verified for the model to have power to test hypotheses and for it to be able to predict accurately. Of the principal assumptions (linearity, additivity, distributional), this course will emphasize methods for assessing and satisfying the first two. Practical but powerful tools are presented for validating model assumptions and presenting model results. This course provides methods for estimating the shape of the relationship between predictors and response using the widely applicable method of augmenting the design matrix using restricted cubic splines. Even when assumptions are satisfied, overfitting can ruin a model's predictive ability for future observations. Methods for data reduction will be introduced to deal with the common case where the number of potential predictors is large in comparison with the number of observations. Methods of model validation (bootstrap and cross-validation) will be covered, as will auxiliary topics such as modeling interaction surfaces, variable selection, overly influential observations, collinearity, and shrinkage, and a brief introduction to the R rms package for handling these problems. The methods covered will apply to almost any regression model, including ordinary least squares, logistic regression models, ordinal regression, quantile regression, longitudinal data analysis, and survival models.
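The restricted cubic spline construction described above is simple enough to sketch directly. The following Python function is a minimal illustration of Harrell's (unnormalized) restricted cubic spline basis — it is not the course's R rms materials, where rcs() does this work. With k knots it produces k−2 columns that, together with x itself, give a fit that is cubic between knots and constrained to be linear beyond the outer knots.

```python
def rcs_basis(x_values, knots):
    """Restricted cubic spline basis (Harrell's unnormalized form).

    For k knots t_1 < ... < t_k this returns, for each x, the k-2
    nonlinear basis terms; together with x itself they describe a
    curve that is cubic between knots but linear beyond the outer
    knots (and exactly zero below the first knot).
    """
    t = sorted(knots)
    k = len(t)

    def pos3(u):                      # truncated cubic (u)_+^3
        return max(u, 0.0) ** 3

    rows = []
    for x in x_values:
        row = []
        for j in range(k - 2):
            term = (pos3(x - t[j])
                    - pos3(x - t[k - 2]) * (t[k - 1] - t[j]) / (t[k - 1] - t[k - 2])
                    + pos3(x - t[k - 1]) * (t[k - 2] - t[j]) / (t[k - 1] - t[k - 2]))
            row.append(term)
        rows.append(row)
    return rows
```

Augmenting a design matrix with these columns and fitting by ordinary least squares (or any other regression routine) estimates the shape of the predictor–response relationship without assuming linearity.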
Outline & Objectives
1. Introduction; Advantages of prediction over classification
2. Hypothesis Testing vs. Estimation vs. Prediction vs. Classification
3. How Many Degrees of Freedom does a Data Mining Procedure Actually Have?
4. Regression Model Notation
5. Model Formulations
6. Interpreting Model Parameters
(a) Nominal Predictors
(b) Interactions
7. Relaxing Linearity Assumption for Continuous Predictors
(a) Categorization is not an alternative
(b) Simple Nonlinear Terms
(c) Splines for Estimating Shape of Regression Function and Determining Predictor Transformations
(d) Cubic Spline Functions
(e) Restricted Cubic Splines
(f) Choosing Number and Position of Knots
(g) Nonparametric smoothers and regression trees
(h) Advantages of Splines over Other Methods
8. Multiple Degree of Freedom Tests of Association
9. Assessment of Model Fit
(a) Regression Assumptions
(b) Modeling and Testing Interactions
10. Multivariable Modeling Strategy
(a) Why and How To Pre-specify Model Complexity
(b) Problems Caused by Ordinary Stepwise Variable Selection
(c) Collinearity
(d) Shrinkage
(e) Data Reduction
(f) Overly Influential Observations
(g) Some Useful Modeling Strategies for
i. Prediction
ii. Estimation
iii. Hypothesis Testing
11. Overview of the Bootstrap
12. Model Validation
(a) Cross-validation
(b) Bootstrap
13. Graphical Methods for Interpreting Complex Regression Fits
14. Detailed Case Studies
(a) Generalized Least Squares for Serial Data
(b) Ordinal Regression for Continuous Y: Predicting glycohemoglobin (and pre-diabetes) from body size characteristics using NHANES data
(c) Binary Logistic Regression: Survival Patterns of Passengers on the Titanic
(d) Survival Modeling
A more detailed outline is available at biostat.mc.vanderbilt.edu/rms.
About the Instructor
Dr. Harrell is Professor of Biostatistics, Founding Chair of the Department of Biostatistics of Vanderbilt University School of Medicine, and Expert Statistical Advisor, Office of Biostatistics, Center for Drug Evaluation and Research, US FDA. Prior to starting the new department in 2003 he was Chief of the Division of Biostatistics and Epidemiology in the Department of Health Evaluation Sciences, University of Virginia School of Medicine. Prior to coming to the University of Virginia in 1996 he was in the Division of Biometry at Duke University Medical Center for 17 years. He received his Ph.D. in biostatistics from the University of North Carolina, Chapel Hill in 1979, where he studied under P.K. Sen. Dr. Harrell's interests include statistical modeling and model validation, statistical computing and graphics, reproducible research, survival analysis, clinical trials, health services and outcomes research, medical diagnostic and prognostic models, bootstrapping, missing data, and Bayesian modeling. He is an associate editor of Statistics in Medicine, a member of the editorial board for American Heart Journal, a member of Faculty of 1000 Medicine, on the editorial policy board for the Journal of Clinical Epidemiology, and a member of the Scientific Advisory Board for Science Translational Medicine. For many years he has been a consultant to FDA and the pharmaceutical industry. He is author of the book Regression Modeling Strategies, Second Edition (Springer, 2015) and teaches courses in biostatistical modeling. He was the recipient of the American Statistical Association's W.J. Dixon award for excellence in statistical consulting in 2014.
Relevance to Conference Goals
This is an applied statistics course that teaches regression analysis and predictive modeling tools that have wide applicability, and should be of great value to almost all practicing statisticians.
Thu, Feb 14
8:00 AM - 5:30 PM
Royal
SC2 - Big Data, Data Science, and Deep Learning for Statisticians
Short Course (full day)
Instructor(s): Ming Li, Amazon; Hui Lin, Netlify
With the recent big data, data science, and deep learning revolution, enterprises ranging from FORTUNE 100 companies to startups across the world are hungry for data scientists and machine learning scientists who can bring actionable insight from the vast amounts of data collected. In the past couple of years, deep learning has gained traction in many application areas and has become an essential tool in the data scientist’s toolbox. In this course, students will develop a clear understanding of big data cloud platforms, technical skills in data science and machine learning, and especially the motivation and use cases of deep learning through hands-on exercises. We will also cover the “art” of data science and machine learning, guiding participants through a typical agile data science project flow, common pitfalls in data science and machine learning, and the soft skills needed to communicate effectively with business stakeholders. This course will prepare statisticians to be successful data scientists and deep learning scientists in various industries and business sectors.
Outline & Objectives
The big data platform, data science, and deep learning overviews are specifically designed for an audience with a statistics education background. The data science workflow, pitfalls, and soft skills are highlighted through real-world data science and machine learning problems. The Databricks community edition cloud platform will be used throughout the training course for hands-on sessions, including: (1) the big data platform using Spark through the R sparklyr package; (2) an introduction to Deep Neural Networks, Convolutional Neural Networks, and Recurrent Neural Networks and their applications; (3) deep learning examples using TensorFlow through the R keras package. The primary audiences for this course are: (1) statisticians in traditional industry sectors such as manufacturing, pharmaceuticals, and banking; (2) statisticians in government agencies; (3) statistical researchers in universities; (4) graduate students in statistics departments. The prerequisites are MS-level education in statistics and entry-level knowledge of R. No software installation is needed on students’ laptops; the cloud platform is easily accessed through a browser such as Chrome or Firefox with an internet connection.
About the Instructor
Both instructors have Ph.D.s in Statistics from Iowa State University and have worked in data science and machine learning for a number of years. Dr. Li is a Sr. Data Scientist at Amazon and Dr. Lin is a Data Scientist at Netlify. Before Amazon, Dr. Li was at Walmart, SAS, and GE, and he was the 2017 Chair of the Quality and Productivity Section of ASA. Dr. Lin was a leader at DuPont in applying advanced data science to enhance marketing and sales effectiveness, and she is the co-founder of the Central Iowa R User Group and a blogger at scientistcafe.com. With deep statistics backgrounds and several years of industrial experience in data science, they have trained and mentored numerous junior data scientists with diverse backgrounds. They taught a similar continuing education course, without the deep learning part, at the 2017 JSM, and they will teach similar courses at the Joint Research Conference, the ICSA Applied Statistics Symposium, and the Fall Technical Conference in 2018. Dr. Li organized and will present the Introductory Overview Lecture “Leading Data Science: Talent, Strategy, and Impact” at the 2018 JSM. Dr. Li is also an instructor at Amazon’s internal Machine Learning University.
Relevance to Conference Goals
This short course fits the conference goals very well. It focuses on big data and data science applications to real-world problems, including the new developments in deep learning. With its focus on the cloud platform, students can learn about the current data science software and big data infrastructure used by tech companies, expanding their programming scope to cover more applications in data science and machine learning. The short course also includes the soft-skill discussions needed to give students a better understanding of the data science project flow, pitfalls in machine learning, and communication skills. The course keeps statisticians' backgrounds in mind to bridge the gap between a traditional statistician and a successful data scientist. After taking the course, students will be confident to positively impact their organizations by transforming their traditional statistics teams into data science or machine learning teams, or to explore data scientist or machine learning scientist opportunities in their future career development.
Thu, Feb 14
8:00 AM - 5:30 PM
Canal
Instructor(s): Richard D. De Veaux, Williams College
This seminar is a practical introduction to and an overview of the techniques and strategies of data mining. While I will discuss the models in detail, the course will be application- rather than theory-oriented. Many of the standard techniques of data mining will be presented, including modern model selection strategies for multiple regression such as the lasso and the elastic net. In addition, we'll cover classification and regression trees, neural networks, principal components, Naïve Bayes, bagging, and boosting. The course will be problem-solving based, using real case studies from science and industry to illustrate which methods work well, when, and why. We will emphasize problem formulation, the challenges of the process, and the communication back to decision makers that is necessary to effect maximum impact in the organization. No prerequisites other than a knowledge of the basics of regression are assumed. The applications will come from a wide variety of industries and include some from my personal experience as a consultant for companies in financial services, chemical processing, pharmaceuticals, and insurance.
Outline & Objectives
Outline:
1. Introduction to data mining
a. What is data mining?
b. What are the applications?
c. How does it differ from statistics?
2. Formulating the problem
a. Data considerations
b. How to evaluate the methods
c. Testing and training
3. The methods: overview of the most commonly used algorithms
4. Case studies
a. In-depth comparisons of the methods and how they helped solve the problem
b. Challenges to communication
5. Summary
Learning Objectives
(a) Learning outcomes (performance objectives): In the process of analyzing the data sets, attendees will learn how to:
• Identify appropriate problems for data mining
• Explore and prepare data for mining
• Use a variety of techniques, including decision trees and neural nets, to build accurate predictive models
• Evaluate the quality of models
• Select the appropriate data mining tools for applications
(b) Content and instructional methods: The presentation engages participants with the material through audience participation in the data analyses.
About the Instructor
Richard De Veaux (Dick), Ph.D., is the C. Carlisle and Margaret Tippit Professor of Statistics at Williams College. He holds degrees in Civil Engineering (B.S.E., Princeton), Mathematics (A.B., Princeton), Dance Education (M.A., Stanford), and Statistics (Ph.D., Stanford), where he studied statistics with Persi Diaconis and dance with Inga Weiss. Dick has taught at the Wharton School and Princeton University and has been a visiting researcher at INRA in Montpellier and a visiting professor at Paris V. De Veaux has won numerous teaching awards from the Engineering Council at Princeton. He has won both the Wilcoxon and Shewell (twice) awards from the ASQ, is a Fellow of the ASA, and is an elected member of the ISI. In 2006-2007 he was the William R. Kenan Jr. Visiting Professor for Distinguished Teaching at Princeton. In 2008 he was named Statistician of the Year by the Boston Chapter of the ASA. He has served on the Board of Directors of the ASA, is past chair of the Section on Statistical Learning and Data Science, and is the ASA's 2019-2021 Vice President. Dick has been a consultant for over 30 years for Fortune 500 companies such as Hewlett-Packard, Alcoa, American Express, Bank One, and GlaxoSmithKline.
Relevance to Conference Goals
Directly relevant to themes of practical issues in big data.
Thu, Feb 14
8:00 AM - 12:00 PM
Magazine
Instructor(s): Tim C. Hesterberg, Google
We begin with a graphical approach to bootstrapping and permutation testing, illuminating basic statistical concepts of standard errors, confidence intervals, p-values and significance tests.
We consider a variety of statistics (mean, trimmed mean, regression, etc.), and a number of sampling situations (one-sample, two-sample, stratified, finite-population), stressing the common techniques that apply in these situations. We'll look at applications from a variety of fields, including telecommunications, finance, and biopharm.
These methods let us compute confidence intervals and run hypothesis tests when formulas are not available, which lets us do better statistics, e.g., use robust methods such as a median or trimmed mean instead of a mean. They can help clients understand statistical variability. And some of the methods are more accurate than standard methods.
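As a concrete illustration of the kind of computation this course covers: the plain nonparametric bootstrap needs only a loop that resamples the data with replacement and recomputes the statistic. The sketch below (plain Python on made-up data, not the instructor's R materials) estimates a standard error and a 95% percentile interval for a 25% trimmed mean, a statistic with no convenient textbook formula.

```python
import random
import statistics

def trimmed_mean(xs, prop=0.25):
    """Mean after dropping the lowest and highest `prop` fraction of values."""
    xs = sorted(xs)
    g = int(prop * len(xs))
    return statistics.mean(xs[g:len(xs) - g])

def bootstrap(data, stat, n_boot=2000, seed=42):
    """Bootstrap distribution of `stat`: the statistic recomputed on
    n_boot samples drawn from `data` with replacement."""
    rng = random.Random(seed)
    n = len(data)
    return [stat([data[rng.randrange(n)] for _ in range(n)])
            for _ in range(n_boot)]

# Hypothetical data: 20 right-skewed observations.
data = [1.2, 0.4, 3.1, 2.2, 0.9, 5.7, 1.8, 0.3, 2.9, 1.1,
        4.4, 0.7, 2.0, 1.5, 6.3, 0.8, 2.6, 1.9, 3.8, 1.0]
dist = sorted(bootstrap(data, trimmed_mean))
se = statistics.stdev(dist)                      # bootstrap standard error
lo = dist[int(0.025 * len(dist))]                # 95% percentile interval
hi = dist[int(0.975 * len(dist))]
```

The same two functions work unchanged for a median, a regression coefficient, or any other statistic — exactly the "common techniques across situations" point made above.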
Outline & Objectives
Introduction to Bootstrapping
General procedure
Why does bootstrapping work?
Sampling distribution and bootstrap distribution
Bootstrap Distributions and Standard Errors
Distribution of the sample mean
Bootstrap distributions of other statistics
Simple confidence intervals
Two-sample applications
How Accurate Is a Bootstrap Distribution?
Bootstrap Confidence Intervals
Bootstrap percentiles as a check for standard intervals
More accurate bootstrap confidence intervals
Significance Testing Using Permutation Tests
Two-sample applications
Other settings
Wider variety of statistics
Variety of applications
Examples where things go wrong, and what to look for
Wider variety of sampling methods
Stratified sampling, hierarchical sampling
Finite population
Regression
Time series
Participants will learn how to use resampling methods:
* to compute standard errors,
* to check the accuracy of the usual Gaussian-based methods,
* to compute both quick and more accurate confidence intervals,
* for a variety of statistics and
* for a variety of sampling methods, and
* to perform significance tests in some settings.
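The permutation-test portion of the outline above can likewise be sketched in a few lines (again a plain-Python illustration, not the course materials): shuffle the pooled observations many times and count how often the shuffled difference in group means is at least as extreme as the observed one.

```python
import random

def perm_test(x, y, n_perm=5000, seed=7):
    """Two-sample permutation test for a difference in means.
    Returns (observed difference, two-sided p-value)."""
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    nx = len(x)
    observed = sum(x) / len(x) - sum(y) / len(y)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = sum(pooled[:nx]) / nx - sum(pooled[nx:]) / (len(pooled) - nx)
        if abs(diff) >= abs(observed):
            extreme += 1
    # +1 in numerator and denominator so the p-value is never exactly zero
    return observed, (extreme + 1) / (n_perm + 1)
```

With clearly separated groups the p-value is small; with identical groups every shuffled difference is as extreme as the observed zero, so the p-value is 1.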
About the Instructor
Dr. Tim Hesterberg is a Senior Data Scientist at Google. He previously worked at Insightful (S-PLUS), Franklin & Marshall College, and Pacific Gas & Electric Co. He received his Ph.D. in Statistics from Stanford University, under Brad Efron.
Hesterberg is author of the "Resample" package for R and primary author of the "S+Resample" package for bootstrapping, permutation tests, the jackknife, and other resampling procedures; co-author of Chihara and Hesterberg, "Mathematical Statistics with Resampling and R" (2011); and lead author of "Bootstrap Methods and Permutation Tests" (W. H. Freeman, 2010, ISBN 0-7167-5726-5) as well as technical articles on resampling. See http://www.timhesterberg.net/bootstrap.
Hesterberg is on the executive boards of the National Institute of Statistical Sciences and the Interface Foundation of North America (Interface between Computing Science and Statistics).
He teaches kids to make water bottle rockets, leads groups of high school students to set up computer labs abroad, and actively fights climate chaos.
Relevance to Conference Goals
Resampling methods are important in statistical practice, but are omitted or poorly covered in many old-style statistics courses. These methods are an important part of the toolbox of any practicing statistician.
It is important when using these methods to have some understanding of the ideas behind these methods, to understand when they should or should not be used.
They are not a panacea. People tend to think of bootstrapping for small samples, when they don't trust the central limit theorem. However, the common combination of the nonparametric bootstrap and percentile intervals is actually less accurate than t procedures in small samples. We discuss why, remedies, and better procedures that are only slightly more complicated.
These tools also show how poor common rules of thumb are -- in particular, n >= 30 is woefully inadequate for judging whether t procedures should be OK.
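The n >= 30 point can be checked directly by simulation. The sketch below is an illustrative experiment with made-up parameters, not taken from the course: it draws many samples of size 10 from a skewed exponential distribution and records how often the nominal 95% t interval actually covers the true mean. The realized coverage lands well below 95%.

```python
import math
import random
import statistics

def t_interval(sample, t_crit=2.262):
    """Nominal 95% t confidence interval for the mean.
    t_crit = 2.262 is the 0.975 quantile of t with df = 9 (for n = 10)."""
    n = len(sample)
    m = statistics.mean(sample)
    half = t_crit * statistics.stdev(sample) / math.sqrt(n)
    return m - half, m + half

rng = random.Random(0)
true_mean = 1.0          # Exponential(rate=1): mean 1, strong right skew
n, n_sims = 10, 4000
covered = 0
for _ in range(n_sims):
    sample = [rng.expovariate(1.0) for _ in range(n)]
    lo, hi = t_interval(sample)
    if lo <= true_mean <= hi:
        covered += 1
coverage = covered / n_sims   # noticeably below the nominal 0.95
```

The miscoverage is also asymmetric (the interval misses far more often on one side), which is part of what the more accurate bootstrap intervals in the course are designed to fix.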
Thu, Feb 14
8:00 AM - 12:00 PM
Jackson
Communicating Data Clearly describes how to draw clear, concise, accurate graphs that are easier to understand than many of the graphs one sees today. The course emphasizes how to avoid common mistakes that produce confusing or even misleading graphs. Graphs for one, two, three, and many variables are covered as well as general principles for creating effective graphs.
Outline & Objectives
This course begins by reviewing human perception and our ability to decode graphical information. It continues by:
• Ranking elementary graphical perception tasks to identify those that we do the best.
• Showing the limitations of many common graphical constructions.
• Demonstrating newer, more effective graphical forms developed on the basis of the ranking.
• Providing general principles for creating effective graphs.
• Commenting on software packages that produce graphs.
• Comparing the same data using different graph forms so the audience can see how understanding depends on the graphical construction used.
• Discussing Trellis Display (a framework for the visualization of multivariate data) and other innovative methods for presenting more than two variables.
• Presenting some graphical methods for categorical data.
Since scales (the rulers along which we graph the data) have a profound effect on our interpretation of graphs, the section on general principles contains a detailed discussion of scales.
The course concludes with before and after examples that reinforce the topics covered.
About the Instructor
Naomi B. Robbins is a consultant and seminar leader who specializes in the graphical display of data. She offers keynotes, short courses and workshops to train employees of corporations and organizations on the effective presentation of data. She also reviews documents and presentations for clients, suggesting improvements or alternative presentations as appropriate. She is the author of Creating More Effective Graphs, published by Chart House (2013). Dr. Robbins has been the keynote speaker at international conventions and has spoken on graphs to universities, professional societies, corporations, and non-profits. She received her Ph.D. in mathematical statistics from Columbia University, M.A. from Cornell University, and A.B. from Bryn Mawr College. She had a long career at Bell Laboratories before forming NBR, her consulting practice. Naomi was chair of the Statistical Graphics Section of the American Statistical Association and is the organizer of the Data Visualization New York Meetup.
Relevance to Conference Goals
Attendees will be exposed to graphical techniques, some of which may be new to them. Ideas covered are immediately applicable.
The entire emphasis of the course is to use best graphical practices to communicate quantitative information better.
Effective charts and graphs, together with a better understanding of the data, lead to better decisions, which have a positive impact on the company. Communicating data better also saves time at meetings.
Better communication of data enhances one’s career and avoids the loss of credibility that comes with using confusing, misleading or deceptive figures.
Thu, Feb 14
1:30 PM - 5:30 PM
Magazine
SC6 - Structural Equation and Multilevel Modeling Approaches to Examining Change Over Time
Short Course (half day)
Instructor(s): Kevin John Grimm, Arizona State University
This half day workshop discusses growth models from the multilevel and structural equation modeling perspectives. Growth models have become a mainstay of longitudinal data analysis in the social and behavioral sciences to examine how individuals change over time and how individuals differ in their change process. The workshop covers several introductory topics that range from linear and nonlinear growth models to the inclusion of time-invariant and time-varying covariates. For analysis, we will discuss and use the structural equation modeling and multilevel modeling frameworks available through R and Mplus. The training is intended for faculty, postdocs and advanced graduate students who are familiar with structural equation modeling and multilevel modeling.
Outline & Objectives
The objectives of this half-day workshop are to (1) understand the uniqueness of longitudinal data and the challenges of modeling individual change over time, (2) estimate linear and nonlinear growth models using R and Mplus, (3) interpret model parameters and their importance, and (4) estimate models with time-invariant and time-varying covariates while distinguishing between within- and between-person effects.
About the Instructor
Kevin J. Grimm, Ph.D., is a Professor in the Department of Psychology at Arizona State University, where he teaches classes on the analysis of variance, longitudinal growth modeling, machine learning, and structural equation modeling. He received his B.A. in Mathematics and Psychology with a concentration in Education from Gettysburg College in 2000, and his M.A. and Ph.D. in Psychology from the University of Virginia (2001-2006). His research interests include longitudinal methodology, exploratory data analysis, and data integration, especially the integration of longitudinal studies. His recent research has focused on nonlinearity in growth models, growth mixture models, extensions of latent change score models, and approaches for analyzing change with limited dependent variables. Dr. Grimm directs the American Psychological Association's Advanced Training Institutes on Structural Equation Modeling in Longitudinal Research and Big Data: Exploratory Data Mining.
Relevance to Conference Goals
This workshop is in line with the conference goals for Data Modeling and Analysis as well as Communication. It will engage the audience to consider the various possibilities for modeling longitudinal data and to communicate their findings to wider audiences.
Thu, Feb 14
1:30 PM - 5:30 PM
Jackson
SC7 - How to Best Use Analytical Skills as a Statistician to Influence Quantitative Decision-Making
Short Course (half day)
Instructor(s): Achim Guettner, Novartis Pharma AG; Peter Grant Mesenbrink, Novartis Pharmaceuticals Corporation
Decisions on projects are often not made solely from algorithms or based on the recommendation of statisticians. A well-balanced set of technical and non-technical skills is essential for a statistician to be successful in collaborating on and leading multi-disciplinary projects. With a well-balanced skill set, statisticians have the opportunity to use their analytical abilities to the fullest extent. Maximizing the use of these skills, however, requires statisticians to move outside their comfort zone in order to excel in the leadership of cross-functional teams, to demonstrate strong communication and collaboration skills, and to manage the conflicts that may occur when facing challenges outside the realm of statistics.
Outline & Objectives
This short course will provide statisticians with guidance on the non-technical skills they need to develop to be successful in quantitative decision making when working with non-statisticians. The first half of the course will cover best practices for oral and written communication, using case studies of real-world scenarios from work as a statistician in cross-functional teams. Topics include active listening, asking the right questions, using the right vocal tone for the situation, networking, emotional intelligence, awareness of the surrounding environment, and receiving and providing feedback. References to further reading and online material will be given. The second half of the short course will focus on how statisticians can make the best use of their analytical skills and become successful cross-functional leaders. Time will also be spent on how to best handle conflicts and how to use analytical skills to win the right battles that statisticians face on a daily basis.
About the Instructor
The two speakers have more than 40 years of combined experience as statisticians in the pharmaceutical industry. Dr. Mesenbrink has been an active spokesperson for the Leadership Initiative within the American Statistical Association, while Dr. Guettner is leading an initiative within Novartis on leadership and soft skill development for statisticians. In addition to external publications and presentations on the subject matter, Dr. Mesenbrink is currently finishing the book How to be a Successful Biostatistician in Industry for CRC Press, which is projected to be published by the end of 2018.
Relevance to Conference Goals
As a meeting intended to help statisticians obtain the practical knowledge needed to grow in their careers, this short course is aligned with the conference goals: it provides statisticians with practical knowledge to help them continue to grow and expand their potential career paths.
Thu, Feb 14
5:30 PM - 7:00 PM
St. James Ballroom
Exhibits Open
Exhibits
Thu, Feb 14
5:30 PM - 7:00 PM
St. James Ballroom
PS1 - Poster Session 1 and Opening Mixer
Poster Session
Chair(s): Cate Knockenhauer, Conagra
Friday, February 15
Fri, Feb 15
7:30 AM - 5:30 PM
3rd Floor Registration Counter S
Registration
Registration
Fri, Feb 15
7:30 AM - 6:30 PM
St. James Ballroom
Exhibits Open
Exhibits
Fri, Feb 15
7:30 AM - 8:30 AM
St. James Ballroom
Continental Breakfast
Other
Fri, Feb 15
8:00 AM - 9:00 AM
St. Charles
Chair(s): Eric A. Vance, LISA-University of Colorado Boulder
Fri, Feb 15
9:15 AM - 10:45 AM
St. Charles
Chair(s): Emily M. Slade, University of Kentucky
Fri, Feb 15
9:15 AM - 10:45 AM
Canal
Chair(s): Jay Mandrekar, Mayo Clinic
Fri, Feb 15
9:15 AM - 10:45 AM
Jackson
Chair(s): Jana Anderson, Colorado State University
Fri, Feb 15
9:15 AM - 10:45 AM
Magazine
CS04 - Extending Existing Tools with Applications and Languages
Concurrent Session
Chair(s): Eric Tesdahl, SpecialtyCare
Fri, Feb 15
11:00 AM - 12:30 PM
St. Charles
Chair(s): Julia L. Sharp, Colorado State University
Fri, Feb 15
11:00 AM - 12:30 PM
Canal
Chair(s): Sana N. Charania, CDC
Fri, Feb 15
11:00 AM - 12:30 PM
Jackson
Chair(s): Billy Bridges, Clemson
Fri, Feb 15
11:00 AM - 12:30 PM
Magazine
Chair(s): Roxy Cramer, Rogue Wave Software, Inc.
Fri, Feb 15
12:30 PM - 2:00 PM
Lunch (On Own)
Other
Fri, Feb 15
2:00 PM - 3:30 PM
St. Charles
Chair(s): Christine Luketic, Virginia Tech
Fri, Feb 15
2:00 PM - 3:30 PM
Canal
Chair(s): Xinling (Claire) Xu, Beth Israel Deaconess Medical Center
Fri, Feb 15
2:00 PM - 3:30 PM
Jackson
Chair(s): Wendy Martinez, Bureau of Labor Statistics
Fri, Feb 15
2:00 PM - 3:30 PM
Magazine
Chair(s): Duke Butterfield, Mayo Clinic
Fri, Feb 15
3:45 PM - 5:15 PM
St. Charles
Chair(s): Layla Guyot, Texas State University
Fri, Feb 15
3:45 PM - 5:15 PM
Canal
Chair(s): Cynthia S. Crowson, Mayo Clinic
Fri, Feb 15
3:45 PM - 5:15 PM
Jackson
Chair(s): Qingyang Zhang, University of Arkansas
Fri, Feb 15
3:45 PM - 5:15 PM
Magazine
Chair(s): Natasha Hurwitz, National Institutes of Health
Fri, Feb 15
5:15 PM - 6:30 PM
St. James Ballroom
PS2 - Poster Session 2 and Refreshments
Poster Session
Chair(s): Fabio D'Ottaviano, The Dow Chemical Company
Saturday, February 16
Sat, Feb 16
7:30 AM - 2:30 PM
3rd Floor Registration Counter S
Registration
Registration
Sat, Feb 16
7:30 AM - 1:00 PM
St. James Ballroom
Exhibits Open
Exhibits
Sat, Feb 16
8:00 AM - 9:15 AM
St. James Ballroom
PS3 - Poster Session 3 and Continental Breakfast
Poster Session
Chair(s): Charles Minard, Baylor College of Medicine
Sat, Feb 16
9:15 AM - 10:45 AM
Camp
Chair(s): Caitlin Mary Cunningham, Le Moyne College
Sat, Feb 16
9:15 AM - 10:45 AM
Canal
Chair(s): Ella Revzin, Precima
Sat, Feb 16
9:15 AM - 10:45 AM
Jackson
Chair(s): Birol Emir, Pfizer Inc.
Sat, Feb 16
9:15 AM - 10:45 AM
Magazine
Chair(s): Naomi B. Robbins, NBR
Sat, Feb 16
11:00 AM - 12:30 PM
Camp
Chair(s): Kayéromi Gomez, University of Illinois College of Medicine
Sat, Feb 16
11:00 AM - 12:30 PM
Canal
CS22 - Behind the Model: Modeling Approaches and Strategies
Concurrent Session
Chair(s): Steven B. Cohen, RTI International
Sat, Feb 16
11:00 AM - 12:30 PM
Jackson
Chair(s): Raja Velu, Syracuse University
Sat, Feb 16
11:00 AM - 12:30 PM
Magazine
Chair(s): Sejong Bae, University of Alabama at Birmingham
Sat, Feb 16
12:30 PM - 2:00 PM
Lunch (On Own)
Other
Sat, Feb 16
2:00 PM - 4:00 PM
Camp
PCD1 - Introduction to Structural Equation Modeling Using Stata
Practical Computing Demo
Instructor(s): Chuck Huber, StataCorp
This workshop introduces the concepts and jargon of structural equation modeling (SEM), including path diagrams, latent variables, endogenous and exogenous variables, and goodness of fit. I will describe the similarities and differences between Stata's -sem- and -gsem- commands. Then I demonstrate how to fit many familiar models such as linear regression, multivariate regression, logistic regression, confirmatory factor analysis, and multilevel models using -sem- and -gsem-. I conclude by demonstrating how to fit structural equation models that contain both structural and measurement components.
Outline & Objectives
Participants will learn about the following concepts and tools:
Observed and latent variables
Exogenous and endogenous variables
Recursive and nonrecursive models
Model assumptions
Checking the fit of a structural equation model
How to draw a path diagram using Stata’s SEM Builder
How to use Stata’s -sem- command syntax
How to use Stata’s -gsem- command syntax
Differences and similarities between -sem- and -gsem-
How to fit structural equation models by group
How to constrain model parameters
How to fit a mediation model using SEM
How to estimate descriptive statistics such as sample means, variances, and correlations with SEM
How to fit familiar models such as linear and logistic regression using SEM
How to fit confirmatory factor analysis (CFA) models using SEM
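The course itself uses Stata's -sem- and -gsem- commands, which are not reproduced here. As a rough R analogue of the CFA workflow listed above, a one-factor confirmatory model might look like the following sketch (the lavaan package and its bundled HolzingerSwineford1939 data are illustrative assumptions, not course materials):

```r
# Rough R analogue of a CFA fit; the course demonstrates this in Stata's -sem-.
library(lavaan)

# One latent "visual" factor measured by three observed indicators
model <- 'visual =~ x1 + x2 + x3'

fit <- cfa(model, data = HolzingerSwineford1939)
summary(fit, fit.measures = TRUE)  # loadings plus fit indices such as CFI and RMSEA
```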
About the Instructor
Chuck Huber is a Senior Statistician at StataCorp and Adjunct Associate Professor of Biostatistics at the Texas A&M School of Public Health. In addition to working with Stata's team of software developers, he produces instructional videos for the Stata YouTube channel, writes blog entries, develops online NetCourses and gives talks about Stata at conferences and universities. Most of his current work is focused on statistical methods used by psychologists and other behavioral scientists. He has published in the areas of neurology, human and animal genetics, alcohol and drug abuse prevention, nutrition and birth defects. Dr. Huber currently teaches introductory biostatistics at Texas A&M where he previously taught categorical data analysis, survey data analysis, and statistical genetics.
Relevance to Conference Goals
Structural equation modeling has become increasingly popular for modeling the interrelationships among a group of variables. Many researchers use SEM to understand causal relationships in complex systems. This talk introduces this powerful tool using the popular statistical package Stata.
Sat, Feb 16
2:00 PM - 4:00 PM
Jackson
PCD2 - Interfacing R with Excel in Two Different Ways
Practical Computing Demo
Thanks to its popularity and user-friendly environment, Microsoft Excel is widely used to gain data insights and make better decisions. However, compared to mainstream statistical software such as R, Excel lacks advanced statistical tools, whether used on their own or integrated into procedures. R, on the other hand, is code-driven software with a steep learning curve. To interface the unlimited statistical possibilities of R with the user-friendly environment of Excel, two features have recently been developed within the XLSTAT software: 1) XLSTAT-R helps programmers develop user-friendly dialog boxes in Excel that let users launch customized R procedures directly on data selected in Excel with their mouse. 2) The XLSTAT-RNotebook allows writing R code in Excel cells, with the possibility of capturing data in the form of Excel cell ranges. The outputs are also displayed in Excel, making it possible to create complex dashboards or reports in Excel built from R code. The procedures created this way can then be used by colleagues, students or clients who don’t necessarily know how to code. This tutorial shows how developers can build customized R procedures in an Excel dialog box or directly in Excel cells using XLSTAT.
Basic coding skills are required (preferably R).
Outline & Objectives
Outline:
1. Introduction to XLSTAT-R and the XLSTAT-RNotebook.
2. Application: Making the pam{cluster} R function available in an Excel dialog box and adding the possibility to customize several options and charts from within the dialog box.
3. Application: Developing a customized R-based dashboard in an Excel sheet using the XLSTAT-RNotebook.
Objectives:
At the end of this tutorial, participants will understand the basics of XLSTAT-R or the XLSTAT-RNotebook, used to develop R-based statistical applications or dashboards in Excel.
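The XLSTAT-R integration itself is point-and-click, but the application in step 2 of the outline ultimately wraps a plain R call. A minimal sketch of that underlying call (the iris data and k = 3 are illustrative choices, not tutorial materials):

```r
# The plain R function that an XLSTAT-R dialog box for pam{cluster} would wrap.
library(cluster)

# Partition the four numeric iris measurements into k = 3 medoid-based clusters
fit <- pam(iris[, 1:4], k = 3)

table(fit$clustering, iris$Species)  # cross-tabulate clusters against known species
```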
About the Instructor
Jean Paul Maalouf (PhD) is an independent statistical consultant with 10 years of experience. He worked for four years at Addinsoft as the brand manager of the XLSTAT software, a leader in statistical software for Excel. He contributed substantially to the development of the XLSTAT-R engine and has created many of the default XLSTAT-R procedures included in XLSTAT solutions.
Relevance to Conference Goals
The open-source R software is known for its steep learning curve. Data-inspired decision makers often prefer relying on dashboards or user-friendly environments such as Microsoft Excel. This tutorial shows how data science, data analysis and modeling procedures built in R can be made available to any Excel user thanks to XLSTAT-R and the XLSTAT-RNotebook. These developments are possible under different collaboration scenarios. Chief programming statisticians are able to customize applications for decision makers. Consultants are able to set up Excel applications tailored to the specific needs of their customers. Professors are able to develop customized statistical programs in Excel to illustrate their courses.
Sat, Feb 16
2:00 PM - 4:00 PM
Royal
Increasingly complex observational studies are commonplace in numerous data science settings, including biomedical, health services, pharmaceutical, insurance and online advertising. To adequately estimate causal effect sizes, proper control of known potential confounders is critical. Having gained enormous popularity in recent years, propensity score methods are powerful and elegant tools for estimating causal effects. Without assuming prior knowledge of propensity score methods, this short course will use simulated and real data examples to introduce and illustrate important techniques involving propensity scores, such as weighting, matching and sub-classification. Relevant R and SAS software packages for implementing data analyses will be discussed in detail. Specific topics to be covered include guidelines on how to construct a propensity score model, create matched pairs for binary group comparisons, assess baseline covariate balance after matching and use inverse propensity score weighting techniques. Illustrative examples will accompany each topic, and a brief review of recent relevant developments and their implementation will also be discussed.
Outline & Objectives
Outline:
- Observational Studies: definition, examples, causal effects, confounding control.
- Propensity Scores: definition, properties, modeling techniques.
- Propensity Score Approaches in Observational Studies: weighting, matching, sub-classification; graphical methods to assess covariate balance after matching;
- Illustration of these techniques using R packages MatchIt, Matching and optmatch, as well as SAS PROCs CAUSALTRT and PSMATCH.
- Guidelines on how to best describe the methodology utilized and the results obtained when presenting to a non-technical audience.
- Brief review of most recent methods developments and discussion of their potential for immediate use in practice.
Objectives: The first objective is to provide an example-centered overview of the most commonly used propensity score-based methods in observational studies. The second objective is to present the practical implementation of these methods and highlight the newly developed SAS PROCs CAUSALTRT and PSMATCH. The third objective is to discuss the advantages and disadvantages associated with these methods.
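As a flavor of the MatchIt material in the outline, nearest-neighbor propensity score matching reduces to a few lines of R. This is a hedged sketch only: the covariates chosen below are illustrative, and the course's own examples may differ.

```r
# Minimal nearest-neighbor propensity score matching sketch with MatchIt,
# using its bundled lalonde data; covariate choice is for illustration only.
library(MatchIt)
data("lalonde", package = "MatchIt")

m.out <- matchit(treat ~ age + educ + married,
                 data = lalonde, method = "nearest")

summary(m.out)                 # covariate balance before and after matching
plot(m.out, type = "jitter")   # graphical check of propensity score overlap
```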
About the Instructor
Dr. Andrei received a Ph.D. degree in Biostatistics from the University of Michigan in 2005. He is currently an Associate Professor in the Department of Preventive Medicine at Northwestern University, where he enjoys successful collaborations in cardiovascular outcomes research. He has developed expertise in MSMs and published relevant studies in adult cardiac surgery. He has developed practice-inspired and -oriented statistical methods in survival analysis, recurrent events, group sequential monitoring methods, hierarchical methods, and predictive modeling. In the last 15 years, Dr. Andrei has collaborated with medical researchers in fields such as pulmonary/critical care, organ transplantation, nursing, prostate and breast cancer, anesthesiology and thoracic surgery. Currently, he serves as Statistical Co-Editor of the Journal of the American College of Surgeons and deputy Statistical Editor of the Journal of Thoracic and Cardiovascular Surgery.
Relevance to Conference Goals
Upon attending this short course, participants will gain familiarity with propensity score-based methods for estimating causal effects in observational studies. Implementation in R and SAS software will be covered in detail, which will permit participants to integrate these useful data science techniques into their professional activities and projects. Learning how to produce simple yet powerful graphics to assess the propensity score model adequacy, check covariate balance and display the results will undoubtedly benefit every participant. By leveraging their enhanced set of skills, individuals across industries will be adequately positioned to become more effective communicators in their interactions with customers and clients. Continued professional development is key to one’s career growth and can enhance the overall analytical capabilities within their respective organizations and institutions.
Sat, Feb 16
2:00 PM - 4:00 PM
Commerce
Instructor(s): Jim Harner, West Virginia University
This tutorial covers the data science process using R as a programming language and Spark as a big-data platform. Powerful workflows are developed for data extraction, data tidying and transformation, data modeling, and data visualization.
During the course, R-based examples show how data is transported from data sources into the Hadoop Distributed File System (HDFS), into relational databases, and directly into Spark's real-time compute engine. Workflows using `dplyr' verbs are used for data manipulation within R, within relational databases (PostgreSQL), and within Spark using `sparklyr'. These data-based workflows extend into machine learning algorithms, model evaluation, and data visualization.
The machine learning algorithms include supervised techniques such as linear regression, logistic regression, gradient-boosted trees, and random forests. Feature selection is done primarily by regularization and models are evaluated using various metrics. Unsupervised techniques include k-means clustering and dimension reduction.
Big-data architectures are discussed including the Docker containers used for building the short-course infrastructure called RSpark.
Outline & Objectives
Modules:
1. Fundamentals: Linux; RSpark; RStudio; Git; Data Science Process [20 min]
2. Data Sources: Text; JSON; PostgreSQL; Web [20 min]
3. Data Transformation: Data Cleaning; `tidyr'; `dplyr' [20 min]
4. Hadoop: HDFS as a Persistent Data Store for Spark [30 min]
5. `sparklyr': Spark DataFrames; `dplyr' Interface [30 min]
6. Supervised Learning: Regression and Classification Workflows with Spark [60 min]
7. Unsupervised Learning: Dimension Reduction and Clustering with Spark [30 min]
The first three modules will not be covered in detail since the focus is on the last four. However, the content in modules 1–3 contains critical information for understanding the later modules.
The objectives of this course are to:
• extract static and streaming data from data sources,
• transform data into structured form,
• load data into relational and persistent, distributed data stores,
• build models using machine learning algorithms,
• validate and test models based on evaluation metrics,
• visualize big data and model metrics.
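The `sparklyr'/`dplyr' pairing at the heart of modules 5–6 can be sketched in a few lines. This is a minimal local-mode example, not the course's RSpark infrastructure, and it assumes a local Spark installation (e.g., via sparklyr::spark_install()):

```r
# Minimal sparklyr workflow: dplyr verbs are translated to Spark SQL and
# executed inside Spark; only the small summary is pulled back into R.
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars)   # copy an R data frame into Spark

mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg)) %>%
  collect()                         # bring the aggregated result back to R

spark_disconnect(sc)
```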
About the Instructor
E. James Harner is Professor Emeritus of Statistics and Adjunct Professor of Business Data Analytics at West Virginia University. He was the Chair of the Department of Statistics for 17 years and the Director of the Cancer Center Bioinformatics Core for 15 years. Currently, he is the Chairman of the Interface Foundation of North America which has partnered with the American Statistical Association to organize the annual Symposium on Data Science and Statistics (SDSS). The areas of his technical and research expertise include: bioinformatics, high-dimensional modeling, high-performance computing, streaming and big data modeling, and statistical machine learning.
This course is based on a two-day workshop developed for the National Institute of Statistical Sciences (NISS): https://www.niss.org. The two-day version has been successfully taught three times (at ASA headquarters and at UC Riverside in September, 2017 and at the U. of Toronto in April, 2018). A one-day version of this course will be taught at the Symposium on Data Science and Statistics in May, 2018 and at the Joint Statistical Meeting in August/September, 2018.
Relevance to Conference Goals
Unlike many data science short courses, RSpark provides full big-data platforms (R, Hadoop, and Spark, together with their ecosystems). Most instructors cannot offer this because the infrastructure is difficult to build. Thus, attendees will get a realistic taste of what data science really is.
The full data science process is taught, but the focus is on machine learning and the underlying R code. What is taught is a realistic representation of what is done in practice.
Communication of results is done through reproducible reports and data visualizations, which are often the endpoints of pipelines in R and Spark. Collaboration is primarily done using Git and GitHub, although code sharing within RStudio is also discussed. Data science in practice is almost always a team effort, and parts of this collaboration are taught.
This course offers a unique opportunity for professional development since a real data science platform is used. It is possible to scale RSpark using container orchestration, but the containers used within this course are essentially indistinguishable from a production environment.
Sat, Feb 16
2:00 PM - 4:00 PM
Canal
Instructor(s): Amy Yang, Uptake
It is time to take the next step and start wrapping all the utility functions scattered across numerous .R files into R packages, to help with code organization, distribution, and consistent documentation.
In this hands-on tutorial, I will introduce step-by-step how to build your very own R package. If you've used R, you've almost certainly used a package - but did you know that building your own package is actually not hard at all? If you have written bits of useful code you want to keep and return to, you might want a package.
After this session, participants will have the skills to start a package and document their functions, and resources to use for next steps like vignettes and unit testing. During the tutorial, participants can follow along using provided scripts.
Outline & Objectives
This hands-on tutorial includes the following sections:
1. Setup R and install required packages
2. Create the framework for your package
3. Add functions to the package
4. External dependencies
5. Documentation
6. Install and use your package
7. (Bonus) Distribute your package on GitHub
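The numbered steps above map onto a handful of R commands. One possible modern workflow is sketched below; the choice of usethis/devtools, and the package and file names, are assumptions for illustration, and the tutorial's provided scripts may differ:

```r
# One possible workflow for the outline's steps, using usethis and devtools.
usethis::create_package("mypkg")   # 2. create the package framework
usethis::use_r("greet")            # 3. open R/greet.R to hold a function
usethis::use_package("stringr")    # 4. declare an external dependency in DESCRIPTION
devtools::document()               # 5. build help files from roxygen2 comments
devtools::install()                # 6. install the package so library(mypkg) works
usethis::use_github()              # 7. (bonus) create and push a GitHub repo
```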
About the Instructor
Amy Yang is a Sr. Data Scientist at Uptake, where she conducts industrial analytics and builds prediction models for major industries, helping them increase productivity, security, safety and reliability. She began using R for simulation and statistical analysis during her studies at the University of Pennsylvania, where she received her MS degree in Biostatistics. She also teaches R programming and statistical courses for graduate students. You can find her on Twitter at @ayanalytics.
Outside of work, Amy co-organizes the Chicago RLadies meetup group, where she helps promote R by inviting women speakers from different data science fields to give talks. Her goal is to create a friendly network among women who use R!
Amy also mentors PhD and master students on their quantitative dissertations. She enjoys the teaching aspect of doing Data Science.
Relevance to Conference Goals
The tutorial is relevant to the conference theme in the following areas.
1. Communication and Collaboration
No more emailing .R scripts! An R package gives you an easy way to distribute code to others, especially if you put it on GitHub.
2. Consistent documentation
I can barely remember what half of my functions do, let alone their inputs and outputs. An R package provides a great, consistent documentation structure and actually encourages you to document your functions.
3. Code Organization and reproducibility
Are you trying to figure out where that “function” you wrote months, weeks, or even days ago ended up? Oftentimes, people in statistics end up just rewriting it because that is faster than searching all the .R files. An R package helps organize where your functions go.
Sat, Feb 16
2:00 PM - 4:00 PM
Magazine
T4 - Simulation Design and Reporting with Applications to Drug Development
Tutorial
Instructor(s): Greg Cicconetti, AbbVie; Inna Perevozskaya, GlaxoSmithKline
Simulation methods have become an increasingly important tool in the search for more efficient clinical trial designs and/or statistical analysis procedures. During our short course we will provide a road map to developing and executing a successful simulation plan and communicating these results with a broader team. We will begin with a survey of problems one might encounter during the design, monitoring and analysis stages of a clinical trial for which a simulation study may provide some insight. We continue with an introduction to standard methods for generating random data. This discussion will include methods to mimic real-world data that do not adhere to standard statistical distributions, methods to introduce correlation among endpoints, parametric and non-parametric bootstrapping techniques, and the use of historic data to simulate future data. Having established this foundation, we return to some of our motivating problems and discuss their simulation-based solutions in greater depth. Though extensive R code will be provided to supplement this tutorial, our emphasis will be on the important concepts and principles of good simulation design and reporting.
Outline & Objectives
Tentative Course Outline: a subset of topics may be replaced with more contemporary materials
• Welcome and introduction
• Some motivation for simulation
• Modeling randomness
• Enrollment modeling
• Simulating correlated data
• An application using simulated correlated endpoints
• Leveraging historic data to aid in simulation
• Case study: Robustness of efficacy to early withdrawers in an outcomes study
• Case Study: Recurrent events
• Simulation Size – How large is large?
• Closing remarks
Course Objectives:
• Provide an introduction to statistical simulation
• Contrast theory and iterative problem solving
• Demonstrate simulation concepts via examples
• Simulation planning
• Communicating & drawing inferences from simulation
• Focus is not on coding and syntax or deep theory
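The course promises extensive R code; as a small taste of the "simulating correlated data" topic above, two correlated normal endpoints can be drawn from a multivariate normal. This is a sketch only, with an arbitrary target correlation of 0.5, not course material:

```r
# Simulate two correlated normal endpoints via MASS::mvrnorm.
library(MASS)

set.seed(42)
Sigma <- matrix(c(1.0, 0.5,
                  0.5, 1.0), nrow = 2)          # target covariance/correlation matrix
x <- mvrnorm(n = 10000, mu = c(0, 0), Sigma = Sigma)

cor(x)   # empirical correlation should be close to the target 0.5
```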
About the Instructor
Greg Cicconetti, Ph.D., Statistical Innovations, Data and Statistical Sciences, AbbVie. Greg began his career as an assistant professor of statistics at Muhlenberg College before joining the pharmaceutical industry in 2005. In his roles at GlaxoSmithKline and AbbVie, Greg has gained extensive experience in survival and longitudinal trials, Bayesian methodology, and statistical learning. He has used simulation to guide teams regarding trial design, monitoring, and sensitivity analyses. In his current position Greg assists study teams in determining decision criteria to be used at interim analyses, effectively marrying simulation and visualization to build team consensus. Portions of the planned course material were delivered at the 2014 Deming Conference and also used in the graduate level Advanced Statistical Computing course at Drexel University taught by Greg in 2015. Greg is also a member of the DIA Scientific Working Group on Adaptive Designs and has participated in the development of a manuscript, along with other industry experts, advocating best practices in simulation reporting.
Relevance to Conference Goals
While this course is intended to be an introduction to simulation design and reporting, the attendee will be exposed to new statistical methodologies currently being employed to support on-going trials. Our discussion on simulation reporting will emphasize the importance of clearly articulating one's simulation design and summarizing pertinent simulation output in a way that facilitates collaboration with multiple stakeholders. Although we will use drug development and clinical trial design as a backdrop for explaining important simulation concepts, the core ideas presented should readily translate to those in other fields.
Sat, Feb 16
4:15 PM - 5:30 PM
Jackson
GS2 - Closing General Session
General Session
Chair(s): Kim Love, K. R. Love Quantitative Consulting and Collaboration
The Closing Session is an opportunity for you to interact with the CSP Steering Committee in an open discussion about how the conference went and how it could be improved in future years. CSPSC vice chair, Kim Love, will lead a panel of committee members as they summarize their conference experience. The audience will then be invited to ask questions and provide feedback. The committee highly values suggestions for improvements gathered during this time. The best student poster will also be awarded during the Closing Session, and each attendee will have an opportunity to win a door prize.