Viewing session type: Short Course (full day)
Thursday, February 14
Thu, Feb 14
8:00 AM - 5:30 PM
Commerce
Instructor(s): Frank Harrell, Vanderbilt University
All standard regression models have assumptions that must be verified for the model to have power to test hypotheses and for it to be able to predict accurately. Of the principal assumptions (linearity, additivity, distributional), this course will emphasize methods for assessing and satisfying the first two. Practical but powerful tools are presented for validating model assumptions and presenting model results. This course provides methods for estimating the shape of the relationship between predictors and response using the widely applicable method of augmenting the design matrix using restricted cubic splines. Even when assumptions are satisfied, overfitting can ruin a model's predictive ability for future observations. Methods for data reduction will be introduced to deal with the common case where the number of potential predictors is large in comparison with the number of observations. Methods of model validation (bootstrap and cross-validation) will be covered, as will auxiliary topics such as modeling interaction surfaces, variable selection, overly influential observations, collinearity, and shrinkage, and a brief introduction to the R rms package for handling these problems. The methods covered will apply to almost any regression model, including ordinary least squares, logistic regression models, ordinal regression, quantile regression, longitudinal data analysis, and survival models.
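As a rough illustration of what "augmenting the design matrix using restricted cubic splines" can mean, the sketch below builds an unnormalized restricted cubic spline basis with NumPy and fits it by ordinary least squares. It is not taken from the course materials or from the R rms package: the function name `rcs_basis`, the simulated data, and the choice of five knots at fixed quantiles are all assumptions made for the example.

```python
import numpy as np


def rcs_basis(x, knots):
    """Restricted cubic spline basis (hypothetical helper, unnormalized).

    Returns a matrix with k - 1 columns for k knots: x itself plus
    k - 2 nonlinear terms constrained to be linear beyond the outer knots.
    """
    x = np.asarray(x, dtype=float)
    t = np.sort(np.asarray(knots, dtype=float))
    k = len(t)
    if k < 3:
        raise ValueError("need at least 3 knots")

    def pos_cube(u):
        # Truncated power function (u)_+^3
        return np.maximum(u, 0.0) ** 3

    d = t[-1] - t[-2]  # distance between the last two knots
    cols = [x]
    for j in range(k - 2):
        term = (
            pos_cube(x - t[j])
            - pos_cube(x - t[-2]) * (t[-1] - t[j]) / d
            + pos_cube(x - t[-1]) * (t[-2] - t[j]) / d
        )
        cols.append(term)
    return np.column_stack(cols)


# Simulated data (assumption: for illustration only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.normal(0, 0.3, 200)

# Five knots placed at fixed quantiles of x (an assumed, commonly used choice)
knots = np.quantile(x, [0.05, 0.275, 0.5, 0.725, 0.95])

# Augment the design matrix: intercept, linear term, and spline terms
X = np.column_stack([np.ones_like(x), rcs_basis(x, knots)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
```

With k knots the predictor contributes k - 1 columns to the design matrix, so the nonlinear fit is still an ordinary linear model in the coefficients and can be used with any regression method that accepts a design matrix.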
Outline & Objectives
1. Introduction; Advantages of prediction over classification
2. Hypothesis Testing vs. Estimation vs. Prediction vs. Classification
3. How Many Degrees of Freedom does a Data Mining Procedure Actually Have?
4. Regression Model Notation
5. Model Formulations
6. Interpreting Model Parameters
(a) Nominal Predictors
(b) Interactions
7. Relaxing Linearity Assumption for Continuous Predictors
(a) Categorization is not an alternative
(b) Simple Nonlinear Terms
(c) Splines for Estimating Shape of Regression Function and Determining Predictor Transformations
(d) Cubic Spline Functions
(e) Restricted Cubic Splines
(f) Choosing Number and Position of Knots
(g) Nonparametric smoothers and regression trees
(h) Advantages of Splines over Other Methods
8. Multiple Degree of Freedom Tests of Association
9. Assessment of Model Fit
(a) Regression Assumptions
(b) Modeling and Testing Interactions
10. Multivariable Modeling Strategy
(a) Why and How To Pre-specify Model Complexity
(b) Problems Caused by Ordinary Stepwise Variable Selection
(c) Collinearity
(d) Shrinkage
(e) Data Reduction
(f) Overly Influential Observations
(g) Some Useful Modeling Strategies for
i. Prediction
ii. Estimation
iii. Hypothesis Testing
11. Overview of the Bootstrap
12. Model Validation
(a) Cross-validation
(b) Bootstrap
13. Graphical Methods for Interpreting Complex Regression Fits
14. Detailed Case Studies
(a) Generalized Least Squares for Serial Data
(b) Ordinal Regression for Continuous Y: Predicting glycohemoglobin
(and pre-diabetes) from body size characteristics using NHANES
data
(c) Binary Logistic Regression: Survival Patterns of Passengers on
the Titanic
(d) Survival Modeling
A more detailed outline is available at biostat.mc.vanderbilt.edu/rms.
About the Instructor
Dr. Harrell is Professor of Biostatistics, Founding Chair of the Department of Biostatistics of Vanderbilt University School of Medicine, and Expert Statistical Advisor, Office of Biostatistics, Center for Drug Evaluation and Research, US FDA. Prior to starting the new department in 2003 he was Chief of the Division of Biostatistics and Epidemiology in the Department of Health Evaluation Sciences, University of Virginia School of Medicine. Prior to coming to the University of Virginia in 1996 he was in the Division of Biometry at Duke University Medical Center for 17 years. He received his Ph.D. in biostatistics from the University of North Carolina, Chapel Hill in 1979, where he studied under P.K. Sen. Dr. Harrell's interests include statistical modeling and model validation, statistical computing and graphics, reproducible research, survival analysis, clinical trials, health services and outcomes research, medical diagnostic and prognostic models, bootstrapping, missing data, and Bayesian modeling. He is an associate editor of Statistics in Medicine, a member of the editorial board for American Heart Journal, a member of Faculty of 1000 Medicine, on the editorial policy board for the Journal of Clinical Epidemiology, and a member of the Scientific Advisory Board for Science Translational Medicine. For many years he has been a consultant to FDA and the pharmaceutical industry. He is author of the book Regression Modeling Strategies, Second Edition (Springer, 2015) and teaches courses in biostatistical modeling. He was the recipient of the American Statistical Association's W.J. Dixon Award for Excellence in Statistical Consulting in 2014.
Relevance to Conference Goals
This is an applied statistics course that teaches regression analysis and predictive modeling tools that have wide applicability, and should be of great value to almost all practicing statisticians.
Thu, Feb 14
8:00 AM - 5:30 PM
Royal
SC2 - Big Data, Data Science, and Deep Learning for Statisticians
Short Course (full day)
Instructor(s): Ming Li, Amazon; Hui Lin, Netlify
With the recent big data, data science, and deep learning revolution, enterprises ranging from FORTUNE 100 companies to startups across the world are hungry for data scientists and machine learning scientists who can draw actionable insight from the vast amount of data collected. In the past couple of years, deep learning has gained traction in many application areas and has become an essential tool in the data scientist’s toolbox. In this course, students will develop a clear understanding of the big data cloud platform, technical skills in data science and machine learning, and especially the motivation for and use cases of deep learning through hands-on exercises. We will also cover the “art” of data science and machine learning, guiding participants through the typical agile data science project flow, general pitfalls in data science and machine learning, and the soft skills needed to communicate effectively with business stakeholders. This course will prepare statisticians to be successful data scientists and deep learning scientists in various industries and business sectors.
Outline & Objectives
The big data platform, data science, and deep learning overviews are specifically designed for an audience with a statistics education background. The data science workflow, pitfalls, and soft skills are highlighted through real-world data science and machine learning problems. The Databricks community edition cloud platform will be used throughout the training course for hands-on sessions including: (1) the big data platform using Spark through the R sparklyr package; (2) an introduction to Deep Neural Networks, Convolutional Neural Networks, and Recurrent Neural Networks and their applications; (3) deep learning examples using TensorFlow through the R keras package. The primary audiences for this course are: (1) statisticians in traditional industry sectors such as manufacturing, pharmaceuticals, and banking; (2) statisticians in government agencies; (3) statistical researchers in universities; (4) graduate students in statistics departments. The prerequisites are an MS-level education in statistics and entry-level knowledge of R. No software installation is needed on students’ laptops; the cloud platform is accessed through a browser such as Chrome or Firefox with an internet connection.
About the Instructor
Both instructors hold Ph.D.s in Statistics from Iowa State University and have worked in data science and machine learning for a number of years. Dr. Li is a Sr. Data Scientist at Amazon and Dr. Lin is a Data Scientist at Netlify. Before Amazon, Dr. Li was at Walmart, SAS, and GE, and he was the 2017 Chair of the Quality and Productivity Section of the ASA. Dr. Lin was a leader at DuPont in applying advanced data science to enhance marketing and sales effectiveness; she is the co-founder of the Central Iowa R User Group and the blogger behind scientistcafe.com. With deep statistics backgrounds and several years of industrial experience in data science, they have trained and mentored numerous junior data scientists with diverse backgrounds. They taught a similar continuing education course, without the deep learning part, at the 2017 JSM, and they will teach similar courses at the Joint Research Conference, the ICSA Applied Statistics Symposium, and the Fall Technical Conference in 2018. Dr. Li organized and will present at the Introductory Overview Lecture “Leading Data Science: Talent, Strategy, and Impact” at the 2018 JSM. Dr. Li is also an instructor at Amazon’s internal Machine Learning University.
Relevance to Conference Goals
This short course fits the conference goals very well. It focuses on Big Data and Data Science applications in real-world problems, including recent developments in deep learning. With the focus on the cloud platform, students can learn the current trends in data science software and big data infrastructure used by tech companies, so that they can expand their programming scope to cover more applications in data science and machine learning. The short course also includes the soft-skill discussions needed to give students a better understanding of the data science project flow, pitfalls in machine learning, and communication skills. This course keeps the statistician’s background in mind to bridge the gap between a traditional statistician and a successful data scientist. After taking the course, students will be confident that they can positively impact their organization by transforming their current traditional statistics team into a data science or machine learning team, or that they can explore data scientist or machine learning scientist opportunities for their future career development.
Thu, Feb 14
8:00 AM - 5:30 PM
Canal
Instructor(s): Richard D. De Veaux, Williams College
This seminar is a practical introduction to and an overview of the techniques and strategies of data mining. While I will discuss the models in detail, the course will be applied rather than theoretical in orientation. Many of the standard techniques of data mining will be presented, including modern model selection strategies for multiple regression such as the lasso and elastic net. In addition, we'll cover classification and regression trees, neural networks, principal components, Naïve Bayes, bagging, and boosting. The course will be problem-solving based, using real case studies from science and industry to illustrate which methods work well, when, and why. We will emphasize problem formulation, the challenges of the process, and the communication back to decision makers that is necessary to effect maximum impact in the organization. No prerequisites other than a knowledge of the basics of regression are assumed. The applications will come from a wide variety of industries and include some from my personal experience as a consultant for companies in areas such as financial services, chemical processing, pharmaceuticals, and insurance.
Outline & Objectives
Outline:
1. Introduction to data mining
a. What is data mining?
b. What are the applications?
c. How does it differ from statistics?
2. Formulating the problem
a. Data considerations
b. How to evaluate the methods
c. Testing and training
3. The methods -- overview of the most commonly used algorithms
4. Case studies
a. In-depth comparisons of the methods and how they helped solve the problem
b. Challenges to communication
5. Summary
Learning Objectives
(a) Learning outcomes (performance objectives): In the process of analyzing the data sets, attendees will learn how to:
• Identify appropriate problems for data mining
• Explore and prepare data for mining
• Use a variety of techniques, including decision trees and neural nets, to build accurate predictive models
• Evaluate the quality of models
• Select the appropriate data mining tools for applications
(b) Content and instructional methods: The presentation provides interaction between the participant and the material by involving audience participation in the data analyses.
About the Instructor
Richard De Veaux (Dick), Ph.D., is the C. Carlisle and Margaret Tippit Professor of Statistics at Williams College. He holds degrees in Civil Engineering (B.S.E. Princeton), Mathematics (A.B. Princeton), Dance Education (M.A.) and Statistics (Ph.D.) at Stanford, where he studied statistics with Persi Diaconis and dance with Inga Weiss. Dick has taught at the Wharton School and Princeton University and has been a visiting researcher at INRA in Montpellier and a visiting professor at Paris V. De Veaux has won numerous teaching awards from the Engineering Council at Princeton. He has won both the Wilcoxon and Shewell (twice) awards from the ASQ, is a fellow of the ASA, and is an elected member of the ISI. In 2006-2007 he was the William R. Kenan Jr. Visiting Professor for Distinguished Teaching at Princeton. In 2008 he was named the Statistician of the Year by the Boston Chapter of the ASA. He has served on the Board of Directors of the ASA, is past chair of the Section on Statistical Learning and Data Science, and is the 2019-2021 Vice President. Dick has been a consultant for over 30 years for such Fortune 500 companies as Hewlett-Packard, Alcoa, American Express, Bank One, and GlaxoSmithKline.
Relevance to Conference Goals
Directly relevant to themes of practical issues in big data.