Viewing session type: Tutorial
2:00 PM - 4:00 PM
Increasingly complex observational studies are commonplace in numerous data science settings, including biomedical research, health services, pharmaceuticals, insurance and online advertising. To adequately estimate causal effect sizes, proper control of known potential confounders is critical. Having gained enormous popularity in recent years, propensity score methods are powerful and elegant tools for estimating causal effects. Without assuming prior knowledge of propensity score methods, this short course will use simulated and real data examples to introduce and illustrate important techniques involving propensity scores, such as weighting, matching and sub-classification. Relevant R and SAS software packages for implementing data analyses will be discussed in detail. Specific topics to be covered include guidelines on how to construct a propensity score model, create matched pairs for binary group comparisons, assess baseline covariate balance after matching and use inverse propensity score weighting techniques. Illustrative examples will accompany each topic, and a brief review of recent relevant developments and their implementation will also be provided.
- Observational Studies: definition, examples, causal effects, confounding control.
- Propensity Scores: definition, properties, modeling techniques.
- Propensity Score Approaches in Observational Studies: weighting, matching, sub-classification; graphical methods to assess covariate balance after matching.
- Illustration of these techniques using R packages MatchIt, Matching and optmatch, as well as SAS PROCs CAUSALTRT and PSMATCH.
- Guidelines on how to best describe the methodology utilized and the results obtained when presenting to a non-technical audience.
- Brief review of most recent methods developments and discussion of their potential for immediate use in practice.
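As a minimal sketch of the inverse propensity score weighting and covariate balance checks listed above, the following Python example uses simulated data with a single confounder. It is not taken from the course materials, which rely on R and SAS. Two assumptions are worth flagging: the true propensity score is used directly (in practice it would be estimated, for example with a logistic regression model), and the true treatment effect is fixed at 2.0 so the estimates can be compared against it.

```python
import math
import random
import statistics


def simulate(n, seed=0):
    """Simulate one confounder x, a binary treatment t, and an outcome y.

    x raises both the chance of treatment and the outcome, so it confounds
    the naive comparison. The true treatment effect is 2.0 (an assumption
    of this sketch).
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        e = 1.0 / (1.0 + math.exp(-x))  # true propensity score P(t = 1 | x)
        t = 1 if rng.random() < e else 0
        y = 2.0 * t + x + rng.gauss(0.0, 1.0)
        rows.append((x, t, y, e))
    return rows


def arm_means(rows, column, weighted):
    """Return (treated mean, control mean) of a column, optionally weighted."""
    means = []
    for arm in (1, 0):
        arm_rows = [r for r in rows if r[1] == arm]
        if weighted:
            # Weight by the inverse probability of the treatment received.
            weights = [1.0 / r[3] if arm == 1 else 1.0 / (1.0 - r[3])
                       for r in arm_rows]
        else:
            weights = [1.0] * len(arm_rows)
        total = sum(r[column] * w for r, w in zip(arm_rows, weights))
        means.append(total / sum(weights))
    return means[0], means[1]


def effect_estimate(rows, weighted):
    """Difference in mean outcome between treated and control."""
    treated, control = arm_means(rows, 2, weighted)
    return treated - control


def standardized_difference(rows, weighted):
    """Difference in mean x between arms, in units of the pooled SD."""
    treated, control = arm_means(rows, 0, weighted)
    var_t = statistics.variance([r[0] for r in rows if r[1] == 1])
    var_c = statistics.variance([r[0] for r in rows if r[1] == 0])
    return (treated - control) / math.sqrt((var_t + var_c) / 2.0)


rows = simulate(20000)
print("naive effect:   ", round(effect_estimate(rows, weighted=False), 2))
print("weighted effect:", round(effect_estimate(rows, weighted=True), 2))
print("SMD before:     ", round(standardized_difference(rows, False), 2))
print("SMD after:      ", round(standardized_difference(rows, True), 2))
```

In this setup the naive difference in means overstates the effect because treated subjects tend to have larger x, while the weighted estimate should land close to 2.0 and the weighted standardized difference close to zero.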
Objectives: The first objective is to provide an example-centered overview of the most commonly used propensity score-based methods in observational studies. The second objective is to present the practical implementation of these methods and highlight the newly developed SAS PROCs CAUSALTRT and PSMATCH. The third objective is to discuss the advantages and disadvantages associated with these methods.
Dr. Andrei received a Ph.D. degree in Biostatistics from the University of Michigan in 2005. He is currently an Associate Professor in the Department of Preventive Medicine at Northwestern University, where he enjoys successful collaborations in cardiovascular outcomes research. He has developed expertise in MSMs and published relevant studies in adult cardiac surgery. He has developed practice-inspired and -oriented statistical methods in survival analysis, recurrent events, group sequential monitoring methods, hierarchical methods, and predictive modeling. In the last 15 years, Dr. Andrei has collaborated with medical researchers in fields such as pulmonary/critical care, organ transplantation, nursing, prostate and breast cancer, anesthesiology and thoracic surgery. Currently, he serves as Statistical Co-Editor of the Journal of the American College of Surgeons and deputy Statistical Editor of the Journal of Thoracic and Cardiovascular Surgery.
Upon attending this short course, participants will gain familiarity with propensity score-based methods for estimating causal effects in observational studies. Implementation in R and SAS software will be covered in detail, which will permit participants to integrate these useful data science techniques into their professional activities and projects. Learning how to produce simple yet powerful graphics to assess the propensity score model adequacy, check covariate balance and display the results will benefit every participant. By leveraging their enhanced set of skills, individuals across industries will be positioned to communicate more effectively with customers and clients. Continued professional development is key to one’s career growth and can enhance the overall analytical capabilities within participants' organizations and institutions.
2:00 PM - 4:00 PM
This tutorial covers the data science process using R as a programming language and Spark as a big-data platform. Powerful workflows are developed for data extraction, data tidying and transformation, data modeling, and data visualization.
During the course R-based examples show how data is transported from data sources into the Hadoop Distributed File System (HDFS), into relational databases, and directly into Spark's real-time compute engine. Workflows using `dplyr' verbs are used for data manipulation within R, within relational databases (PostgreSQL), and within Spark using `sparklyr'. These data-based workflows extend into machine learning algorithms, model evaluation, and data visualization.
The machine learning algorithms include supervised techniques such as linear regression, logistic regression, gradient-boosted trees, and random forests. Feature selection is done primarily by regularization and models are evaluated using various metrics. Unsupervised techniques include k-means clustering and dimension reduction.
Big-data architectures are discussed including the Docker containers used for building the short-course infrastructure called RSpark.
1. Fundamentals: Linux; RSpark; RStudio; Git; Data Science Process [20 min]
2. Data Sources: Text; JSON; PostgreSQL; Web [20 min]
3. Data Transformation: Data Cleaning; `tidyr'; `dplyr' [20 min]
4. Hadoop: HDFS as a Persistent Data Store for Spark [30 min]
5. `sparklyr': Spark DataFrames; `dplyr' Interface [30 min]
6. Supervised Learning: Regression and Classification Workflows with Spark [60 min]
7. Unsupervised Learning: Dimension Reduction and Clustering with Spark [30 min]
The first three modules will not be covered in detail since the focus is the last four. However, the content in modules 1-3 contains critical information for understanding the later modules.
The objectives of this course are to:
• extract static and streaming data from data sources,
• transform data into structured form,
• load data into relational and persistent, distributed data stores,
• build models using machine learning algorithms,
• validate and test models based on evaluation metrics,
• visualize big data and model metrics.
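The objectives above can be sketched end to end on a toy scale. The following Python example is only an illustration under stated assumptions and is not the course's R and Spark code: the JSON records are hypothetical, an in-memory SQLite database stands in for a relational store such as PostgreSQL, the model is a one-variable least-squares line, and the evaluation metric is the root mean squared error (RMSE) on held-out rows. Streaming sources, distributed storage, and visualization are left out.

```python
import json
import math
import sqlite3

# Hypothetical raw records standing in for a JSON data source. Values arrive
# as strings and one score is missing.
RAW_JSON = """
[{"id": 1, "hours": "1", "score": "55.5"},
 {"id": 2, "hours": "2", "score": "59.5"},
 {"id": 3, "hours": "3", "score": "65.5"},
 {"id": 4, "hours": "4", "score": null},
 {"id": 5, "hours": "5", "score": "74.5"},
 {"id": 6, "hours": "6", "score": "80.5"},
 {"id": 7, "hours": "7", "score": "84.5"},
 {"id": 8, "hours": "8", "score": "90.5"},
 {"id": 9, "hours": "9", "score": "94.5"},
 {"id": 10, "hours": "10", "score": "100.5"}]
"""


def extract(raw):
    """Extract: parse the raw JSON text into a list of dictionaries."""
    return json.loads(raw)


def transform(records):
    """Transform: drop incomplete records and cast string fields to numbers."""
    return [(r["id"], float(r["hours"]), float(r["score"]))
            for r in records
            if r["hours"] is not None and r["score"] is not None]


def load(rows):
    """Load: store tidy rows in an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE study (id INTEGER PRIMARY KEY, hours REAL, score REAL)")
    conn.executemany("INSERT INTO study VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def fit_line(points):
    """Model: ordinary least squares for score = intercept + slope * hours."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(points, slope, intercept):
    """Validate: root mean squared prediction error on the given points."""
    squared = [(y - (intercept + slope * x)) ** 2 for x, y in points]
    return math.sqrt(sum(squared) / len(squared))


def run_pipeline(raw=RAW_JSON):
    conn = load(transform(extract(raw)))
    # Every third id is held out for testing; the rest are used for training.
    train = conn.execute(
        "SELECT hours, score FROM study WHERE id % 3 != 0").fetchall()
    test = conn.execute(
        "SELECT hours, score FROM study WHERE id % 3 = 0").fetchall()
    conn.close()
    slope, intercept = fit_line(train)
    return slope, intercept, rmse(test, slope, intercept)


slope, intercept, test_rmse = run_pipeline()
print(f"slope={slope:.2f} intercept={intercept:.2f} test RMSE={test_rmse:.2f}")
```

Keeping each stage in its own function mirrors the extract, transform, load, model, and validate steps, so any one stage could be swapped for a larger-scale equivalent without touching the others.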
E. James Harner is Professor Emeritus of Statistics and Adjunct Professor of Business Data Analytics at West Virginia University. He was the Chair of the Department of Statistics for 17 years and the Director of the Cancer Center Bioinformatics Core for 15 years. Currently, he is the Chairman of the Interface Foundation of North America which has partnered with the American Statistical Association to organize the annual Symposium on Data Science and Statistics (SDSS). The areas of his technical and research expertise include: bioinformatics, high-dimensional modeling, high-performance computing, streaming and big data modeling, and statistical machine learning.
This course is based on a two-day workshop developed for the National Institute of Statistical Sciences (NISS): https://www.niss.org. The two-day version has been successfully taught three times (at ASA headquarters and at UC Riverside in September, 2017 and at the U. of Toronto in April, 2018). A one-day version of this course will be taught at the Symposium on Data Science and Statistics in May, 2018 and at the Joint Statistical Meetings in August/September, 2018.
Unlike many data science short courses, RSpark provides big-data platforms, i.e., R, Hadoop and Spark and their ecosystems. Most instructors cannot offer this because the infrastructure is difficult to build. As a result, attendees will get a realistic taste of what data science involves in practice.
The full data science process is taught, but the focus is on machine learning and the underlying R code. What is taught is a realistic representation of what is done in practice.
Communication of results is done through reproducible reports and data visualizations, which are often the endpoints of pipelines in R and Spark. Collaboration is primarily done using Git and GitHub, although code sharing within RStudio is also discussed. Data science in practice is almost always a team effort and parts of this collaboration are taught.
This course offers a unique opportunity for professional development since a real data science platform is used. It is possible to scale RSpark using container orchestration, but the containers used within this course are essentially indistinguishable from a production environment.
2:00 PM - 4:00 PM
It is time to take the next step and start wrapping all your utility functions that are scattered across numerous .R files into R packages to help with code organization, distribution, and consistent documentation.
In this hands-on tutorial, I will introduce step-by-step how to build your very own R package. If you've used R, you've almost certainly used a package - but did you know that building your own package is actually not hard at all? If you have written bits of useful code you want to keep and return to, you might want a package.
After this session, participants will have the skills to start a package and document their functions, and resources to use for next steps like vignettes and unit testing. During the tutorial, participants can follow along using provided scripts.
This hands-on tutorial includes the following sections:
1. Set up R and install required packages
2. Create the framework for your package
3. Add functions to the package
4. External dependencies
6. Install and use your package
7. (Bonus) Distribute your package on GitHub
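To make the "framework" step above concrete, here is a small Python sketch that writes the files a minimal R package needs: a DESCRIPTION file, a NAMESPACE file, and an R/ directory holding the function source. This is an illustration only and not the tutorial's provided scripts; in an R session the same skeleton is typically generated with helpers such as usethis::create_package() and devtools::document(). The package name mytools, the function add_one, and the author fields are all hypothetical.

```python
import tempfile
from pathlib import Path

# Hypothetical package metadata; every field value here is a placeholder.
DESCRIPTION = """\
Package: {name}
Title: Small Utility Functions
Version: 0.0.1
Authors@R: person("First", "Last", email = "first.last@example.com",
    role = c("aut", "cre"))
Description: A hypothetical example package that collects utility functions.
License: GPL-3
Encoding: UTF-8
"""

# Hand-written NAMESPACE; roxygen2 would normally generate this file.
NAMESPACE = "export(add_one)\n"

# One documented function, using roxygen2-style comments.
R_SOURCE = """\
#' Add one to a number
#'
#' @param x A numeric vector.
#' @return The input plus one.
#' @export
add_one <- function(x) {
  x + 1
}
"""


def create_skeleton(root, name):
    """Write a minimal R package skeleton under root and return its path."""
    package_dir = Path(root) / name
    (package_dir / "R").mkdir(parents=True)
    (package_dir / "DESCRIPTION").write_text(DESCRIPTION.format(name=name))
    (package_dir / "NAMESPACE").write_text(NAMESPACE)
    (package_dir / "R" / "add_one.R").write_text(R_SOURCE)
    return package_dir


with tempfile.TemporaryDirectory() as tmp:
    package_dir = create_skeleton(tmp, "mytools")
    for path in sorted(package_dir.rglob("*")):
        print(path.relative_to(package_dir))
```

Each utility function would get its own documented entry in the R/ directory and, if it should be visible to users, a matching export in NAMESPACE.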
Amy Yang is a Sr. Data Scientist at Uptake, where she conducts industrial analytics and builds prediction models for major industries to help them increase productivity, security, safety and reliability. She began using R for simulation and statistical analysis during her studies at the University of Pennsylvania, where she received her MS degree in Biostatistics. She also teaches R programming and statistical courses for graduate students. You can find her on Twitter @ayanalytics
Outside of work, Amy co-organizes the Chicago RLadies meetup group, where she helps promote R by inviting women speakers from different data science fields to give talks. Her goal is to create a friendly network among women who use R!
Amy also mentors PhD and master's students on their quantitative dissertations. She enjoys the teaching aspect of doing Data Science.
The tutorial touches on the following areas of the conference theme.
1. Communication and Collaboration
No more emailing .R scripts! An R package gives an easy way to distribute code to others. Especially if you put it on GitHub.
2. Consistent documentation
I can barely remember what half of my functions do, let alone their inputs and outputs. An R package provides a consistent documentation structure and actually encourages you to document your functions.
3. Code Organization and reproducibility
Are you trying to figure out where that “function” you wrote months, weeks, or even days ago went? Oftentimes, people in statistics end up just rewriting it because that is faster than searching through all the .R files. An R package helps organize where your functions go.
2:00 PM - 4:00 PM
Simulation methods have become an increasingly important tool in the search for more efficient clinical trial designs and/or statistical analysis procedures. During our short course we will provide a road map to developing and executing a successful simulation plan and communicating these results with a broader team. We will begin with a survey of problems one might encounter during the design, monitoring and analysis stages of a clinical trial for which a simulation study may provide some insight. We continue with an introduction to standard methods for generating random data. This discussion will include methods to mimic real-world data that do not adhere to standard statistical distributions, methods to introduce correlation among endpoints, parametric and non-parametric bootstrapping techniques, and the use of historic data to simulate future data. Having established this foundation, we return to some of our motivating problems and discuss their simulation-based solutions in greater depth. Though extensive R code will be provided to supplement this tutorial, our emphasis will be on the important concepts and principles of good simulation design and reporting.
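Two of the techniques named above, introducing correlation among endpoints and non-parametric bootstrapping, can be sketched in a few lines of Python. This is a hedged illustration and not the R code supplied with the course: it assumes two standard normal endpoints, a target correlation of 0.6, and a percentile bootstrap interval, all chosen here for the example.

```python
import math
import random


def simulate_correlated_pairs(n, rho, rng):
    """Draw n pairs of standard normal endpoints with correlation rho.

    Uses the 2x2 Cholesky factor of the correlation matrix:
    second = rho * z1 + sqrt(1 - rho^2) * z2.
    """
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        pairs.append((z1, rho * z1 + math.sqrt(1.0 - rho * rho) * z2))
    return pairs


def pearson(pairs):
    """Sample Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    return sxy / math.sqrt(sxx * syy)


def bootstrap_interval(pairs, statistic, n_boot, rng, level=0.95):
    """Non-parametric percentile bootstrap interval for a statistic.

    Subjects (pairs) are resampled with replacement so that the
    within-subject correlation is preserved in every replicate.
    """
    estimates = sorted(
        statistic(rng.choices(pairs, k=len(pairs))) for _ in range(n_boot))
    lower_index = int((1.0 - level) / 2.0 * n_boot)
    upper_index = int((1.0 + level) / 2.0 * n_boot) - 1
    return estimates[lower_index], estimates[upper_index]


rng = random.Random(2018)
pairs = simulate_correlated_pairs(500, 0.6, rng)
estimate = pearson(pairs)
lower, upper = bootstrap_interval(pairs, pearson, 1000, rng)
print(f"sample correlation {estimate:.2f}, "
      f"95% bootstrap interval ({lower:.2f}, {upper:.2f})")
```

The same resampling function works for any statistic computed from the pairs, which is one reason the bootstrap is convenient when real-world data do not follow a standard distribution.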
Tentative Course Outline: a subset of topics may be replaced with more contemporary materials
• Welcome and introduction
• Some motivation for simulation
• Modeling randomness
• Enrollment modeling
• Simulating correlated data
• An application using simulated correlated endpoints
• Leveraging historic data to aid in simulation
• Case study: Robustness of efficacy to early withdrawers in an outcomes study
• Case study: Recurrent events
• Simulation Size – How large is large?
• Closing remarks
• Provide an introduction to statistical simulation
• Contrast theory and iterative problem solving
• Demonstrate simulation concepts via examples
• Simulation planning
• Communicating & drawing inferences from simulation
• Focus is not on coding and syntax or deep theory
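The outline's question about simulation size can be illustrated with a short Python sketch, offered under stated assumptions and not drawn from the course materials. It estimates the power of a hypothetical two-arm trial by simulation, reports the Monte Carlo standard error sqrt(p(1 - p) / N) of that estimate, and inverts the formula to find how many simulated trials are needed for a target precision. The trial design (50 subjects per arm, unit-variance normal outcomes, a mean difference of 0.5, a two-sided z-test at the 5% level) is invented for the example.

```python
import math
import random


def trial_rejects(n_per_arm, effect, rng):
    """Simulate one two-arm trial and apply a two-sided 5% z-test.

    Outcomes are normal with known unit variance (an assumption of this
    sketch), so the standard error of the mean difference is sqrt(2 / n).
    """
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
    treated = [rng.gauss(effect, 1.0) for _ in range(n_per_arm)]
    difference = sum(treated) / n_per_arm - sum(control) / n_per_arm
    z = difference / math.sqrt(2.0 / n_per_arm)
    return abs(z) > 1.959964


def estimate_power(n_sims, n_per_arm, effect, seed=0):
    """Return the simulated power and its Monte Carlo standard error."""
    rng = random.Random(seed)
    rejections = sum(
        trial_rejects(n_per_arm, effect, rng) for _ in range(n_sims))
    power = rejections / n_sims
    mc_se = math.sqrt(power * (1.0 - power) / n_sims)
    return power, mc_se


def sims_needed(target_se, power_guess=0.5):
    """Simulated trials needed so the Monte Carlo SE is at most target_se.

    power_guess = 0.5 is the worst case, since p * (1 - p) peaks there.
    """
    return math.ceil(power_guess * (1.0 - power_guess) / target_se ** 2)


power, mc_se = estimate_power(n_sims=2000, n_per_arm=50, effect=0.5)
print(f"estimated power {power:.3f} (Monte Carlo SE {mc_se:.3f})")
print("simulations for SE <= 0.01:", sims_needed(0.01))
```

Because the Monte Carlo standard error shrinks with the square root of the number of simulated trials, halving it requires roughly four times as many simulations.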
Greg Cicconetti, Ph.D., Statistical Innovations, Data and Statistical Sciences, AbbVie. Greg began his career as an assistant professor of statistics at Muhlenberg College before joining the pharmaceutical industry in 2005. In his roles at GlaxoSmithKline and AbbVie, Greg has gained extensive experience in survival and longitudinal trials, Bayesian methodology, and statistical learning. He has used simulation to guide teams regarding trial design, monitoring, and sensitivity analyses. In his current position Greg assists study teams in determining decision criteria to be used at interim analyses, effectively marrying simulation and visualization to build team consensus. Portions of the planned course material were delivered at the 2014 Deming Conference and also used in the graduate level Advanced Statistical Computing course at Drexel University taught by Greg in 2015. Greg is also a member of the DIA Scientific Working Group on Adaptive Designs and has participated in the development of a manuscript, along with other industry experts, advocating best practices in simulation reporting.
While this course is intended to be an introduction to simulation design and reporting, the attendee will be exposed to new statistical methodologies currently being employed to support on-going trials. Our discussion on simulation reporting will emphasize the importance of clearly articulating one's simulation design and summarizing pertinent simulation output in a way that facilitates collaboration with multiple stakeholders. Although we will use drug development and clinical trial design as a backdrop for explaining important simulation concepts, the core ideas presented should readily translate to those in other fields.