Prerequisite: Analysis of Complex Health Survey Data, Part 1
Format: One 2-hour session, including practicum. A laptop is strongly recommended.
Target Audience: Epidemiologists, data scientists, informaticians, data analysts, and statisticians.
The phrase “big data” has become widespread, but what does it mean for the practicing healthcare analyst, and how does the presence of big data change that analyst's day-to-day workflow? In this workshop, attendees will be introduced to several tools useful in the analysis of big healthcare data, including Python, SQL, Hadoop, and Spark. The workshop will consist of a lecture and discussion of these technologies, followed by practical examples with code. Students will have the opportunity to follow along and run code throughout the workshop.

Through instructor-led examples, we will discuss and demonstrate the efficiency of various analytic frameworks for a binary classification problem using synthetic EMR data. We will begin with examples of managing data in SQL and, alternatively, in a NoSQL environment. We will cover several examples of dimensionality reduction for healthcare data in the pre-modeling environment, and then consider more complex dimensionality reduction techniques that require an analytic platform beyond a simple database management system. To explore different approaches to the classification problem, we will present penalized regression (LASSO), random forests, and support vector machines (SVM). We will contrast traditional serial optimization approaches (such as Newton-Raphson) with parallel optimization approaches (such as stochastic gradient descent).

Students will be provided with code to run all models ahead of the workshop, so no prior experience in these languages is required. All software used will be open source. Students will be expected to set up their computing environment prior to the workshop; further details and guidance will be sent to attendees.
Specific Learning Objectives:
• Understand options for managing large healthcare data sources in the pre-modeling environment.
• Understand the difference between traditional RDBMS and NoSQL alternatives.
• Understand and define the differences between the model, loss function, regularizer, and optimization.
• Understand serial versus parallel model optimization techniques and the implications for practical approaches to analysis in the presence of big data.
• Understand the impact of increasing dimensionality on different analytic approaches.
• Gain a basic understanding of fitting models in Python.
• Gain a basic understanding of fitting models in Apache Spark (using the Python API).
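As an illustration of the objectives on separating the model, loss function, regularizer, and optimization, here is a minimal Python sketch. It is not the workshop's code, and several things in it are assumptions made for illustration: the data are simulated with made-up coefficients, there is a single predictor so the Newton-Raphson step can be solved by hand, the regularizer is an L2 (ridge) penalty rather than the L1 penalty used by LASSO so that second derivatives exist, and all function names are hypothetical. The stochastic gradient descent loop runs serially here; it only shows the one-observation-at-a-time update that makes the method a candidate for distributed execution.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict(w, x):
    # MODEL: P(y = 1 | x) with intercept w[0] and slope w[1].
    return sigmoid(w[0] + w[1] * x)

def log_loss(w, xs, ys):
    # LOSS FUNCTION: mean negative log-likelihood of the labels.
    eps = 1e-12
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(predict(w, x), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(xs)

def l2_penalty(w, lam):
    # REGULARIZER: ridge penalty on the slope only (assumption: L2, not LASSO's L1).
    return 0.5 * lam * w[1] ** 2

def objective(w, xs, ys, lam):
    # What every optimizer below tries to minimize.
    return log_loss(w, xs, ys) + l2_penalty(w, lam)

def fit_newton(xs, ys, lam, iters=25):
    # OPTIMIZATION (serial): each step needs a full pass over the data
    # to build the gradient g and the 2x2 Hessian H, then solves H d = g.
    w = [0.0, 0.0]
    n = len(xs)
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = predict(w, x)
            r = p - y            # residual drives the gradient
            s = p * (1.0 - p)    # curvature weight for the Hessian
            g0 += r
            g1 += r * x
            h00 += s
            h01 += s * x
            h11 += s * x * x
        g0, g1 = g0 / n, g1 / n + lam * w[1]
        h00, h01, h11 = h00 / n, h01 / n, h11 / n + lam
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        w = [w[0] - d0, w[1] - d1]
    return w

def fit_sgd(xs, ys, lam, epochs=50, lr0=0.5, seed=0):
    # OPTIMIZATION (stochastic): each update touches one observation,
    # using a decaying learning rate.
    rng = random.Random(seed)
    w = [0.0, 0.0]
    order = list(range(len(xs)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            lr = lr0 / (1.0 + 0.01 * t)
            r = predict(w, xs[i]) - ys[i]
            w[0] -= lr * r
            w[1] -= lr * (r * xs[i] + lam * w[1])
    return w

def make_data(n=500, seed=42):
    # Simulated binary outcome; the coefficients -0.5 and 1.5 are made up.
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [1 if rng.random() < sigmoid(-0.5 + 1.5 * x) else 0 for x in xs]
    return xs, ys

if __name__ == "__main__":
    xs, ys = make_data()
    for name, fit in (("Newton-Raphson", fit_newton), ("SGD", fit_sgd)):
        w = fit(xs, ys, lam=0.01)
        print(name, [round(v, 3) for v in w], round(objective(w, xs, ys, 0.01), 4))
```

Both optimizers minimize the same objective, so swapping one for the other changes how the estimate is reached, not the model, loss, or regularizer. Newton-Raphson needs the full gradient and Hessian at every step, whereas each SGD update needs only a single row of data.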