The phrase "big data" has become widespread, but what does it mean for the practicing healthcare analyst? How does high-dimensional data change the actual workflow of conducting analytics in health care and health policy? In this course, participants will gain experience using cutting-edge software tools for big data analysis, with a focus on Python and Apache Spark. We will begin with an overview of the challenges of making inferences from high-dimensional data, which will lead to a discussion of recent software solutions in this space. Through instructor-led examples, we will then discuss and demonstrate the efficiency of various analytic frameworks. We will differentiate online learning approaches from distributed optimization approaches, and cover several examples of dimensionality reduction applied to data before modeling. We will contrast traditional serial optimization approaches (such as Newton-Raphson) with parallel optimization approaches (such as stochastic gradient descent). Students will be provided with code to run all models presented at the workshop, so no prior experience in these languages is required. All software used will be open source, and students will be able to set up their computing environment prior to the workshop.
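
To give a flavor of the contrast between Newton-Raphson and stochastic gradient descent, here is a minimal sketch that fits the same logistic regression both ways. It is not the workshop's code: the simulated data, the function names (`make_data`, `newton_raphson`, `sgd`), and the step size, batch size, and epoch count are all assumptions chosen for illustration. Both routines run serially here. The sketch only shows the update rules: Newton-Raphson needs the full-data gradient and Hessian at every step, while SGD touches one small batch at a time, which is the property that lends itself to online and distributed settings.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def make_data(n=2000, seed=0):
    """Simulate a small logistic-regression data set (illustrative only)."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    true_beta = np.array([-0.5, 1.0, -2.0])  # assumed coefficients
    y = (rng.random(n) < sigmoid(X @ true_beta)).astype(float)
    return X, y


def newton_raphson(X, y, n_iter=10):
    """Full-data Newton-Raphson: each step solves a p-by-p linear system."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        grad = X.T @ (y - p)                        # score vector
        hess = X.T @ (X * (p * (1 - p))[:, None])   # observed information
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def sgd(X, y, n_epochs=50, lr=0.05, batch_size=32, seed=1):
    """Mini-batch SGD: each step uses only a small random batch of rows."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    beta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            p = sigmoid(X[idx] @ beta)
            grad = X[idx].T @ (y[idx] - p) / len(idx)  # average batch gradient
            beta = beta + lr * grad                    # ascent on log-likelihood
    return beta


X, y = make_data()
beta_nr = newton_raphson(X, y)
beta_sgd = sgd(X, y)
print("Newton-Raphson:", np.round(beta_nr, 3))
print("SGD:           ", np.round(beta_sgd, 3))
```

On a problem this small, Newton-Raphson converges in a handful of iterations and SGD lands near the same estimate only approximately. The trade-off changes as the number of features grows, because forming and solving the Hessian system becomes the expensive step.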