Keywords: Visualization, Python, Matplotlib, Pandas, Seaborn, R, ggplot2, graphics, EDA
Visualization, for both data exploration and presentation of results, is an essential part of any statistician’s toolbox. Most statisticians use one or more visualization software packages. However, there is little awareness in the statistical community of the visualization capabilities available in Python. This tutorial presentation will familiarize statisticians with the powerful visualization tools available in Python.
Many statisticians are familiar with R, including ggplot2 for visualization. We will use these tools as a reference point to introduce attendees to the parallel capabilities in Python. The focus will be on the most commonly used Python scientific visualization packages: Matplotlib, Pandas, and Seaborn.
The basis of scientific visualization in Python is the low-level Matplotlib package. Other Python statistical visualization packages, including the Pandas plotting methods and the Seaborn package, are built on top of Matplotlib. Thus, there is considerable consistency throughout the Python scientific visualization stack, and a basic understanding of Matplotlib allows users to extend and significantly customize plots made with these other packages.
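As a minimal sketch of this layering (not taken from the tutorial notebook), the example below draws a scatter plot with a Pandas plotting method and then customizes it with ordinary Matplotlib calls. The data are synthetic and the column names and labels are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a Jupyter notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data, invented purely for illustration
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=100)})
df["y"] = 2.0 * df["x"] + rng.normal(scale=0.5, size=100)

# The Pandas plotting method returns a Matplotlib Axes object
ax = df.plot.scatter(x="x", y="y")

# From here on, plain Matplotlib methods customize the Pandas plot
ax.set_title("Pandas scatter plot, customized with Matplotlib")
ax.set_xlabel("predictor")
ax.set_ylabel("response")
ax.axhline(0.0, color="gray", linestyle="--")  # horizontal reference line
```

Because the returned object is a Matplotlib Axes, anything Matplotlib can do to an Axes (annotations, reference lines, tick formatting) is available after the Pandas call.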
The Pandas data frame package provides powerful tools to manage and transform raw data. Much of the functionality in this package will be familiar to R data frame users, but Pandas includes some powerful extensions. Additionally, Pandas includes some useful exploratory visualization methods.
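The following sketch illustrates, under assumed synthetic data, the kind of workflow described here: a grouped summary and a group-wise transformation, followed by a quick exploratory histogram. The column names `group`, `value`, and `centered` are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a Jupyter notebook
import numpy as np
import pandas as pd

# Synthetic data, invented purely for illustration
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "group": rng.choice(["a", "b", "c"], size=300),
    "value": rng.normal(loc=10.0, scale=2.0, size=300),
})

# Grouped summary, similar in spirit to aggregating an R data frame by a factor
summary = df.groupby("group")["value"].agg(["count", "mean", "std"])

# Group-wise transform: center each value on its own group's mean
df["centered"] = df["value"] - df.groupby("group")["value"].transform("mean")

# Exploratory histogram drawn directly from the Series
ax = df["centered"].plot.hist(bins=20, title="Group-centered values")
```

Here `transform` returns a result aligned with the original rows, which is what allows the group means to be subtracted in a single expression.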
The Seaborn package is a powerful and innovative statistical visualization package. Seaborn has many of the capabilities found in ggplot2, but is currently not as extensive. However, Seaborn does include some interesting innovations.
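For readers coming from ggplot2, the sketch below shows two Seaborn calls on assumed synthetic data: a scatter plot colored by a grouping variable, and a faceted plot with a fitted regression line per panel. The variable names `dose`, `arm`, and `response` are hypothetical, and the comparison to ggplot2 is a loose analogy rather than an exact equivalence.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a Jupyter notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data, invented purely for illustration
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "dose": rng.uniform(0.0, 10.0, size=150),
    "arm": rng.choice(["control", "treated"], size=150),
})
shift = np.where(df["arm"] == "treated", 3.0, 0.0)
df["response"] = 1.5 * df["dose"] + shift + rng.normal(size=150)

# Color by group, loosely analogous to mapping colour in a ggplot2 aesthetic
ax = sns.scatterplot(data=df, x="dose", y="response", hue="arm")
ax.set_title("Response by dose")  # Matplotlib customization still applies

# One panel per group with a fitted regression line,
# loosely analogous to faceting plus a linear smoother in ggplot2
g = sns.lmplot(data=df, x="dose", y="response", col="arm")
g.set_axis_labels("Dose", "Response")
```

`scatterplot` returns a Matplotlib Axes, while `lmplot` returns a Seaborn `FacetGrid` whose `axes` attribute holds the underlying Matplotlib Axes for each panel.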
The presentation will be a live demonstration using a Jupyter notebook. The notebook, including all code, will be made publicly available.