All Times EDT
Poster Q&A will be available during these designated hours as part of the virtual conference.
In this mini-workshop, we discuss some of the social and ethical challenges of statistical and machine learning algorithms with a panel of experts from academia and industry.
R is a flexible, extensible statistical computing environment, but by default it is limited to single-core execution. Spark is a distributed computing environment that treats R as a first-class programming language. This course introduces data structures in R and their use in functional programming workflows relevant to data science.
The course covers the initial steps in the data science process, i.e., ETL: extracting data from source systems, transforming data into a tidy form, and loading data into distributed file systems, distributed data warehouses, and NoSQL databases.
These R-based workflows are illustrated by using dplyr directly and as a frontend to SQL databases. The sparklyr package, with its dplyr interface to Spark, is then used for modeling big data using regression and classification supervised learning methods. Unsupervised learning methods, such as clustering and dimension reduction, are also covered. Finally, methods for analyzing streaming data are presented. Student accounts are provided to allow attendees to interactively run the R Markdown content in Amazon’s cloud (AWS). The computing infrastructure and the content are containerized, which allows the complete course environment to be downloaded and run on Docker-supported laptops.
Big datasets (many rows, many columns, many items, ...) present special problems for visualization. Even when trying to plot simple rectangular datasets, we encounter computational complexity (many algorithms are polynomial or exponential in the number of rows or columns), the curse of dimensionality (pairwise distances approach a constant as dimensionality heads toward infinity), choke points (data bus or network bandwidth), and limited display resolution (even with megapixel displays). This workshop covers recent strategies that exploit aggregation and projection to reduce datasets to manageable proportions. It also covers graphic representations that are most suitable for exploring multivariate data.
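As a rough illustration of the aggregation idea in the workshop description, the sketch below bins a large set of 2D points into a fixed grid of counts, so the amount of data to draw depends on the grid size rather than on the number of rows. It is a minimal sketch and not the workshop's own method: the function name `bin_points`, the 50-by-50 grid, and the synthetic Gaussian data are all assumptions made for the example.

```python
from collections import Counter
import random


def bin_points(points, x_range, y_range, nx, ny):
    """Aggregate (x, y) points into an nx-by-ny grid of counts (hypothetical helper)."""
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    counts = Counter()
    for x, y in points:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # skip points outside the plotting window
        # Map each coordinate to a bin index; clamp the upper edge into the last bin.
        i = min(int((x - x_min) / (x_max - x_min) * nx), nx - 1)
        j = min(int((y - y_min) / (y_max - y_min) * ny), ny - 1)
        counts[(i, j)] += 1
    return counts


# Synthetic data, assumed purely for illustration.
random.seed(0)
points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100_000)]

grid = bin_points(points, (-4, 4), (-4, 4), nx=50, ny=50)
print(len(points), "points reduced to", len(grid), "non-empty bins")
```

The resulting counts could be drawn as a heatmap with at most `nx * ny` cells, however many rows the input has.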