Online Program


Thursday, February 15
SC5 Cleaning Up the Data Cleaning Process: Challenges and Solutions in R
Thu, Feb 15, 8:00 AM - 12:00 PM
Salon D
Instructor(s): Claus Thorn Ekstrøm, Biostatistics, University of Copenhagen; Anne Helby Petersen, Biostatistics, University of Copenhagen
Data cleaning and validation are the first steps in any data analysis, because the validity of the conclusions hinges on the quality of the input data. Mistakes in the data can arise for any number of reasons, including coding errors, malfunctioning measurement equipment, and inconsistent data generation manuals. We present a systematic, analytical approach that makes the data cleaning process just as structured and well documented as the rest of the data analysis. The primary software tool is the dataMaid R package, which implements an extensive and customisable suite of quality assessment tools for identifying potential problems in a dataset. The results are summarised in an auto-generated, non-technical, stand-alone document readable by statisticians and non-statisticians alike. Thus, the course teaches practical skills that aid the dialogue between data analysts and field experts, while also providing easy documentation of reproducible data cleaning steps and data quality control.