Abstract Details

Activity Number: 294 - SPEED: Statistical Learning and Data Science Speed Session 2, Part 1
Type: Contributed
Date/Time: Tuesday, July 30, 2019, 8:30 AM to 10:20 AM
Sponsor: Section on Statistical Learning and Data Science
Abstract #304653
Title: Ground Truth? Understanding How Humans Label Records and the Impact of Uncertainty
Author(s): Kayla Frisoli* and Rebecca Nugent
Companies: Carnegie Mellon University and Carnegie Mellon University
Keywords: record linkage; statistical learning; R shiny; human-data interaction; crowdsourced data; census data

When tackling large-scale statistical learning problems, it is helpful to have ground truth labels, either to build supervised models or to assess model performance. Yet we often take the quality of these labels for granted. We rarely question where our labels came from, how they were generated, or how uncertainty in the labels may affect our research. In the age of easily accessible online crowdsourcing, we can generate more labels from more labelers. This provides an opportunity to study how humans decide to label data and how this subjective process affects subsequent modeling. As part of a recent record linkage project, we use an R Shiny application to collect nested labels that link 1901 Ireland census households and individuals to their (potential) 1911 counterparts. During the collection process, we track how people interact with and make decisions about the records themselves. We study the impact of this decision-making process and explore how the additional information can be incorporated directly into record linkage models. We argue that researchers should be more cognizant of the impact of the human decision-making process and, when applicable, adjust models accordingly.
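To illustrate the label-uncertainty idea in the abstract, the sketch below aggregates multiple labelers' match/non-match votes on candidate record pairs into an agreement proportion instead of a single hard "ground truth" label. This is a minimal, hypothetical example in Python, not the authors' method or their R Shiny application; the pair identifiers and votes are invented for illustration.

```python
from collections import defaultdict

def soft_labels(votes):
    """Aggregate per-pair match votes (1 = match, 0 = non-match)
    from multiple labelers into an agreement proportion in [0, 1]."""
    tally = defaultdict(list)
    for pair_id, vote in votes:
        tally[pair_id].append(vote)
    # Proportion of labelers who called each candidate pair a match
    return {pid: sum(v) / len(v) for pid, v in tally.items()}

# Hypothetical votes from three labelers on two 1901-1911 candidate pairs
votes = [
    ("hh1901_042-hh1911_017", 1),
    ("hh1901_042-hh1911_017", 1),
    ("hh1901_042-hh1911_017", 0),  # one labeler disagrees
    ("hh1901_099-hh1911_350", 0),
    ("hh1901_099-hh1911_350", 0),
    ("hh1901_099-hh1911_350", 0),
]
print(soft_labels(votes))
```

A downstream linkage model could then treat the agreement proportion as a soft label or a case weight, rather than assuming every labeled pair is equally certain.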

Authors who are presenting talks have a * after their name.
