Abstract:
|
When tackling large-scale statistical learning problems, it helps to have ground-truth labels, either to build supervised models or to assess model performance. But we often take the quality of these labels for granted: we rarely question where our labels came from, how they were generated, or how uncertainty in the labels may affect our research. In the age of easily accessible online crowdsourcing, we can collect more labels from more labelers. This provides an opportunity to study how humans decide to label data and how this subjective process affects subsequent modeling. As part of a recent record linkage project, we use an R Shiny application to collect nested labels that link 1901 Ireland census households and individuals to their (potential) 1911 counterparts. During the collection process, we track how people interact with and make decisions about the records themselves. We study the impact of this process and explore how that additional information can be incorporated directly into record linkage models. We argue that researchers should be more cognizant of the impact of the human decision-making process and, when applicable, adjust models accordingly.
|