Machine learning models rely on high-quality input data, for example, images labeled as dogs or cats, or text labeled as expressing positive or negative sentiment. The instruments used to collect these labels resemble web surveys, except that the questions are about images or text rather than about the labelers themselves. Our study tests whether the principles of data quality in web surveys also apply to the collection of labels for machine learning models.
We fielded two versions of an instrument to code the sentiment of tweets. All tweets had been coded previously, so gold-standard labels exist. By comparing the labels collected with the two versions, we provide the first evidence that instrument design matters in the collection of labels for data science. We also investigate annotator effects, drawing a parallel to interviewer effects in the survey literature. Our results will interest data scientists who want to save time and money by collecting high-quality labels.
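To illustrate the kind of comparison we describe, the following is a minimal sketch, not our actual analysis code; the file name and the column names (version, annotator_id, label, gold_label) are hypothetical placeholders for whatever structure the collected labels take.

```python
import pandas as pd

# Hypothetical input: one row per collected label, with columns
# "version" (instrument version A or B), "annotator_id",
# "label" (the collected sentiment), and "gold_label"
# (the pre-existing gold-standard code for that tweet).
labels = pd.read_csv("tweet_labels.csv")

# Agreement with the gold standard, by instrument version:
# if design matters, the two versions should differ here.
labels["correct"] = labels["label"] == labels["gold_label"]
print(labels.groupby("version")["correct"].mean())

# Annotator effects, analogous to interviewer effects in surveys:
# spread of per-annotator accuracy within each version.
per_annotator = labels.groupby(["version", "annotator_id"])["correct"].mean()
print(per_annotator.groupby("version").std())
```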