Abstract:
|
Crowdsourcing has emerged as an alternative way to collect large-scale data from nonexperts, for example, in medical diagnosis and natural language processing tasks. Instead of involving experts, crowdsourcing collects labels, answers, or solutions from a crowd of workers through online platforms. However, the majority of workers are not domain experts, so their answers are noisy and, to some extent, unreliable. In this paper, we propose a two-stage model to infer the true labels for binary and multicategory classification tasks. In the first stage, we fit the observed labels with a latent factor model and incorporate group structures from both tasks and workers through a multi-directional separation penalty. For the multicategory case, we introduce a group-wise rotation to align the workers’ latent factors with the different task categories. In the second stage, we infer the true labels based on the identified high-quality worker groups to improve prediction accuracy. In theory, we establish the estimation consistency of the latent factors and the classification consistency of the proposed method. Simulation studies and real-data examples also support the proposed method.
|