Abstract:
Expert raters are widely used to quantify subjective input in fields including psychology, educational measurement, epidemiology, and statistics. Measurement error arising from less-than-perfect inter-rater reliability is a concern for users of such data, so precise estimates of inter-rater reliability are needed. In the Advanced Placement (AP) History exam, each essay is randomly assigned to two raters, who work independently and each assign an integer score (0-6) to the essay. No essay is graded by all raters, and raters may not be interchangeable (e.g., some may be more severe than others). In this paper, the rating results are viewed from a missing-data perspective, and a model-based multiple imputation procedure is used to improve the estimation of inter-rater reliability. The approach is compared empirically with other standard reliability measures using a dataset containing four independent grades per essay. To simulate an AP-style rating situation, the grades are subsampled to two per essay, and reliability estimates based on two versus four ratings are then compared. The results have practical implications for practitioners who rely on expert ratings.
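As a rough illustration of the subsampling comparison described above (a hedged sketch, not the paper's actual data, model, or imputation procedure), the following Python snippet simulates essays with four independent ratings under a simple latent true-score model, randomly keeps two ratings per essay to mimic the AP design, and computes a one-way intraclass correlation ICC(1) in each case. All quantities here (the number of essays, the noise standard deviation, the use of ICC(1) as the reliability measure) are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulation: each essay has a latent true score; each of
# four raters adds independent noise; scores are rounded and clipped to
# the 0-6 AP scale. Parameters are illustrative, not from the paper.
n_essays = 2000
true_score = rng.normal(3.0, 1.0, size=n_essays)
ratings = np.clip(
    np.rint(true_score[:, None] + rng.normal(0.0, 0.8, size=(n_essays, 4))),
    0, 6,
)

def icc_oneway(mat):
    """One-way random-effects intraclass correlation, ICC(1), from the
    between-essay and within-essay mean squares of a one-way ANOVA."""
    n, k = mat.shape
    grand = mat.mean()
    row_means = mat.mean(axis=1)
    ms_between = k * ((row_means - grand) ** 2).sum() / (n - 1)
    ms_within = ((mat - row_means[:, None]) ** 2).sum() / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Reliability estimated from all four ratings per essay...
icc4 = icc_oneway(ratings)

# ...versus an AP-style design: randomly keep two of the four ratings.
keep = np.array([rng.choice(4, size=2, replace=False) for _ in range(n_essays)])
two_ratings = np.take_along_axis(ratings, keep, axis=1)
icc2 = icc_oneway(two_ratings)

print(f"ICC(1) from 4 ratings per essay: {icc4:.3f}")
print(f"ICC(1) from 2 ratings per essay: {icc2:.3f}")
```

Both designs estimate the same single-rating reliability parameter, so the two ICC(1) values should be close on average; the two-rating design simply estimates it less precisely, which is the kind of loss the paper's imputation-based approach aims to mitigate.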
Copyright © American Statistical Association.