Sequential Testing for Intraclass Correlation Coefficient in Inter-Rater Reliability Studies
Zhen Chen, National Institute of Child Health and Human Development 
*Mei Jin, George Washington University 
Zhaohai Li, George Washington University 
Aiyi Liu, National Institute of Child Health and Human Development 

Keywords: Interim analysis; intraclass correlation coefficient; inter-rater reliability; sample size and power; two-way ANOVA.

Inter-rater reliability is usually assessed by means of the intraclass correlation coefficient. Using two-way analysis of variance to model raters and subjects as random effects, we derive group sequential testing procedures for the design and analysis of reliability studies in which multiple raters evaluate multiple subjects. Compared with the conventional fixed-sample procedures, the group sequential test has a smaller average sample number. The performance of the proposed technique is examined using simulation studies, and critical values are tabulated for a range of two-stage design parameters. The methods are exemplified using data from the Physicians' Reliability Study for diagnosis of endometriosis.

Sequential testing is widely used in clinical trials, and it is natural to adopt and extend these methods in the design and analysis of reliability studies to reduce the sample size and study cost. For reliability studies that evaluate measurement error under the one-way ANOVA model, multistage group sequential designs have been proposed. Under one-way ANOVA, the sums of squares used to estimate the intraclass correlation coefficient possess independent increments, which simplifies the calculation of stopping boundaries (Liu, Schisterman, and Wu, 2006).

In this paper, we develop multistage testing procedures using two-way ANOVA for hypotheses concerning the intraclass correlation coefficient in an inter-rater reliability study. In Section 2, we state the hypotheses of interest, introduce the structure and assumptions of the two-way ANOVA model, and propose simulation designs for sample size and power calculation in the one-stage problem. In Section 3, we develop methods to determine critical values, sample size, and power using the error spending approach of Lan and DeMets. Because the between-rater sum of squares violates the independent increments assumption, we develop simulation techniques to calculate the critical values efficiently. The performance of the proposed methods is examined in Section 4 using simulation studies, and critical values are tabulated for a range of two-stage design parameters. In Section 5, we exemplify the methods using data from the Physicians' Reliability Study for diagnosis of endometriosis. Finally, a summary and discussion are given in Section 6.
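To make the two main ingredients concrete, the following minimal sketch may help. It is not the procedure developed in this paper. It assumes the common single-rating, absolute-agreement form of the intraclass correlation coefficient under a two-way random effects model with one rating per subject-rater pair, and it shows two widely used Lan-DeMets spending functions (O'Brien-Fleming type and Pocock type). The function names, the choice of spending functions, and the toy ratings are illustrative assumptions; the ratings are not data from the Physicians' Reliability Study.

```python
from math import e, log, sqrt
from statistics import NormalDist

# Toy ratings: rows are subjects, columns are raters (illustration only).
TOY_RATINGS = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]


def icc_two_way(ratings):
    """Point estimate of the ICC from two-way ANOVA mean squares.

    Assumes subjects and raters are both random effects, one rating per
    cell, and the absolute-agreement, single-rating definition.
    """
    n = len(ratings)       # number of subjects
    k = len(ratings[0])    # number of raters
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Between-subject, between-rater, and residual mean squares.
    ms_subj = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_rater = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_err = sum(
        (ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    ms_err = ss_err / ((n - 1) * (k - 1))

    denom = ms_subj + (k - 1) * ms_err + k * (ms_rater - ms_err) / n
    return (ms_subj - ms_err) / denom


def spend_obf(t, alpha):
    """O'Brien-Fleming-type spending: cumulative type I error at fraction t in (0, 1]."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / sqrt(t)))


def spend_pocock(t, alpha):
    """Pocock-type spending: cumulative type I error at fraction t in (0, 1]."""
    return alpha * log(1 + (e - 1) * t)


if __name__ == "__main__":
    print("ICC estimate:", round(icc_two_way(TOY_RATINGS), 4))
    # Two-stage design with the interim look at half the information.
    for name, f in (("OBF-type", spend_obf), ("Pocock-type", spend_pocock)):
        print(name, "alpha spent at t=0.5:", f(0.5, 0.05))
```

Both spending functions spend the full significance level at t = 1; the O'Brien-Fleming-type function spends far less at the interim look, leaving more for the final analysis. Turning the spent error into critical values for the intraclass correlation coefficient test is the step that requires simulation when increments are not independent, and it is not attempted in this sketch.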