In medical imaging, inter- and intra-rater agreement measures provide a useful means of assessing the reliability of a rating system, which is important in disease diagnosis. Our research was motivated by a study evaluating classification methods for chest radiographs for pneumoconiosis developed by the International Labour Office in Geneva. The same subjects were evaluated twice by multiple readers using different formats. The focus was on comparing the intra-reader reliability of these formats, which, due to the sampling design, are correlated. Earlier work in this area addressed the problem under the assumption that the readers are homogeneous. Our modification offers a Bayesian approach that avoids such simplifying assumptions. Simulation studies showed that our model outperforms frequentist methods in terms of type I error and power, even when the rating probabilities differ moderately. We further developed a Bayesian model for comparing dependent agreement measures that adjusts for subject- and rater-level heterogeneity, adopting a joint analysis that alleviates the potential bias stemming from a two-stage method.
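The intra-rater agreement measures discussed above are typically chance-corrected statistics such as Cohen's kappa. As a minimal illustration only (not the paper's Bayesian model, which additionally handles correlated measures and reader heterogeneity), the following sketch computes kappa between two hypothetical reads of the same subjects; the data and category labels are invented for the example:

```python
from collections import Counter

def cohens_kappa(read1, read2):
    """Chance-corrected agreement between two reads of the same subjects."""
    assert len(read1) == len(read2) and len(read1) > 0
    n = len(read1)
    # Observed agreement: fraction of subjects rated identically both times.
    p_o = sum(a == b for a, b in zip(read1, read2)) / n
    # Expected agreement under independence, from the marginal category frequencies.
    m1, m2 = Counter(read1), Counter(read2)
    p_e = sum(m1[c] * m2[c] for c in m1) / n**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: three rating categories (0, 1, 2) for 10 films read twice.
first_read  = [0, 0, 1, 1, 2, 2, 0, 1, 2, 0]
second_read = [0, 0, 1, 2, 2, 2, 0, 1, 1, 0]
kappa = cohens_kappa(first_read, second_read)  # about 0.70
```

Comparing two such kappas naively ignores that both are estimated from the same subjects; the joint Bayesian analysis described above accounts for exactly that dependence.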