I was looking at some reviewer ratings of proficiency test items the other day and found something interesting. Expert raters were asked to classify items according to their proficiency level (e.g., this item is appropriate for a person with proficiency level A, B, or C). The details of the items and the proficiency scale are not important, beyond the fact that some items tested reading and others tested listening.
For both skills, the reviewers had moderate inter-rater reliability. In other words, reviewers generally agreed with each other on how to categorize the items. However, the agreement between the test developer's intended level for each item and the reviewers' average rating was much higher in one skill than in the other.
This creates a bit of an interpretational conundrum. If the divergence in agreement were seen across both skills, one might deduce that the reviewers, while consistent, were too harsh (or lenient, depending on the direction of the divergence) in their application of the scale. Alternatively, it could be that the item writers were overly harsh (or lenient) in targeting their items. If different item writers worked on different skills, that could also explain why the divergence shows up in only one skill.
And, of course, there are more mundane explanations too. With a small number of categories, there are only so many ways that two ratings can differ, so some agreement will arise by chance alone and the perceived agreement could be partly illusory.
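To make the chance-agreement point concrete, here is a minimal sketch in Python. It compares raw percent agreement with Cohen's kappa, which discounts the agreement two raters would be expected to reach by chance given how often each uses each category. The ratings are made up purely for illustration and are not the data discussed in this post; the function names are likewise my own, and kappa is only one of several chance-corrected statistics that could be used here.

```python
from collections import Counter

def observed_agreement(a, b):
    """Proportion of items on which the two raters chose the same category."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def chance_agreement(a, b):
    """Agreement expected by chance, based on each rater's category frequencies."""
    n = len(a)
    counts_a, counts_b = Counter(a), Counter(b)
    categories = set(a) | set(b)
    return sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    p_o = observed_agreement(a, b)
    p_e = chance_agreement(a, b)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical ratings of ten items on a three-level scale (illustration only).
rater_1 = list("AABBBCCBBA")
rater_2 = list("ABBBBCBBBA")

p_o = observed_agreement(rater_1, rater_2)
p_e = chance_agreement(rater_1, rater_2)
kappa = cohens_kappa(rater_1, rater_2)

print(f"observed agreement: {p_o:.2f}")  # 0.80
print(f"chance agreement:   {p_e:.2f}")  # 0.43
print(f"Cohen's kappa:      {kappa:.2f}")  # 0.65
```

In this toy example the raters agree on 80% of items, but with only three categories and both raters favoring level B, roughly 43% agreement would be expected by chance, so the chance-corrected figure is noticeably lower.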
I don't have any information on the qualifications of the item writers or reviewers, details of training provided, etc., so it is impossible to identify the source of the disagreement.
The takeaway is that there is no statistical test for ground truth.
Saturday, March 24, 2018