Measuring the reliability of observational data: a reactive process.
Announcing a reliability check inflates observer agreement, so collect some data without telling your coders.
01 Research in Context
What this study did
Romanczyk et al. (1973) asked a simple question: does telling observers you will check their agreement change how they code? They ran two conditions. In the overt condition, observers were told that one of two assessors, identified to them, would check their reliability. In the covert condition, both assessors kept collecting data, but the check stayed hidden.
The team then looked at how often each observer matched the two assessors. They wanted to see whether the simple act of announcing a reliability check skewed the numbers.
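To make that comparison concrete, here is a minimal sketch of interval-by-interval percent agreement, a standard index for observational codes of this kind. The session records and function below are hypothetical, for illustration only:

```python
# Minimal sketch of interval-by-interval percent agreement (hypothetical data).
# 1 = behavior scored in that interval, 0 = not scored.

def percent_agreement(observer, assessor):
    """Percent of observation intervals scored identically by both coders."""
    if len(observer) != len(assessor):
        raise ValueError("Both records must cover the same intervals")
    matches = sum(o == a for o, a in zip(observer, assessor))
    return 100 * matches / len(observer)

observer_record = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]  # trainee observer
assessor_record = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]  # reliability assessor

print(f"Agreement: {percent_agreement(observer_record, assessor_record):.0f}%")
# -> Agreement: 80%
```

Interval-by-interval agreement is strict: both coders must score the same interval the same way, which makes it harder to inflate than a simple comparison of session totals.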
What they found
When observers knew they were being checked, agreement with the identified assessor shot up, while agreement with the unidentified assessor lagged behind. The covert condition gave lower numbers with both assessors, and observers also recorded less disruptive behavior than either assessor. The authors argue the covert figures are the honest ones.
In short, the warning cue alone pushed coders to drift toward the known assessor's criteria.
How this fits with other research
Harris et al. (1978) extended the warning: they showed that even when reliability checks stay in place, watching a teacher hand out praise can nudge observers to over-count eye contact. Three of six coders inflated scores after seeing rewards delivered.
Palmer et al. (2018) echoed the same theme with college students. Just having an experimenter in the room cut off-task behavior in half. Both papers shout the same message: measurement itself changes what you measure.
Fahmie et al. (2013) built a new autism scale and reported decent inter-rater reliability. Yet that claim is open to the same reactivity Romanczyk et al. exposed: if the coders knew a check was coming, their agreement may be artificially rosy.
Why it matters
Next time you train a new RBT to collect data, run covert reliability trials at unannounced times. Rotate second observers quietly into sessions. Report both the open and hidden agreement numbers in your treatment reports. This small step keeps your data honest and your clinical decisions solid.
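One hedged sketch of how "report both numbers" might look in practice: tag each reliability session as overt or covert and average agreement within each tag. All session labels and figures below are hypothetical:

```python
# Minimal sketch: report overt (announced) and covert (unannounced) reliability
# separately. All session labels and agreement figures are hypothetical.
from statistics import mean

sessions = [
    {"condition": "overt",  "agreement": 92.0},
    {"condition": "overt",  "agreement": 95.0},
    {"condition": "covert", "agreement": 71.0},
    {"condition": "covert", "agreement": 68.0},
]

for condition in ("overt", "covert"):
    scores = [s["agreement"] for s in sessions if s["condition"] == condition]
    print(f"{condition}: mean agreement {mean(scores):.1f}% over {len(scores)} sessions")
```

A persistent gap between the two means is your estimate of reactivity; per Romanczyk et al., expect the covert figure to run lower.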
Schedule one silent reliability session this week: have a second observer code without the primary therapist knowing.
02 At a glance
03 Original abstract
Reliability of observational data was measured simultaneously by two assessors under two experimental conditions. During overt assessment, observers were told that reliability would be measured by one of the two assessors, thus permitting computation of reliability with an identified and an unidentified assessor. During covert assessment, observers were not informed that reliability was being measured. Throughout the study, each of the assessors employed a unique version of a standard observational code. In the overt assessment condition, reliability of observers with the identified assessor was consistently higher than reliability with the unidentified assessor, indicating that observers modified their observational criteria to approximate those of the identified assessor. In the covert assessment condition, reliability with the two assessors was substantially lower than during overt assessment. Further, observers consistently recorded lower frequencies of disruptive behavior than the two assessors during covert assessment.
Journal of Applied Behavior Analysis, 1973 · doi:10.1901/jaba.1973.6-175