Considerations in the choice of interobserver reliability estimates.
Pick percentage agreement for trial data and correlation for session totals, and always tell the reader which you used.
01 Research in Context
What this study did
Rider (1977) wrote a think-piece, not an experiment. He compared two ways to check if two observers see the same thing.
One way is simple percentage agreement: both observers count, then you divide matches by total. The other way is correlation: you plot one observer’s counts against the other’s and see if the line slopes up.
He asked when each number tells the truth and when it fools you.
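Both computations can be sketched in a few lines of Python; the counts below are invented for illustration, not data from the paper.

```python
# Hypothetical trial-level records for two observers (1 = saw the behavior, 0 = did not)
obs_a = [1, 0, 1, 1, 0, 1, 0, 1]
obs_b = [1, 0, 1, 0, 0, 1, 0, 1]

# Percentage agreement: trials where both observers match, divided by total trials
matches = sum(a == b for a, b in zip(obs_a, obs_b))
pct_agreement = 100 * matches / len(obs_a)  # 7 of 8 trials match -> 87.5

# Hypothetical session totals for the same pair across five sessions
totals_a = [12, 15, 9, 20, 14]
totals_b = [11, 16, 8, 19, 15]

def pearson(x, y):
    """Pearson correlation between two equal-length count series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

r = pearson(totals_a, totals_b)  # high: the observers rise and fall together
```

Note that the two numbers answer different questions: percentage agreement scores each trial, while the correlation only asks whether session totals move together.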
What they found
Percentage agreement works best when you score each single yes-no trial. It catches big misses but can hide bias if both observers are wrong the same way.
Correlation works best when you compare total counts across whole sessions. It shows if the observers move up and down together, but it can be high even when both miss lots of behavior.
No single number is perfect; you have to pick the one that fits your question.
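The correlation caveat is easy to demonstrate with made-up numbers: if observer B systematically records only half of what observer A records, the correlation is still perfect.

```python
# Hypothetical session totals: observer B consistently records half
# of what observer A records (e.g., B keeps missing a class of responses).
totals_a = [10, 20, 14, 30, 22]
totals_b = [t // 2 for t in totals_a]  # [5, 10, 7, 15, 11]

def pearson(x, y):
    """Pearson correlation between two equal-length count series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

r = pearson(totals_a, totals_b)            # essentially 1.0: "perfect" correlation
shortfall = sum(totals_a) - sum(totals_b)  # yet B missed 48 responses overall
```

A trial-level percentage agreement check on the same data would have exposed the undercounting immediately.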
How this fits with other research
Cook et al. (2020) later showed the same worry in momentary time sampling. They told us to spot-check with real duration data so drift does not hide behind a high correlation.
Newland (2024) moves the debate forward. He fixes the risk-ratio formula that many of us plug reliability numbers into, proving the 1977 warning still matters today.
Iwata et al. (1990) give a live demo: the SIT Scale reached 89% agreement across items, showing that tight trial-level checks are possible when the code is clear.
Why it matters
Next time you train staff, pick your check to match the grain of your data. Use quick percentage agreement for discrete trial sheets. Use correlation for summary sheets across days. Report both if you can. This old paper keeps you from bragging about a high number that hides sloppy watching.
Open your last five IOA sheets and label each as trial-level or session-level, then swap in the matching index if you used the wrong one.
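That audit could be sketched as a small script; the sheet labels and index names below are hypothetical placeholders for whatever your own sheets record.

```python
# Hypothetical audit list: (data grain of the sheet, index actually used)
sheets = [
    ("trial", "correlation"),             # mismatch
    ("session", "correlation"),           # fine
    ("trial", "percentage agreement"),    # fine
    ("session", "percentage agreement"),  # mismatch
]

# Match the reliability index to the grain of the data
MATCHING = {"trial": "percentage agreement", "session": "correlation"}

mismatches = [
    (grain, used, MATCHING[grain])
    for grain, used in sheets
    if used != MATCHING[grain]
]
for grain, used, should_be in mismatches:
    print(f"{grain}-level sheet used {used}; swap in {should_be}")
```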
02 At a glance
03 Original abstract
Two types of interobserver reliability values may be needed in treatment studies in which observers constitute the primary data-acquisition system: trial reliability and the reliability of the composite unit or score which is subsequently analyzed, e.g., daily or weekly session totals. Two approaches to determining interobserver reliability are described: percentage agreement and "correlational" measures of reliability. The interpretation of these estimates, factors affecting their magnitude, and the advantages and limitations of each approach are presented.
Journal of Applied Behavior Analysis, 1977 · doi:10.1901/jaba.1977.10-103