Considerations in the choice of interobserver reliability estimates.
Pick percentage agreement for trial data and correlation for session totals, and always tell the reader which you used.
01 Research in Context
What this study did
Rider (1977) wrote a think-piece, not an experiment. He compared two ways to check if two observers see the same thing.
One way is simple percentage agreement: both observers count, then you divide matches by total. The other way is correlation: you plot one observer’s counts against the other’s and see if the line slopes up.
He asked when each number tells the truth and when it fools you.
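Both computations can be sketched in a few lines of Python; the counts below are invented for illustration, not data from the paper.

```python
# Hypothetical trial-level records for two observers (1 = saw the behavior, 0 = did not)
obs_a = [1, 0, 1, 1, 0, 1, 0, 1]
obs_b = [1, 0, 1, 0, 0, 1, 0, 1]

# Percentage agreement: trials where both observers match, divided by total trials
matches = sum(a == b for a, b in zip(obs_a, obs_b))
pct_agreement = 100 * matches / len(obs_a)  # 7 of 8 trials match -> 87.5

# Hypothetical session totals for the same pair across five sessions
totals_a = [12, 15, 9, 20, 14]
totals_b = [11, 16, 8, 19, 15]

def pearson(x, y):
    """Pearson correlation between two equal-length count series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

r = pearson(totals_a, totals_b)  # high: the observers rise and fall together
```

Note that the two numbers answer different questions: percentage agreement scores each trial, while the correlation only asks whether session totals move together.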
What they found
Percentage agreement works best when you score each single yes-no trial. It catches big misses but can hide bias if both observers are wrong the same way.
Correlation works best when you compare total counts across whole sessions. It shows if the observers move up and down together, but it can be high even when both miss lots of behavior.
No single number is perfect; you have to pick the one that fits your question.
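The correlation caveat is easy to demonstrate with made-up numbers: if observer B systematically records only half of what observer A records, the correlation is still perfect.

```python
# Hypothetical session totals: observer B consistently records half
# of what observer A records (e.g., B keeps missing a class of responses).
totals_a = [10, 20, 14, 30, 22]
totals_b = [t // 2 for t in totals_a]  # [5, 10, 7, 15, 11]

def pearson(x, y):
    """Pearson correlation between two equal-length count series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

r = pearson(totals_a, totals_b)            # essentially 1.0: "perfect" correlation
shortfall = sum(totals_a) - sum(totals_b)  # yet B missed 48 responses overall
```

A trial-level percentage agreement check on the same data would have exposed the undercounting immediately.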
How this fits with other research
Cook et al. (2020) later showed the same worry in momentary time sampling. They told us to spot-check with real duration data so drift does not hide behind a high correlation.
Newland (2024) moves the debate forward. He fixes the risk-ratio formula that many of us plug reliability numbers into, proving the 1977 warning still matters today.
Iwata et al. (1990) give a live demo: the SIT Scale reached 89% agreement across items, showing that tight trial-level checks are possible when the code is clear.
Why it matters
Next time you train staff, pick your check to match the grain of your data. Use quick percentage agreement for discrete trial sheets. Use correlation for summary sheets across days. Report both if you can. This old paper keeps you from bragging about a high number that hides sloppy watching.
Open your last five IOA sheets and label each as trial-level or session-level, then swap in the matching index if you used the wrong one.
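That audit could be sketched as a small script; the sheet labels and index names below are hypothetical placeholders for whatever your own sheets record.

```python
# Hypothetical audit list: (data grain of the sheet, index actually used)
sheets = [
    ("trial", "correlation"),             # mismatch
    ("session", "correlation"),           # fine
    ("trial", "percentage agreement"),    # fine
    ("session", "percentage agreement"),  # mismatch
]

# Match the reliability index to the grain of the data
MATCHING = {"trial": "percentage agreement", "session": "correlation"}

mismatches = [
    (grain, used, MATCHING[grain])
    for grain, used in sheets
    if used != MATCHING[grain]
]
for grain, used, should_be in mismatches:
    print(f"{grain}-level sheet used {used}; swap in {should_be}")
```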
02 At a glance
03 Original abstract
Two types of interobserver reliability values may be needed in treatment studies in which observers constitute the primary data-acquisition system: trial reliability and the reliability of the composite unit or score which is subsequently analyzed, e.g., daily or weekly session totals. Two approaches to determining interobserver reliability are described: percentage agreement and "correlational" measures of reliability. The interpretation of these estimates, factors affecting their magnitude, and the advantages and limitations of each approach are presented.
Journal of Applied Behavior Analysis, 1977 · doi:10.1901/jaba.1977.10-103