A review of the observational data-collection and reliability procedures reported in the Journal of Applied Behavior Analysis.
Early JABA papers under-checked observer agreement. Keep your own reliability at 90% or above, and measure it in every condition.
01 Research in Context
What this study did
The author read every paper in JABA from 1968 through 1975. He counted how many used direct observation. He noted which studies reported reliability checks and how often.
What the study found
Three out of four studies watched behavior live. Almost all said "we checked reliability." Yet only about half kept agreement scores consistently above the 90% mark. Fewer than one in four checked reliability in every condition.
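For readers who want to see the arithmetic, here is a minimal sketch of the two scores the survey found most often: total percentage agreement and point-by-point (interval-by-interval) agreement. The interval records are hypothetical, invented for illustration; they are not data from the paper.

```python
# Minimal sketch of the two agreement scores the survey found most often.
# The interval records below are hypothetical, not data from the paper.

def total_agreement(primary, secondary):
    # Total percentage agreement: compare each observer's overall count,
    # then take smaller/larger * 100.
    c1, c2 = sum(primary), sum(secondary)
    return 100.0 if max(c1, c2) == 0 else min(c1, c2) / max(c1, c2) * 100

def point_by_point_agreement(primary, secondary):
    # Point-by-point (interval-by-interval) agreement:
    # agreements / (agreements + disagreements) * 100.
    agreements = sum(a == b for a, b in zip(primary, secondary))
    return agreements / len(primary) * 100

# 1 = behavior scored in that interval, 0 = not scored
primary   = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
secondary = [1, 0, 1, 0, 0, 1, 1, 1, 1, 0]

print(f"total agreement: {total_agreement(primary, secondary):.0f}%")          # 100% (6 vs 6)
print(f"point-by-point:  {point_by_point_agreement(primary, secondary):.0f}%") # 80% (8 of 10)
```

Note that total agreement reads 100% here even though the observers disagreed on two individual intervals. That is one reason point-by-point scoring is the stricter of the two checks.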
How this fits with other research
Lancioni et al. (2008) later showed the eyeball method itself is shaky. Raters looking at the same FA graphs agreed only about half the time. This builds on the 1977 review by showing that weak reliability is not just under-reported; it is also hard to achieve.
Spanoudis et al. (2011) offered a fix: calibrate observers against scripted videos instead of trusting a second human. Their observers' accuracy landed within 0.1 responses per minute of the scripted values. This successor paper suggests the field has moved beyond simple inter-observer agreement.
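As a rough illustration of that calibration logic (the function and numbers here are hypothetical, not Spanoudis et al.'s actual procedure), an observer's scored rate can be checked against the rate scripted into a criterion video:

```python
# Hypothetical sketch of criterion-video calibration: check an observer's
# scored rate against the rate scripted into a gold-standard video.

def calibrated(observer_count, scripted_count, session_minutes, tolerance=0.1):
    # True when the observer's rate falls within `tolerance` responses per
    # minute of the scripted rate (0.1 is the figure reported above).
    observed = observer_count / session_minutes
    scripted = scripted_count / session_minutes
    return abs(observed - scripted) <= tolerance

# A 10-minute video scripted to contain 42 target responses:
print(calibrated(observer_count=41, scripted_count=42, session_minutes=10))  # True  (off by 0.1/min)
print(calibrated(observer_count=38, scripted_count=42, session_minutes=10))  # False (off by 0.4/min)
```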
Taylor et al. (2022) added computer tools. Objective algorithms matched visual inspection on only 84% of graphs. The gap echoes the original review's warning: without tight checks, our conclusions wobble.
Why it matters
If you run sessions without fresh reliability data, you may be chasing noise. Check agreement at least once per condition and aim for at least 90%. Better yet, calibrate observers with gold-standard videos or add an algorithmic backup. Solid data make solid decisions.
Try it this week: schedule one reliability check in each condition and score it live.
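One way to keep yourself honest about that tip (a minimal sketch; the session log and condition names are invented): group each session's agreement score by condition and flag anything missing or below 90%.

```python
# Hypothetical session log: flag conditions that still lack a reliability
# check, or whose agreement fell below the 90% benchmark.
from collections import defaultdict

sessions = [
    {"condition": "baseline",     "ioa": 94.0},
    {"condition": "baseline",     "ioa": None},  # no second observer that day
    {"condition": "intervention", "ioa": 86.0},
    {"condition": "follow-up",    "ioa": None},
]

scores_by_condition = defaultdict(list)
for s in sessions:
    if s["ioa"] is not None:
        scores_by_condition[s["condition"]].append(s["ioa"])

for condition in dict.fromkeys(s["condition"] for s in sessions):  # keep session order
    scores = scores_by_condition.get(condition, [])
    if not scores:
        print(f"{condition}: no reliability check yet -- schedule one")
    elif min(scores) < 90.0:
        print(f"{condition}: agreement dipped to {min(scores):.0f}% -- retrain or recalibrate")
    else:
        print(f"{condition}: on track (lowest score {min(scores):.0f}%)")
```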
02 At a glance
- Survey scope: every JABA study from 1968 through 1975.
- Three-quarters of studies reported observational data.
- Common methods: event recording, trial scoring, interval recording, and time-sample recording.
- Almost all studies reported observer reliability, usually as total or point-by-point percentage agreement.
- About half of the agreement scores stayed consistently above 90%.
- Fewer than one in four studies assessed reliability at least once per condition.
03 Original abstract
The research published in the Journal of Applied Behavior Analysis (1968 to 1975) was surveyed for three basic elements: data-collection methods, reliability procedures, and reliability scores. Three-quarters of the studies reported observational data. Most of these studies' observational methods were variations of event recording, trial scoring, interval recording, or time-sample recording. Almost all studies reported assessment of observer reliability, usually total or point-by-point percentage agreement scores. About half the agreement scores were consistently above 90%. Less than one-quarter of the studies reported that reliability was assessed at least once per condition.
Journal of Applied Behavior Analysis, 1977 · doi:10.1901/jaba.1977.10-97