Observer reliability as a function of circumstances of assessment.
Reliability checks can be gamed. To get honest IOA, keep observers blind to when they are checked, monitor sessions, and calculate reliability across independent pairs.
01 Research in Context
What this study did
The team watched how observers act when they know someone is checking their data.
They changed three things: telling or not telling observers about checks, having a boss in the room or not, and letting pairs grade themselves or using outside pairs.
Then they compared the IOA numbers each setup produced.
What they found
IOA jumped when observers knew they were being checked that day.
Scores also rose when no boss watched the session and when reliability was computed between partners in the same group instead of with observers from another group.
In short, the same video could give very different reliability scores depending on the check setup.
How this fits with other research
Normand et al. (2023) warn that researchers who wear both clinician and scientist hats can bias data; Branch et al. (1977) show even well-meaning observers can tilt numbers if the check system is loose.
Critchfield et al. (2003) found that a “reward” can secretly punish; here, a “check” can secretly inflate. Both papers scream the same message: tiny procedural details swing results.
Schmidt et al. (1969) got clean data by using clear, reset-based rules in class; the present study shows you need equally clear, blind, cross-pair rules to get honest IOA.
Why it matters
You can’t trust high IOA if your observers know today is check day, sit alone, and swap sheets with a friend. Build in blind spot-checks, rotate outside pairs, and drop in unannounced. Honest data starts with honest measurement.
Try this: pick one client and have a second staff member quietly score IOA from a video file the first observer did not know was saved.
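If it helps to see the arithmetic, here is a minimal Python sketch of one common way to score such a check: occurrence agreement, which counts only the intervals where at least one observer marked the behavior. The function name, the interval data, and the choice of this particular formula are illustrative assumptions, not the procedure used in the study.

```python
from typing import Optional, Sequence


def occurrence_agreement(a: Sequence[bool], b: Sequence[bool]) -> Optional[float]:
    """Percent occurrence agreement between two observers' interval records.

    Only intervals in which at least one observer scored the behavior count.
    Returns None when neither observer scored the behavior at all.
    """
    if len(a) != len(b):
        raise ValueError("records must cover the same number of intervals")
    # Intervals where both observers scored the behavior.
    both = sum(1 for x, y in zip(a, b) if x and y)
    # Intervals where at least one observer scored the behavior.
    either = sum(1 for x, y in zip(a, b) if x or y)
    if either == 0:
        return None
    return 100.0 * both / either


# Hypothetical data: 10 intervals scored from the same saved video.
primary = [True, True, False, False, True, False, True, False, False, True]
blind_checker = [True, False, False, False, True, False, True, True, False, True]

score = occurrence_agreement(primary, blind_checker)
print(f"Occurrence agreement: {score:.1f}%")
```

With these made-up records the observers agree on 4 of the 6 intervals where either one scored the behavior, so occurrence agreement is about 66.7%. Counting every interval, including the 4 that both left blank, would give 80%, which is one reason the choice of formula matters as much as the check setup.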
03 Original abstract
Three factors characteristic of experimental settings were hypothesized to inflate artifactually the reliability of observational recordings: (a) knowledge by observers of when and by whom their reliability is being assessed, (b) the absence of the experimenter or a monitor to prevent cheating, and (c) computation of reliability within- (versus between-) observer group. Three groups of four observers used a standard nine-category observational code for disruptive behavior in recording from videotapes of a classroom for 22 days. Analyses revealed considerable increases in average occurrence reliability as a function of the main effects of each of the experimental factors. The specific increases in reliability associated with each of the 12 combinations of the experimental factors are presented for each category of behavior. The possible role of observer-training procedures and behavioral definitions as determiners of nonartifactual reliability is discussed.
Journal of Applied Behavior Analysis, 1977 · doi:10.1901/jaba.1977.10-317