Influences of response rate and distribution on the calculation of interobserver reliability scores.
High-rate or end-of-interval responding can make total IOA look perfect while exact-agreement IOA reveals real observer disagreement.
01 Research in Context
What this study did
Rolider et al. (2012) ran math checks on four common IOA formulas: total, interval, exact-agreement, and proportional reliability. They asked what happens when the same data have high response rates or most responses pile up at the end of each interval.
Trained observers scored computer-generated responding on a screen, and the resulting records were pushed through all four calculations. The goal was to see if the numbers still tell the truth when behavior gets fast or bunched.
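To make the comparison concrete, here is a minimal Python sketch of the four calculations, using per-interval response counts from two observers. The formulas follow common textbook definitions of these indices; the exact scoring conventions used in the study may differ in detail.

```python
def total_ioa(a, b):
    """Total (count) IOA: smaller session total / larger session total x 100."""
    lo, hi = sorted([sum(a), sum(b)])
    return 100.0 if hi == 0 else 100.0 * lo / hi

def interval_ioa(a, b):
    """Interval IOA: % of intervals where both observers agree that the
    response occurred (count >= 1) or did not occur (count == 0)."""
    agree = sum((x > 0) == (y > 0) for x, y in zip(a, b))
    return 100.0 * agree / len(a)

def exact_agreement_ioa(a, b):
    """Exact-agreement IOA: % of intervals where the two counts match exactly."""
    agree = sum(x == y for x, y in zip(a, b))
    return 100.0 * agree / len(a)

def proportional_ioa(a, b):
    """Proportional IOA: smaller count / larger count within each interval
    (scored 1.0 when both counts are zero), averaged across intervals."""
    ratios = [1.0 if max(x, y) == 0 else min(x, y) / max(x, y)
              for x, y in zip(a, b)]
    return 100.0 * sum(ratios) / len(ratios)
```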
What they found
Exact-agreement IOA cracked first. In Study 1 it was the lowest of the four measures, especially for high-rate responding; Study 2 showed the real culprit was responses landing at the end of intervals, not rate or bursting per se.
Total IOA stayed high and cheerful even when observers quietly disagreed interval by interval, and interval IOA was spuriously high for fast responding. The paper warns that total IOA can hide real disagreement.
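Plugging made-up counts into the functions above shows the masking effect in miniature. Both observers total 10 responses, so total IOA is perfect, yet no single interval matches exactly (these numbers are hypothetical, not from the study):

```python
obs1 = [5, 0, 3, 2]   # observer 1, responses per interval (hypothetical)
obs2 = [3, 2, 2, 3]   # observer 2, same session (hypothetical)

print(total_ioa(obs1, obs2))            # 100.0 -- looks perfect
print(interval_ioa(obs1, obs2))         # 75.0
print(proportional_ioa(obs1, obs2))     # ~48.3
print(exact_agreement_ioa(obs1, obs2))  # 0.0  -- no interval matches exactly
```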
How this fits with other research
Cox et al. (2025) now give you eight better tools. They say swap in precision, recall, or F1 instead of raw percent agreement (a rough sketch of that idea follows this section). Their toolkit directly targets the ceiling-effect trap that Rolider et al. exposed.
Hausman et al. (2022) pushed the same lever from another angle. They cut the number of IOA sessions and still saw the same rate-driven swings, which suggests the problem holds whether you collect a lot of agreement data or a little.
Jones et al. (1977) saw it coming. They were already calling for kappa and occurrence/nonoccurrence splits long before Rolider et al. proved raw percent agreement can lie.
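Cox et al.'s published formulations aren't reproduced here, but the general idea is easy to sketch: treat one observer's interval record as the reference and score the other observer's record against it. The function below is an illustrative assumption of how that could look, not the toolkit itself.

```python
def precision_recall_f1(ref, other):
    """Precision/recall/F1 over interval occurrence records, treating `ref`
    (observer 1) as ground truth. Illustrative only; not Cox et al.'s code."""
    ref_occ = [x > 0 for x in ref]      # did observer 1 score the interval?
    oth_occ = [x > 0 for x in other]    # did observer 2 score the interval?
    tp = sum(r and o for r, o in zip(ref_occ, oth_occ))      # both scored it
    fp = sum(o and not r for r, o in zip(ref_occ, oth_occ))  # only observer 2
    fn = sum(r and not o for r, o in zip(ref_occ, oth_occ))  # only observer 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1
```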
Why it matters
Next time you track hand flaps or vocal stereotypy that tops 100 responses per minute, do not trust a glossy 95% total IOA. Flip to exact-agreement or Cox's precision score. If the second number dips, retrain observers before you trust any trend line.
Run both total and exact-agreement IOA on your fastest behavior this week; if the gap is over 10 percentage points, retrain observers.
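Continuing the sketch above, that weekly check is a two-line comparison. The 10-point threshold is this article's rule of thumb, not a value from Rolider et al. (2012):

```python
gap = total_ioa(obs1, obs2) - exact_agreement_ioa(obs1, obs2)
if gap > 10:  # 10-point cutoff is the article's rule of thumb, not the paper's
    print(f"Gap = {gap:.1f} points -- retrain observers before trusting the trend.")
```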
02 Original abstract
We examined the effects of several variations in response rate on the calculation of total, interval, exact-agreement, and proportional reliability indices. Trained observers recorded computer-generated data that appeared on a computer screen. In Study 1, target responses occurred at low, moderate, and high rates during separate sessions so that reliability results based on the four calculations could be compared across a range of values. Total reliability was uniformly high, interval reliability was spuriously high for high-rate responding, proportional reliability was somewhat lower for high-rate responding, and exact-agreement reliability was the lowest of the measures, especially for high-rate responding. In Study 2, we examined the separate effects of response rate per se, bursting, and end-of-interval responding. Response rate and bursting had little effect on reliability scores; however, the distribution of some responses at the end of intervals decreased interval reliability somewhat, proportional reliability noticeably, and exact-agreement reliability markedly.
Journal of Applied Behavior Analysis, 2012 · doi:10.1901/jaba.2012.45-753