A cautionary note on the use of probability values to evaluate interobserver agreement.
Stop relying on p-values for interobserver agreement (IOA) when data points influence each other. Analyze only every kth interval, or model the links.
Research in Context
What this study did
The authors looked at how we test interobserver agreement. They asked: do the usual p-values still work when data points are linked over time?
They wrote a short warning paper. They showed that standard chi-square tests give false positives when one score predicts the next score.
What they found
The paper says the math breaks down. If your behavior data are serially correlated, the usual significance tests return erroneous p-values for IOA.
They give two fixes. Analyze only every kth interval (for example, every other one) to break the chain. Or switch to a Markov model that expects the links.
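The first fix can be sketched in a few lines of Python. This is a minimal illustration, not the authors' procedure: the function names, the sample records, and the choice of k = 2 are all assumptions made for the example, and the paper's abstract does not say how to pick k.

```python
from typing import List, Sequence


def thin(scores: Sequence[int], k: int, start: int = 0) -> List[int]:
    """Keep every k-th interval, beginning at index `start`."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return list(scores[start::k])


def percent_agreement(a: Sequence[int], b: Sequence[int]) -> float:
    """Proportion of intervals on which both observers gave the same score."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("records must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical interval records: 1 = behavior scored, 0 = not scored.
observer_a = [1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1]
observer_b = [1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1]

k = 2  # assumed spacing; a larger k may be needed if the links run longer
thinned_a = thin(observer_a, k)
thinned_b = thin(observer_b, k)

full_ioa = percent_agreement(observer_a, observer_b)
thinned_ioa = percent_agreement(thinned_a, thinned_b)

print(f"IOA on all {len(observer_a)} intervals: {full_ioa:.3f}")
print(f"IOA on every {k}th interval ({len(thinned_a)} kept): {thinned_ioa:.3f}")
```

Any significance test would then be run on the thinned records only. Thinning also shrinks the sample, so the spacing is a trade-off between breaking the chain and keeping enough intervals to analyze.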
How this fits with other research
Parsons et al. (1981) found observers cheat when they score their own agreement. Hartmann et al. (1982) add a second trap: even honest scores can fail the math test.
Fisch (1998) showed our eyes miss slow trends. Together these papers say: don’t trust people, don’t trust p-values, and don’t trust your eyes alone.
Hastings et al. (2001) later showed staff reports swing wildly day to day. That daily swing is the same serial correlation Hartmann et al. (1982) warned about.
Why it matters
Before you report “significant IOA,” run a quick check. Plot your data. If one interval looks like the next, skip intervals or use the Markov fix. It takes five extra minutes and saves you from publishing bad numbers.
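A plot can be backed up with a number. The sketch below computes a lag-1 autocorrelation for one observer's interval record, as a rough way to see whether one interval tends to look like the next. The function name and sample records are hypothetical, and the source gives no cutoff for "too correlated," so treat the result as a screening aid rather than a formal test.

```python
import math
from typing import Sequence


def lag1_autocorrelation(x: Sequence[float]) -> float:
    """Lag-1 autocorrelation: how strongly each score tracks the one before it.

    Returns NaN for a constant record, where the statistic is undefined.
    """
    n = len(x)
    if n < 3:
        raise ValueError("need at least 3 intervals")
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    if denom == 0:
        return math.nan
    num = sum((x[i] - mean) * (x[i + 1] - mean) for i in range(n - 1))
    return num / denom


# Hypothetical records: 1 = behavior scored, 0 = not scored.
clustered = [0, 0, 0, 0, 1, 1, 1, 1]    # behavior occurs in a run
alternating = [0, 1, 0, 1, 0, 1, 0, 1]  # behavior flips every interval

r_clustered = lag1_autocorrelation(clustered)
r_alternating = lag1_autocorrelation(alternating)

print(f"clustered record:   r1 = {r_clustered:.3f}")
print(f"alternating record: r1 = {r_alternating:.3f}")
```

A value well away from zero in either direction suggests that neighboring intervals are linked, which is the situation the paper warns about.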
Re-score one recent session using every other interval and compare the IOA; if it changes noticeably, your data may be serially correlated.
Original abstract
Proposed methods of assessing the statistical significance of interobserver agreements provide erroneous probability values when conducted on serially correlated data. Investigators who wish to evaluate interobserver agreements by means of statistical significance can do so by limiting the analysis to every kth interval of data, or by using Markovian techniques which accommodate serial correlations.
Journal of Applied Behavior Analysis, 1982 · doi:10.1901/jaba.1982.15-189