On the not so recent invention of interobserver reliability: A commentary on two articles by Birkimer and Brown.
Interobserver reliability formulas are borrowed tools—credit the source and focus on solid practice, not pride of invention.
01 Research in Context
What this study did
Hartmann et al. (1979) wrote a short commentary responding to two new JABA articles about interobserver reliability. The authors wanted to show that these "new" stats are actually old tools from other fields.
They traced the history. They found Fisher's Exact Test and percent-agreement formulas in use decades before JABA began. The paper is a myth-buster, not an experiment.
What they found
The team showed that Fisher's Exact Test and simple agreement formulas existed long before behavior analysts claimed them. The tools came from statistics and psychology in the early twentieth century: R. A. Fisher described his Exact Probability Test in 1934, and methods for measuring agreement between observers had resurfaced regularly since at least the early 1940s.
Their main point: give credit where it is due, and stop treating basic stats as ABA inventions.
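The 2x2-table significance method at issue is Fisher's Exact Test, which can be computed directly from the hypergeometric distribution. Here is a minimal Python sketch (the function name and tolerance handling are my own, not from the paper):

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's Exact Test p-value for the 2x2 table
    [[a, b], [c, d]], treating all margins as fixed."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    denom = comb(n, col1)

    def p_table(x):
        # Hypergeometric probability of a table with x in the top-left
        # cell, given the fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = p_table(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme
    # (i.e., no more probable) than the one observed.
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs * (1 + 1e-9))

# Fisher's classic "lady tasting tea" table [[3, 1], [1, 3]]:
fisher_exact_2x2(3, 1, 1, 3)  # -> 34/70, about 0.486
```

The same result is available from `scipy.stats.fisher_exact` if SciPy is installed; the point is that the math is 1930s-vintage, not a behavioral invention.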
How this fits with other research
Jessel et al. (2020) back the warning. Their large-scale map of functional-analysis papers shows wide swings in how reliability is scored. The lack of a standard approach suggests Hartmann et al.'s call for historical honesty still matters.
Greer et al. (2020) and Fisher et al. (2016) seem to clash: one says the synthesized FA works fine; the other says it misses functions. Yet both studies lean on the same agreement formulas Hartmann et al. traced, showing the field keeps using these inherited stats without question.
Wolfe et al. (2026) extend the worry to visual analysis. They compare masked versus traditional graph judging and find only modest agreement. The outcome echoes Hartmann et al.: reliability methods need routine check-ups, not hero worship.
Why it matters
Next time you write "IOA calculated with 90% agreement," remember that the formula is older than JABA. Cite the method, but do not present it as a behavioral invention. Use the saved space to report how you trained observers, how often you checked agreement, and what the disagreements looked like. History says the math is solid; our job is to apply it carefully and transparently.
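That percent-agreement formula is simple enough to write down. A minimal sketch of interval-by-interval IOA, assuming each observer's record is a list of 0/1 occurrence codes, one per interval (the function name is illustrative):

```python
def interval_ioa(obs1, obs2):
    """Interval-by-interval percent agreement: the share of intervals
    where both observers scored the same code (occurrence or
    nonoccurrence), expressed as a percentage."""
    if len(obs1) != len(obs2):
        raise ValueError("records must cover the same intervals")
    agreements = sum(a == b for a, b in zip(obs1, obs2))
    return 100 * agreements / len(obs1)

# Two observers agree on 8 of 10 intervals:
interval_ioa([1, 1, 0, 1, 0, 0, 1, 0, 1, 1],
             [1, 0, 0, 1, 0, 1, 1, 0, 1, 1])  # -> 80.0
```

This is exactly the kind of decades-old arithmetic the commentary says we should credit to its sources rather than claim as new.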
Add a one-line history note in your IOA section: "Method predates JABA; see Hartmann et al., 1979."
03 Original abstract
The two articles by Birkimer and Brown (1979a, 1979b) are interesting additions to the bibliography of papers concerned with interobserver reliability published in the Journal of Applied Behavior Analysis. The growing length of this collection attests to the continuing interest of applied behavioral researchers in assessing and improving the quality of observational data. Unfortunately, by failing to reference or discuss parallel material outside the behavioral tradition, the authors tend to perpetuate the myth that scholarly concern with interobserver reliability coincides with the history of applied behavior analysis. An examination of the historical antecedents of the principal topics discussed by Birkimer and Brown (1979a, 1979b) may help to counter this myth. The topic of their first paper (1979a), methods of measuring interobserver reliability, has resurfaced with some regularity at least since the early 1940's (see Fleiss, 1975, for a review of methods of measuring agreement between two judges for occurrence-nonoccurrence data). The topic of their second paper (1979b), assessing the statistical significance of two-by-two tabled data, is based directly on the Exact Probability Test which was described by R. A. Fisher in 1934. For both topics dealt with by Birkimer and Brown, scholarly interest clearly precedes the inauguration of the Journal of Applied Behavior Analysis. But enough of history.
Journal of Applied Behavior Analysis, 1979 · doi:10.1901/jaba.1979.12-559