Reliability in the context of the experiment: A commentary on two articles by Birkimer and Brown.
Track observer agreement across the whole study and let the stability of those numbers decide whether your single-case data are trustworthy.
01 Research in Context
What this study did
Yelton (1979) wrote a short, sharp commentary on two earlier articles by Birkimer and Brown. He took their ideas about observer agreement and stretched them into a full system for judging data quality inside any single-case experiment. The paper is pure theory—no new data, just a blueprint for practitioners who want to know when their numbers are solid enough to trust.
What they found
The core message is simple: stop treating observer agreement as a one-time checkbox. Instead, use those agreement scores to measure point-by-point variability across the whole study. High, stable agreement means the data picture is clear; dips or jumps warn you the picture is blurry. If the variability is too wild, the experiment is not ready for a final call.
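To make that concrete, here is a minimal Python sketch of continuous agreement tracking: interval-by-interval percentage agreement is computed for every session, and sessions that dip below a threshold get flagged. The session records and the 80% warning line are invented for illustration; they are not values from Yelton (1979).

```python
# Minimal sketch: track interval-by-interval observer agreement across every
# session of a study and flag sessions where agreement dips. The session
# records and the 80% warning threshold are illustrative assumptions, not
# values from Yelton (1979).

def percent_agreement(obs1, obs2):
    """Interval-by-interval percentage agreement between two observers."""
    matches = sum(a == b for a, b in zip(obs1, obs2))
    return 100.0 * matches / len(obs1)

# Each session: parallel interval records (1 = behavior scored, 0 = not scored).
sessions = {
    1: ([1, 0, 1, 1, 0, 1, 0, 0], [1, 0, 1, 1, 0, 1, 0, 0]),
    2: ([1, 1, 0, 1, 0, 0, 1, 0], [1, 0, 0, 1, 0, 0, 1, 0]),
    3: ([0, 1, 1, 0, 1, 1, 0, 1], [1, 1, 0, 0, 1, 0, 0, 1]),
}

agreement_by_session = {s: percent_agreement(o1, o2) for s, (o1, o2) in sessions.items()}

for session, pct in agreement_by_session.items():
    flag = "  <-- below threshold, data picture is blurry here" if pct < 80 else ""
    print(f"Session {session}: {pct:.0f}% agreement{flag}")
```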
How this fits with other research
Rider (1977) set the table two years earlier by mapping two basic ways to calculate reliability—percentage agreement for trial-level checks and correlational indices for session-level trends. Yelton (1979) keeps both tools but adds the new rule: track them continuously, not just at the start.
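A hedged sketch of those two families of indices, using made-up observer records: trial-by-trial percentage agreement within one session, and a Pearson correlation between the two observers' session totals across the study. The specific data and the use of Python's statistics.correlation are illustrative choices, not Rider's procedure.

```python
# Sketch of the two reliability families Rider (1977) distinguishes:
# trial-level percentage agreement and a session-level correlational index.
# The observer records below are invented for illustration.
from statistics import correlation  # requires Python 3.10+

def percent_agreement(obs1, obs2):
    """Trial-by-trial percentage agreement."""
    return 100.0 * sum(a == b for a, b in zip(obs1, obs2)) / len(obs1)

# One session of trial-level records from two observers.
trials_obs1 = [1, 0, 1, 1, 0, 1]
trials_obs2 = [1, 0, 1, 0, 0, 1]
print(f"Trial-level agreement: {percent_agreement(trials_obs1, trials_obs2):.0f}%")

# Session totals across the study for both observers: a correlational index
# tells you whether the observers track the same session-to-session trend.
totals_obs1 = [12, 15, 9, 20, 17, 11]
totals_obs2 = [11, 16, 10, 19, 18, 12]
print(f"Session-level correlation (Pearson r): {correlation(totals_obs1, totals_obs2):.2f}")
```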
Cook et al. (2020) give a modern example of the same spirit. They show how to slip brief duration probes into momentary time-sampling sessions so you can spot measurement drift while the study runs—exactly the kind of live quality check Yelton (1979) was asking for.
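The sketch below only illustrates the general idea, not Cook et al.'s actual procedure: a momentary-time-sampling estimate for a session is compared against a brief continuous-duration probe of the same session, and a large gap is treated as a drift warning. The interval length, probe values, and 15-point criterion are assumptions.

```python
# Rough sketch of the idea behind Cook et al. (2020): alongside momentary
# time sampling (MTS), run an occasional continuous-duration probe and compare
# the two estimates so measurement drift shows up while the study is running.
# The interval length, probe data, and 15-point drift threshold are
# illustrative assumptions, not Cook et al.'s actual procedure.

def mts_estimate(momentary_checks):
    """Percent of intervals in which the behavior was occurring at the check."""
    return 100.0 * sum(momentary_checks) / len(momentary_checks)

def duration_estimate(seconds_engaged, session_seconds):
    """Percent of the session the behavior actually occupied, from a timed probe."""
    return 100.0 * seconds_engaged / session_seconds

# One session: 1 = behavior observed at the 30-s check, 0 = not observed.
checks = [1, 1, 0, 1, 0, 0, 1, 1, 1, 0]   # 10 checks -> 5-minute session
mts_pct = mts_estimate(checks)             # 60%

# Brief duration probe run on the same session: 150 s engaged out of 300 s.
probe_pct = duration_estimate(150, 300)    # 50%

drift = abs(mts_pct - probe_pct)
print(f"MTS estimate: {mts_pct:.0f}%, duration probe: {probe_pct:.0f}%, gap: {drift:.0f} points")
if drift > 15:
    print("Gap exceeds the (assumed) 15-point criterion -- check the measurement system.")
```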
Wilder et al. (2023) push the idea beyond observer agreement. They argue that procedural-fidelity percentages hide useful rate information, just like raw agreement percentages can hide variability. Both papers say the same thing: add a rate metric if you want to see the real story.
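A toy illustration of why a rate metric matters, with invented numbers rather than anything from Wilder et al. (2023): two sessions share the same fidelity percentage but differ five-fold in implementation errors per minute.

```python
# Sketch of the point made by Wilder et al. (2023): a fidelity percentage
# can hide rate information. Two sessions with the same percentage of
# correctly implemented steps can involve very different error rates once
# session length is taken into account. The numbers are invented.

def fidelity_percent(correct_steps, total_steps):
    return 100.0 * correct_steps / total_steps

def error_rate_per_min(errors, session_minutes):
    return errors / session_minutes

# Session A: 18 of 20 steps correct in 10 minutes.
# Session B: 90 of 100 steps correct in 10 minutes.
for label, correct, total, minutes in [("A", 18, 20, 10), ("B", 90, 100, 10)]:
    errors = total - correct
    print(f"Session {label}: fidelity {fidelity_percent(correct, total):.0f}%, "
          f"errors/min {error_rate_per_min(errors, minutes):.1f}")
# Both sessions read 90% fidelity, but Session B exposes the client to five
# times as many implementation errors per minute.
```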
Sasson et al. (2018) close the loop by giving a ready-made reporting scaffold. They turn the old plea for transparent data into fill-in-the-blank language you can drop into your next single-case paper so meta-analysts can judge adequacy without guessing.
Why it matters
Next time you graph a client’s data, pick three random sessions and run fresh agreement checks. If the new numbers match your original reliability file and stay flat across sessions, you can rest easy about those data. If they wobble, retrain observers or collect more data before you claim victory. It takes ten minutes and saves your reputation.
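A rough sketch of that spot check, assuming you keep per-session agreement values in a reliability file; the data, the 80% floor, and the 10-point tolerance for "matching" the original file are illustrative assumptions.

```python
# Sketch of the spot check described above: re-score three randomly chosen
# sessions, compute fresh agreement, and compare against the agreement values
# in the original reliability file. The data, the 80% floor, and the
# 10-point "match" tolerance are illustrative assumptions.
import random

def percent_agreement(obs1, obs2):
    return 100.0 * sum(a == b for a, b in zip(obs1, obs2)) / len(obs1)

# Original per-session agreement (%) from the reliability file.
original_ioa = {1: 95, 2: 90, 3: 88, 4: 92, 5: 94, 6: 91}

# Fresh interval records from the re-scored sessions (primary, secondary observer).
rescored = {
    2: ([1, 0, 1, 1, 0, 1, 0, 1], [1, 0, 1, 1, 0, 1, 0, 1]),
    4: ([0, 1, 1, 0, 1, 0, 1, 1], [0, 1, 1, 0, 1, 0, 0, 1]),
    5: ([1, 1, 0, 0, 1, 1, 0, 0], [1, 1, 0, 1, 1, 1, 0, 0]),
}

sampled = random.sample(sorted(rescored), k=3)
for s in sampled:
    fresh = percent_agreement(*rescored[s])
    drift = abs(fresh - original_ioa[s])
    verdict = "OK" if fresh >= 80 and drift <= 10 else "retrain observers / collect more data"
    print(f"Session {s}: fresh IOA {fresh:.0f}% vs file {original_ioa[s]}% -> {verdict}")
```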
Add a standing column to your data sheet labeled ‘IOA tonight’ and calculate agreement on the last three data points before every team meeting.
02 At a glance
03 Original abstract
Two sources of variability must each be considered when examining change in level between two sets of data obtained by human observers; namely, variance within data sets (phases) and variability attributed to each data point (reliability). Birkimer and Brown (1979a, 1979b) have suggested that both chance levels and disagreement bands be considered in examining observer reliability and have made both methods more accessible to researchers. By clarifying and extending Birkimer and Brown's papers, a system is developed using observer agreement to determine the data point variability and thus to check the adequacy of obtained data within the experimental context.
Journal of Applied Behavior Analysis, 1979 · doi:10.1901/jaba.1979.12-565