Artifact, bias, and complexity of assessment: the ABCs of reliability.
Observer drift and expectancies can fake high agreement; fight them with blind coding, simple definitions, and booster training.
Research in Context
What this study did
Kazdin (1977) wrote a narrative review. He listed ways observer agreement can go wrong. The paper is a warning, not an experiment.
He grouped the problems into five buckets: reactivity, drift, complex codes, expectancies, and feedback.
What they found
The review found that small changes in how you train, watch, or coach observers can fake high agreement.
For example, telling coders the study 'should work' can nudge them to see what you hope to see.
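Chance alone can also fake high agreement: when a behavior is rare, two coders who mostly score "absent" will match most intervals no matter how carelessly they watch. A minimal sketch (illustrative made-up data, not from the paper) comparing raw interval-by-interval percent agreement with Cohen's kappa, which corrects for chance:

```python
# Illustrative only: two observers score 20 intervals (1 = behavior occurred).
# The behavior is rare, so agreeing on "absent" inflates raw agreement.
obs_a = [0] * 17 + [1, 0, 1]
obs_b = [0] * 18 + [1, 1]

n = len(obs_a)
raw = sum(a == b for a, b in zip(obs_a, obs_b)) / n  # percent agreement

# Cohen's kappa: subtract the agreement expected by chance alone.
p_a1 = sum(obs_a) / n          # rate at which A scores "occurred"
p_b1 = sum(obs_b) / n          # rate at which B scores "occurred"
p_chance = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
kappa = (raw - p_chance) / (1 - p_chance)

print(f"raw agreement = {raw:.2f}, kappa = {kappa:.2f}")
# raw agreement = 0.90, kappa = 0.44
```

The coders "agree" 90% of the time, yet kappa shows less than half of that agreement is beyond chance. This is the kind of spuriously comforting number the review warns about.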
How this fits with other research
Stolz (1977) counted real JABA articles from 1968-1975. Only half checked reliability in every condition. Kazdin's warning matched the messy data Stolz found.
Matson et al. (1989) later showed raters could not agree on the 'semantic base' of text. Their near-zero agreement is a live demo of the observer drift Kazdin described.
Spanoudis et al. (2011) skipped traditional agreement. They calibrated observers against gold-standard videos. This successor method fixes the very biases Kazdin flagged.
Lancioni et al. (2008) and Ford et al. (2020) revisited visual inspection. Both found moderate or high agreement, depending on the graph set. The mixed results echo Kazdin's point: reliability is fragile, not a given.
Why it matters
Check for drift every few sessions. Keep codes short and clear. Mask the study aim from coders. Re-train after long breaks. These 1977 tips still save today’s data from silent bias.
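The drift check can be made concrete: every few sessions, have the coder score a short criterion clip (a gold-standard video in the spirit of Spanoudis et al.) and flag when their match with the key dips. A hypothetical sketch, assuming interval data and an 80% retraining threshold; the key and records below are invented for illustration:

```python
# Hypothetical drift check: compare a coder's interval record for a
# criterion video against the gold-standard key every few sessions.

GOLD_KEY = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]  # made-up criterion scoring

def drift_check(coder_record, threshold=0.80):
    """Return (agreement, needs_retraining) against the criterion key."""
    matches = sum(c == g for c, g in zip(coder_record, GOLD_KEY))
    agreement = matches / len(GOLD_KEY)
    return agreement, agreement < threshold

# Session 1: coder is well calibrated.
print(drift_check([1, 0, 0, 1, 1, 0, 0, 0, 1, 0]))  # (1.0, False)

# Session 6: the working definition has quietly drifted.
print(drift_check([1, 1, 0, 1, 0, 0, 1, 0, 1, 1]))  # (0.6, True)
```

Five minutes of this kind of recalibration catches drift before it silently reshapes a whole phase of data.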
Pick one behavior code you use, cut the definition to three clear bullets, and schedule a five-minute recalibration with your RBT before the next session.
Original abstract
Interobserver agreement (also referred to here as "reliability") is influenced by diverse sources of artifact, bias, and complexity of the assessment procedures. The literature on reliability assessment frequently has focused on the different methods of computing reliability and the circumstances under which these methods are appropriate. Yet, the credence accorded estimates of interobserver agreement, computed by any method, presupposes eliminating sources of bias that can spuriously affect agreement. The present paper reviews evidence pertaining to various sources of artifact and bias, as well as characteristics of assessment that influence interpretation of interobserver agreement. These include reactivity of reliability assessment, observer drift, complexity of response codes and behavioral observations, observer expectancies and feedback, and others. Recommendations are provided for eliminating or minimizing the influence of these factors from interobserver agreement.
Journal of Applied Behavior Analysis, 1977 · doi:10.1901/jaba.1977.10-141