Assessment & Research

Artifact, bias, and complexity of assessment: the ABCs of reliability.

Kazdin (1977) · Journal of Applied Behavior Analysis
★ The Verdict

Observer drift and expectancies can fake high agreement—fight them with blind coding, simple definitions, and booster training.

✓ Read this if you're a BCBA who collects live observation data in clinics or schools.
✗ Skip if you're a practitioner who relies only on permanent product or sensor data.

01 Research in Context

01

What this study did

Kazdin (1977) wrote a narrative review. He listed ways observer agreement can go wrong. The paper is a warning, not an experiment.

He grouped the problems into five buckets: reactivity, drift, complex codes, expectancies, and feedback.

02

What they found

The review found that small changes in how you train, watch, or coach observers can fake high agreement.

For example, telling coders the study 'should work' can nudge them to see what you hope to see.
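To make "fake high agreement" concrete, here is a minimal Python sketch, not from the paper: it computes interval-by-interval IOA for two observer records generated independently at a high base rate. The 90% base rate, 60-interval session, and 80% benchmark are illustrative assumptions, but they show how an agreement figure can look reassuring without reflecting any shared measurement.

```python
import random

def interval_ioa(obs_a, obs_b):
    """Interval-by-interval IOA: percent of intervals where both records match."""
    agreements = sum(a == b for a, b in zip(obs_a, obs_b))
    return 100.0 * agreements / len(obs_a)

random.seed(1)
n_intervals = 60
base_rate = 0.9  # behavior scored in ~90% of intervals (illustrative assumption)

# Two observers scoring independently at random at the same high base rate:
# no shared definitions, no real measurement, yet agreement can still look "good".
obs_a = [int(random.random() < base_rate) for _ in range(n_intervals)]
obs_b = [int(random.random() < base_rate) for _ in range(n_intervals)]

# Expected agreement by chance alone is p^2 + (1 - p)^2 = 0.82, about 82%,
# which already clears the common 80% rule of thumb.
print(f"Chance-only interval IOA: {interval_ioa(obs_a, obs_b):.0f}%")
```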

03

How this fits with other research

Stolz (1977) counted real JABA articles from 1968 to 1975. Only half checked reliability in every condition. Kazdin's warning matched the messy reporting Stolz found.

Matson et al. (1989) later showed raters could not agree on the 'semantic base' of text. Their near-zero agreement is a live demo of the agreement problems Kazdin described.

Spanoudis et al. (2011) skipped traditional agreement checks and instead calibrated observers against gold-standard videos. This successor method targets the very biases Kazdin flagged.

Lancioni et al. (2008) and Ford et al. (2020) revisited visual inspection. Both found moderate or high agreement, depending on the graph set. The mixed results echo Kazdin's point: reliability is fragile, not a given.

04

Why it matters

Check for drift every few sessions. Keep codes short and clear. Mask the study aim from coders. Re-train after long breaks. These 1977 tips still save today’s data from silent bias.
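To make the drift check concrete, here is a minimal sketch, assuming interval-recorded (0/1) data scored by a primary observer and a blind second observer on selected sessions. The session records and the 80% criterion are placeholder assumptions, not values from Kazdin (1977).

```python
# Hypothetical drift check: flag reliability sessions whose IOA drops below a criterion.
CRITERION = 80.0  # common rule-of-thumb IOA criterion; adjust to your own standard

def interval_ioa(primary, reliability):
    """Interval-by-interval IOA: percent of intervals where both observers match."""
    matches = sum(p == r for p, r in zip(primary, reliability))
    return 100.0 * matches / len(primary)

# Each entry: (session label, primary observer's 0/1 intervals, blind second observer's intervals)
sessions = [
    ("Session 3", [1, 0, 1, 1, 0, 1, 1, 0], [1, 0, 1, 1, 0, 1, 1, 0]),
    ("Session 6", [1, 1, 0, 1, 0, 0, 1, 1], [1, 1, 0, 1, 1, 0, 1, 1]),
    ("Session 9", [0, 1, 1, 0, 1, 1, 0, 1], [1, 1, 0, 0, 1, 0, 0, 0]),
]

for label, primary, reliability in sessions:
    ioa = interval_ioa(primary, reliability)
    flag = "  <-- below criterion: schedule booster training" if ioa < CRITERION else ""
    print(f"{label}: IOA = {ioa:.0f}%{flag}")
```

Any session that falls below the criterion is the one to pair with a quick booster training before more data are collected.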

→ Action — try this Monday

Pick one behavior code you use, cut the definition to three clear bullets, and schedule a five-minute recalibration with your RBT before the next session.

02 At a glance

Intervention: not applicable
Design: narrative review
Finding: not reported

03 Original abstract

Interobserver agreement (also referred to here as "reliability") is influenced by diverse sources of artifact, bias, and complexity of the assessment procedures. The literature on reliability assessment frequently has focused on the different methods of computing reliability and the circumstances under which these methods are appropriate. Yet, the credence accorded estimates of interobserver agreement, computed by any method, presupposes eliminating sources of bias that can spuriously affect agreement. The present paper reviews evidence pertaining to various sources of artifact and bias, as well as characteristics of assessment that influence interpretation of interobserver agreement. These include reactivity of reliability assessment, observer drift, complexity of response codes and behavioral observations, observer expectancies and feedback, and others. Recommendations are provided for eliminating or minimizing the influence of these factors from interobserver agreement.

Journal of Applied Behavior Analysis, 1977 · doi:10.1901/jaba.1977.10-141