Assessment & Research

Interrater Agreement on the Visual Analysis of Individual Tiers and Functional Relations in Multiple Baseline Designs.

Wolfe et al. (2016) · Behavior Modification, 2016
★ The Verdict

Even experts barely agree when eyeballing multiple-baseline graphs, so pair every visual call with a number-based check.

✓ Read this if you're a BCBA who relies on visual analysis to make treatment decisions.
✗ Skip if you're a practitioner already applying quantitative decision rules to every graph.

01 Research in Context

01

What this study did

Wolfe et al. (2016) asked 52 expert BCBAs to evaluate 31 multiple-baseline graphs. Each rater judged both the individual tiers (the AB comparisons) and the overall functional relation.

The survey was conducted by email; raters worked independently and received no additional training.

02

What they found

Agreement landed in the 'barely adequate' zone. Experts often disagreed on whether a tier showed an effect and whether the whole graph proved a functional relation.

The study's finding is labeled inconclusive because agreement this low undercuts confidence in visual calls made on their own.
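"Barely adequate" here refers to interrater agreement statistics. As a point of reference, here is a minimal sketch of Cohen's kappa, one common chance-corrected agreement index (illustrative only; the function name is hypothetical and the paper's exact statistic may differ):

```python
def cohens_kappa(rater1, rater2):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater1)
    # Proportion of items on which the two raters gave the same judgment
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    labels = set(rater1) | set(rater2)
    # Agreement expected by chance, from each rater's marginal label rates
    expected = sum((rater1.count(lab) / n) * (rater2.count(lab) / n)
                   for lab in labels)
    if expected == 1:  # both raters constant and identical
        return 1.0
    return (observed - expected) / (1 - expected)

# Two raters judging four tiers as showing an effect ("yes") or not ("no"):
# they agree on half the tiers, which is exactly chance level here.
print(cohens_kappa(["yes", "yes", "no", "no"],
                   ["yes", "no", "no", "yes"]))  # → 0.0
```

A kappa of 0 means agreement no better than chance; values near 1 indicate strong agreement, which is why raw percent agreement alone can overstate how well raters align.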

03

How this fits with other research

Kahng et al. (2010) looked like good news: experts reached high agreement on single-case graphs. The 2016 paper flips that outcome, but the difference is not a mistake. Kahng et al. used graphs with clearer trends and gave raters brief training, while Wolfe et al. used raw graphs with no warm-up, so real-world noise showed through.

Diller et al. (2016) ran the same kind of survey on multielement graphs the same year. They also got inconclusive agreement, showing the problem is not tied to one design.

Wolfe et al. (2023) went further and tested which graph features hurt agreement most. Steep trends and large effect sizes swung raters the most, helping explain why the experts in the 2016 study split.

04

Why it matters

Visual inspection on its own is not enough. Pair it with a quantitative aid such as the conservative dual-criterion (CDC) method or a GLMM check, and share the numeric rule with your team so everyone applies the same lens. This small step turns a soft "looks like a change" into a defensible decision for parents, funders, and peer reviewers.

→ Action — try this Monday

Apply the conservative dual-criterion to your current multiple-baseline graph and compare the result with your visual call.
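The CDC check above can be sketched in a few lines. This is a minimal illustration of the published CDC logic, not a validated clinical tool: shift the baseline mean and trend lines by 0.25 baseline SDs in the expected direction, then require a binomial-test (p = .5) number of treatment points beyond both lines. The function name and return format are my own choices.

```python
import statistics
from math import comb

def cdc_effect(baseline, treatment, direction="increase", alpha=0.05):
    """Conservative dual-criterion sketch for one AB tier.

    Returns (effect_detected, points_beyond, points_required).
    """
    n = len(baseline)
    mean = statistics.fmean(baseline)
    sd = statistics.stdev(baseline)
    # Ordinary least-squares trend fitted to the baseline phase
    xs = range(n)
    x_mean = statistics.fmean(xs)
    slope = (sum((x - x_mean) * (y - mean) for x, y in zip(xs, baseline))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = mean - slope * x_mean
    # Shift both criterion lines 0.25 SD in the expected direction
    shift = 0.25 * sd if direction == "increase" else -0.25 * sd

    beyond = 0
    for i, y in enumerate(treatment, start=n):
        mean_line = mean + shift
        trend_line = intercept + slope * i + shift
        if direction == "increase" and y > mean_line and y > trend_line:
            beyond += 1
        if direction == "decrease" and y < mean_line and y < trend_line:
            beyond += 1

    m = len(treatment)
    # Smallest count k whose upper binomial tail (p = .5) falls below alpha
    required = next((k for k in range(m + 1)
                     if sum(comb(m, j) for j in range(k, m + 1)) / 2 ** m < alpha),
                    m + 1)  # unattainable with so few treatment points
    return beyond >= required, beyond, required

# A flat, low baseline followed by a clear jump: all five treatment
# points clear both shifted lines, meeting the binomial criterion.
print(cdc_effect([2, 3, 2, 3, 2], [8, 9, 10, 9, 8]))
```

Running this on your own tier data and comparing the boolean result with your visual call is exactly the cross-check the action step describes.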

02 At a glance

Intervention
not applicable
Design
survey
Sample size
52
Finding
inconclusive

03 Original abstract

Previous research on visual analysis has reported low levels of interrater agreement. However, many of these studies have methodological limitations (e.g., use of AB designs, undefined judgment task) that may have negatively influenced agreement. Our primary purpose was to evaluate whether agreement would be higher than previously reported if we addressed these weaknesses. Our secondary purposes were to investigate agreement at the tier level (i.e., the AB comparison) and at the functional relation level in multiple baseline designs and to examine the relationship between raters' decisions at each of these levels. We asked experts (N = 52) to make judgments about changes in the dependent variable in individual tiers and about the presence of an overall functional relation in 31 multiple baseline graphs. Our results indicate that interrater agreement was just at or just below minimally adequate levels for both types of decisions and that agreement at the individual tier level often resulted in agreement about the overall functional relation. We report additional findings and discuss implications for practice and future research.

Behavior Modification, 2016 · doi:10.1177/0145445516644699