Visual Analysis Agreement and Reliability: 155 Studies Reviewed

Key Findings

What 155 articles tell us

Axis scaling on graphs can shift an analyst's perception of effect size without changing the underlying data at all.
Moving averages applied to daily behavior data can reveal cyclical patterns that are invisible to standard visual inspection.
Machine learning models can reliably identify differentiated effects in alternating treatment graphs, matching trained analyst performance.
Alternative agreement metrics like precision, recall, and F1 detect disagreement that raw percentage agreement scores conceal, especially at ceiling or floor.
Masked visual analysis produces similar reliability to traditional visual analysis and can serve as a safeguard against confirmation bias.

Frequently Asked Questions

Common questions from BCBAs and RBTs

Graphs with high variability, unusual axis scaling, or unclear phase changes make agreement harder. Research shows that even trained analysts disagree often, especially on graphs from functional analyses. Adding a numeric reliability check helps reduce this problem.

Interobserver agreement (IOA) measures how often two data collectors record the same thing at the same time. High IOA suggests the observers are recording consistently, which supports the reliability of your data. Low IOA suggests the behavior is being defined or recorded differently, which can make your intervention decisions unreliable.
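As a rough illustration of the idea, the sketch below computes interval-by-interval IOA, one common way to score agreement: the share of intervals in which both observers recorded the same thing. The function name `interval_ioa` and the two records are hypothetical, made up for this example, and are not taken from the reviewed studies.

```python
def interval_ioa(observer_a, observer_b):
    """Return the percentage of intervals in which two observers' records match."""
    if len(observer_a) != len(observer_b):
        raise ValueError("Both observers must score the same number of intervals")
    if not observer_a:
        raise ValueError("At least one interval is required")
    # Count the intervals where both observers recorded the same value.
    agreements = sum(a == b for a, b in zip(observer_a, observer_b))
    return 100.0 * agreements / len(observer_a)


# Hypothetical records: 1 = behavior scored in the interval, 0 = not scored.
observer_a = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
observer_b = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]

ioa = interval_ioa(observer_a, observer_b)
print(f"Interval-by-interval IOA: {ioa:.1f}%")
```

Here the observers differ in one of ten intervals, so the score is 90%. Other IOA formulas exist, and the right one depends on how the behavior is measured.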

Masked visual analysis means the analyst reviews a graph without knowing which phase is baseline and which is treatment. This removes expectation bias. Research shows it produces similar reliability to standard analysis and can catch cases where analysts are being influenced by what they hope to see.

A moving average smooths out day-to-day variability by averaging a few data points at a time. This helps you spot longer trends and cyclical patterns that are hidden by normal noise in daily data.
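A minimal sketch of a simple (unweighted) moving average is shown below. The window size of 3 and the daily counts are assumptions chosen for illustration; the reviewed research may have used a different window or a different smoothing method.

```python
def moving_average(values, window=3):
    """Return the simple moving average of `values` using a fixed window."""
    if window < 1:
        raise ValueError("window must be at least 1")
    if window > len(values):
        return []  # Not enough data points to fill a single window.
    # Average each run of `window` consecutive data points.
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical daily counts of a target behavior (made-up data).
daily_counts = [3, 6, 3, 6, 9, 6, 9, 12, 9]

smoothed = moving_average(daily_counts, window=3)
print(smoothed)
```

The raw counts bounce up and down from day to day, while the smoothed series rises steadily, which makes the underlying trend easier to see. A larger window smooths more but also hides short-lived changes.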

Use F1 or precision and recall when the behavior is rare or when occurrence and non-occurrence rates are very unequal. In those cases percentage agreement can be artificially high even when analysts disagree about the actual occurrences.
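The sketch below shows how this can happen with a rare behavior. It assumes one record is treated as the reference and the other is compared against it, which is a simplification made for this example. The function name `agreement_metrics` and the data are hypothetical.

```python
def agreement_metrics(reference, comparison):
    """Compare two binary records, treating `reference` as the standard."""
    if len(reference) != len(comparison) or not reference:
        raise ValueError("Records must be non-empty and the same length")
    pairs = list(zip(reference, comparison))
    tp = sum(r == 1 and c == 1 for r, c in pairs)  # both scored an occurrence
    tn = sum(r == 0 and c == 0 for r, c in pairs)  # both scored a non-occurrence
    fp = sum(r == 0 and c == 1 for r, c in pairs)  # only the comparison scored it
    fn = sum(r == 1 and c == 0 for r, c in pairs)  # only the reference scored it
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "percent_agreement": 100.0 * (tp + tn) / len(reference),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical data: 20 intervals, behavior scored only twice by each analyst.
reference = [0] * 20
comparison = [0] * 20
reference[2] = reference[11] = 1
comparison[2] = comparison[15] = 1

metrics = agreement_metrics(reference, comparison)
print(metrics)
```

In this made-up case the analysts agree on 18 of 20 intervals (90%), but they agree on only one of the occurrences either of them scored, so precision, recall, and F1 are all 0.5. The many shared non-occurrences inflate the percentage score.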