Assessment & Research

Being reliable: issues in determining the reliability and making sense of observations of adults with congenital deafblindness?

Prain et al. (2012) · Journal of Intellectual Disability Research (JIDR)
★ The Verdict

Percent agreement can fake good reliability on rare behaviors—always pair it with Cohen's kappa or consensus coding.

✓ Read this if you're a BCBA who collects or supervises observational data on clients with severe or multiple disabilities.
✗ Skip if you rely only on trial-by-trial automated measures (e.g., computer-collected response counts).

01 Research in Context

01

What this study did

Prain et al. (2012) looked at how we check whether two observers agree when they watch adults who are both deaf and blind. The team compared different formulas for scoring agreement between observers. They wanted to see which formula gives an honest picture when the behavior is rare or subtle.

The paper is a methods guide, not an experiment. It shows how the same video clips can get very different 'reliability' scores depending on the formula picked.
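To make that concrete, here is a minimal sketch in Python. The abstract says percentage agreement was calculated three ways; the specific variants below (overall, occurrence-only, nonoccurrence-only) and the interval data are our illustrative assumptions, not Prain et al.'s coding tool or numbers.

```python
from sklearn.metrics import cohen_kappa_score  # pip install scikit-learn

# Invented interval-by-interval codes (1 = behavior seen, 0 = not seen)
# from two observers watching the same 100 intervals of one clip.
obs_a = [1, 1, 1, 1] + [0] * 96
obs_b = [1, 1, 0, 0] + [0] * 6 + [1] + [0] * 89

both_occur = sum(a == b == 1 for a, b in zip(obs_a, obs_b))   # 2
both_absent = sum(a == b == 0 for a, b in zip(obs_a, obs_b))  # 95
disagree = sum(a != b for a, b in zip(obs_a, obs_b))          # 3

overall = (both_occur + both_absent) / len(obs_a)       # 0.97 -- looks great
occurrence = both_occur / (both_occur + disagree)       # 0.40 -- looks poor
nonoccurrence = both_absent / (both_absent + disagree)  # 0.97

kappa = cohen_kappa_score(obs_a, obs_b)  # ~0.56 -- chance-corrected

print(f"overall={overall:.2f}  occurrence={occurrence:.2f}  "
      f"nonoccurrence={nonoccurrence:.2f}  kappa={kappa:.2f}")
```

Same records, four verdicts. Overall agreement hits 97% only because both observers code "absent" almost everywhere; kappa discounts that chance agreement (expected agreement here is about 0.93) and drops to roughly 0.56.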

02

What they found

Simple percent agreement made the data look great even when observers disagreed on most of the actual occurrences, because both coders marked the behavior as absent in nearly every interval. Cohen's kappa told a truer, lower story. The authors warn that percent agreement can hide poor reliability when behaviors happen only once in a while.

They recommend always reporting Cohen's kappa and, when possible, having observers reach consensus before final coding.
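Consensus coding is a procedure rather than a statistic: both observers code independently, agreements stand, and every disagreement goes to a discussion step before the record is locked. A minimal sketch of that workflow, with all names and the tie-break rule invented for illustration:

```python
def consensus_code(codes_a, codes_b, resolve):
    """Merge two independent coding passes: agreements stand, and each
    disagreement goes to `resolve`, a stand-in for the team's discussion
    step (re-watch the clip together, then settle on one code)."""
    return [a if a == b else resolve(i, a, b)
            for i, (a, b) in enumerate(zip(codes_a, codes_b))]

# Toy usage: on disagreement, default to "no occurrence" pending discussion.
draft = consensus_code([1, 1, 0, 0], [1, 0, 0, 1], lambda i, a, b: 0)
print(draft)  # [1, 0, 0, 0]
```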

03

How this fits with other research

Valluripalli Soorya et al. (2025) later showed the same danger in kids with severe intellectual disability from rare genetic syndromes. Standard autism tools failed because the kids' behaviors were too infrequent, backing up Prain et al.'s warning that rare events need tougher reliability checks.

Vos et al. (2013) echo the message from a different angle. Their methodology paper tells analysts to pick Yule's Q instead of simpler transitional probabilities when quantifying contingencies in sparse data. Both papers push the field to use more sensitive stats.
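A quick illustration of why that choice matters, using invented tallies rather than Vos et al.'s data: when a behavior is frequent, a transitional probability can simply echo the base rate, while Yule's Q, computed from the full 2×2 table, exposes the missing contingency.

```python
# Invented 2x2 event tallies from a coded interaction stream:
#   a = staff bid, then client response      b = staff bid, no response
#   c = no bid, but client response anyway   d = no bid, no response
a, b, c, d = 8, 2, 80, 20

transitional_p = a / (a + b)                 # 0.80 -- looks like a strong link
yules_q = (a * d - b * c) / (a * d + b * c)  # 0.00 -- no contingency at all

print(f"p(response | bid) = {transitional_p:.2f}, Yule's Q = {yules_q:.2f}")
```

Here the client responds 80% of the time even without a staff bid (c / (c + d) = 0.80), so the impressive-looking transitional probability is pure base rate; Q compares the odds across both rows and correctly returns zero.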

Mount et al. (2011) make a similar advance for motor data. They show random-coefficient modeling catches individual variability that regular GLM misses, just as Cohen's kappa catches observer drift that percent agreement hides.
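For readers who want to see that modeling difference, here is a hedged sketch using statsmodels' MixedLM. The simulated motor scores and every parameter value are invented, and Mount et al.'s actual specification may differ:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate 12 people whose motor scores share an average trend but differ
# in baseline and in how fast they change -- exactly the individual
# variability a single pooled regression line averages away.
rng = np.random.default_rng(0)
rows = []
for person in range(12):
    baseline = rng.normal(10, 2)   # person-specific intercept
    trend = rng.normal(0.5, 0.4)   # person-specific slope
    for week in range(8):
        rows.append({"person": person, "week": week,
                     "score": baseline + trend * week + rng.normal(0, 1)})
df = pd.DataFrame(rows)

# Random-coefficient model: random intercept AND random slope per person.
model = smf.mixedlm("score ~ week", df, groups=df["person"], re_formula="~week")
print(model.fit().summary())
```

The `re_formula="~week"` argument is what buys the per-person slopes; drop it and the fit collapses back toward the single average trend an ordinary GLM would report.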

04

Why it matters

Before you trust any observation log, demand both percent agreement and Cohen's kappa. If kappa is low even when agreement looks high, retrain observers or hold a consensus meeting. This single step keeps treatment decisions, functional analyses, and publication claims honest.

→ Action — try this Monday

Add a second observer to score five random sessions, then calculate both percent agreement and Cohen's kappa—retain data only if kappa ≥ 0.80.
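A small helper for that Monday check, assuming interval-coded data and scikit-learn; the function name is ours, and the 0.80 floor comes from the action item above:

```python
from sklearn.metrics import cohen_kappa_score

def session_passes(obs_a, obs_b, kappa_floor=0.80):
    """Compare two observers' codes for one session; keep the data only
    when kappa clears the floor, however high raw agreement looks."""
    agreement = sum(a == b for a, b in zip(obs_a, obs_b)) / len(obs_a)
    kappa = cohen_kappa_score(obs_a, obs_b)
    return agreement, kappa, kappa >= kappa_floor

# Example: the rare-behavior session from earlier fails despite 97% agreement.
agreement, kappa, keep = session_passes([1] * 4 + [0] * 96,
                                        [1] * 2 + [0] * 8 + [1] + [0] * 89)
print(f"agreement={agreement:.2f} kappa={kappa:.2f} keep={keep}")
```

Run it once per sampled session; a session that posts 97% agreement but a kappa of 0.56 gets flagged for retraining rather than filed.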

02 At a glance

Intervention: not applicable
Design: methodology paper
Population: intellectual disability, other
Finding: not reported

03 Original abstract

BACKGROUND: Most research into interactions with people who are congenitally deafblind involves observational data. In order for practitioners and researchers to have confidence in the findings of observational studies, researchers need to demonstrate that the processes employed are replicable and trustworthy. This paper draws on data from an observational study of adults with congenital deafblindness to illustrate issues in determining inter-rater reliability, and interpreting observational data. METHOD: Data from 34 10-min observations of adults with congenital deafblindness and their interactions with support staff were assessed for inter-rater reliability using percentage agreement calculated in three different ways and Cohen's κ. RESULTS: Large variation resulted from the different ways in which inter-rater reliability was calculated, largely due to high levels of non-occurrence of many behaviours in the coding tool used. CONCLUSION: This study highlights the need to exercise caution when ascertaining the reliability of observational studies and demonstrates the value in using multiple methods for calculating inter-rater reliability. The paper concludes with an examination of the potential merits of using consensus coding in observational studies of interactions with people with congenital deafblindness or profound intellectual and multiple disabilities.

Journal of Intellectual Disability Research (JIDR), 2012 · doi:10.1111/j.1365-2788.2011.01503.x