Video Modeling: A Practitioner's Guide for BCBAs, RBTs, and School Teams
Video modeling (VM) is a behavioral teaching procedure in which a recorded demonstration of a target behavior is presented to a learner prior to an opportunity to imitate; the learner observes the model, is given a chance to perform the behavior, and receives reinforcement for correct imitation. It is one of the most extensively researched technology-based interventions in ABA, with evidence spanning communication, social skills, daily living, academic, vocational, and staff-training domains. The BACB, NPDC, and multiple systematic reviews classify VM as an established evidence-based practice for autistic learners across age ranges. Procedurally, VM functions as an antecedent-based stimulus-control intervention: the video establishes a discriminative stimulus for the target response, and correct imitation contacts the programmed reinforcer. Unlike in-vivo modeling, the video is replicable, pausable, reviewable, and easily delivered via tablet or smartphone in any setting.
01 What the Research Says
What video modeling actually is
Video modeling requires a learner to observe a recorded demonstration of a target behavior before being given an opportunity to imitate. The core procedure has five operational components: (1) video production — selecting or recording a clip that clearly depicts the target skill at a pace and angle the learner can track; (2) viewing protocol — how and how many times the clip is shown per session, including whether viewing is passive or prompted; (3) opportunity to imitate — an immediate, unambiguous chance to perform the behavior shown; (4) reinforcement — contingent delivery of a consequence for correct imitation; and (5) error correction — a structured response to incorrect or absent imitation, typically replaying the clip or providing a prompt. VM sits within the broader observational-learning and stimulus-control literature: the model is the SD, the opportunity-to-imitate is the trial, and mastery is assessed via percentage of independent correct responses source.
The procedural rationale for VM over purely spoken or picture-based instruction is that many skill targets — especially complex multi-step sequences or social interactions — are dynamic and do not reduce cleanly to static images or verbal descriptions. A 40–60 second clip of a person ordering food in a restaurant demonstrates posture, eye contact, intonation, and sequencing simultaneously in a way that a task analysis card cannot source.
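The mastery measure named above, percentage of independent correct responses, can be sketched in a few lines. This is a minimal illustration, not a prescribed data system: the `Trial` record, the 80% criterion, and the two-consecutive-sessions rule are hypothetical placeholders that a team would replace with the values in the learner's program.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    correct: bool   # learner performed the modeled behavior correctly
    prompted: bool  # a prompt was delivered during the imitation opportunity


def percent_independent_correct(trials):
    """Percentage of trials that were both correct and unprompted."""
    if not trials:
        raise ValueError("no trials recorded")
    independent = sum(1 for t in trials if t.correct and not t.prompted)
    return 100.0 * independent / len(trials)


def meets_criterion(session_scores, criterion=80.0, consecutive=2):
    """True if the most recent `consecutive` sessions are all at or above criterion.

    Both defaults are assumed example values, not taken from the studies cited.
    """
    if len(session_scores) < consecutive:
        return False
    return all(score >= criterion for score in session_scores[-consecutive:])


# One session of five imitation opportunities following the clip.
session = [
    Trial(correct=True, prompted=False),
    Trial(correct=True, prompted=True),    # correct, but prompted: not independent
    Trial(correct=False, prompted=False),
    Trial(correct=True, prompted=False),
    Trial(correct=True, prompted=False),
]
print(percent_independent_correct(session))  # 60.0
```

Scoring prompted-correct responses separately from independent-correct responses keeps the graph honest about whether the video, rather than the prompt, is controlling the response.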
Variants: standard VM, video self-modeling, point-of-view VM
Standard video modeling uses a third-person perspective with a model other than the learner — a peer, adult, or animated character — performing the target skill. The learner observes and then imitates. This is the most widely published variant and generalizes readily to novel people and materials when multiple exemplars are embedded in the clip.
Video self-modeling (VSM) presents the learner with a recording of themselves accurately performing the target skill. The clip may be edited to remove errors or to show the learner performing at a level slightly above their current repertoire (feedforward VSM). VSM is particularly useful for learners who are already capable of emitting components of a skill but do not string them together reliably, or for learners who respond better to seeing themselves than to a novel model. Vascelli and Berardo demonstrated that VSM delivered via telehealth, with a parent facilitating, taught a 12-year-old with Dravet syndrome to sequence numbers, position facial features, and read three-letter words in a multiple-baseline design, with gains maintained at a two-month follow-up source.
Point-of-view (POV) video modeling records the clip from the first-person camera angle, showing hands performing tasks as the learner would see them if they were executing the skill. POV VM is especially suited to daily living and vocational tasks where spatial orientation matters — tasks like food preparation, laundry, or workplace routines. Aldi and colleagues used POV VM plus least-to-most prompting to teach two adolescent males with ASD three activities of daily living; both reached criterion, though performance fell below mastery at a one-month follow-up probe without additional practice, highlighting the need for planned maintenance procedures source.
Interactive video modeling augments standard or POV VM with embedded prompts, text overlays, voiceover narration, or embedded choice points. Adding voiceover narration and on-screen text to a VM clip consistently improves staff training outcomes: Yarzebski and Dickson found a single generic video model with voiceover narration and on-screen text trained three caregivers to implement graduated guidance in under 26 minutes each, without additional coaching, rehearsal, or feedback, and the training generalized to untrained tasks source.
Animated vs. human models
A longstanding practitioner question is whether the model must be human or whether animated characters are effective. Bloh and colleagues compared human and animated video models for teaching intraverbal responding and motor imitation to eight children with ASD aged 6–10, using an adapted multiple-baseline design with 40–60-second clips repeated three times per session in counterbalanced order. Both modalities produced gains; neither was consistently superior across participants. The finding supports rotating human and animated clips within the same teaching sequence without compromising effectiveness, and suggests model type need not be a bottleneck when producing training materials source.
The practical implication: when access to a skilled human model is limited, animated clips created with tools like Vyond are a viable alternative. Linnehan demonstrated this for emotion recognition, using animated video clips depicting varied fear and anger scenarios to teach two autistic adolescents to identify and label emotional states, embedding the clips within a prompt-fade and intraverbal-questioning sequence source.
Effectiveness: meta-analytic context and key SCED evidence
The VM literature has been subject to multiple systematic reviews and meta-analyses. The broad consensus across these bodies of work is that VM meets evidence-based practice criteria for autistic learners across communication, social, daily living, and vocational targets. Effect sizes in single-subject work are large, acquisition is typically faster than in-vivo modeling, and gains generalize more readily than with picture or text instruction.
Within the extracted corpus, the most relevant effectiveness data come from single-subject experiments:
- Communication and verbal behavior. VM produces reliable gains in intraverbal responding, conversational speech, and vocalization for learners with ASD. The Charlop and Milstein (1989) foundational study teaching conversational speech via VM is referenced in multiple corpus papers as a benchmark source.
- Social and play skills. Ezzeddine and colleagues evaluated VM for teaching play comments to dyads of children with ASD. Video modeling alone brought three of six learners to mastery; the other three required added reinforcement and prompting, illustrating that VM is frequently sufficient but not universally so source.
- Activities of daily living. POV VM with prompts taught ADL skills to two adolescent males; both met criterion, with performance above baseline but below criterion at one-month follow-up without maintenance procedures source. Portable VM via smartphone taught four daily living skills to a young adult with intellectual disability in a transition program, with increased independence across all four targets source.
- Vocational and employment skills. McLucas and colleagues combined VM with behavioral feedback to teach vocational social skills — including requesting missing materials — to four autistic transition-aged youth (ages 15–20), achieving rapid mastery that generalized across novel tasks and materials in a simulated workplace source.
- Staff and caregiver training. Across 18 experiments identified by Marano and colleagues, VM produced positive training outcomes in roughly 44% of studies and mixed outcomes (some but not all staff reached mastery) in 55%, indicating VM alone is reliable for many staff but that about half of cases will need added feedback or rehearsal source. Adding nonexemplars — clips showing the errors staff most commonly make — to VM training significantly increased and maintained procedural integrity compared to exemplar-only VM in a multi-element design with six early-career staff source.
Video modeling vs. in-vivo modeling
The classic comparison is Charlop-Christy, Le, and Freeman's study, which showed that VM produced faster acquisition than in-vivo modeling across multiple skill targets with children with autism and was effective in promoting generalization. This finding, referenced across multiple corpus papers, established VM as more efficient than live demonstration for a significant proportion of learners source.
The mechanism is likely a combination of: (a) the video's stimuli are consistent across repetitions — no variation in model behavior, tone, or pacing from trial to trial; (b) the learner controls the discriminative stimulus by watching, pausing, and rewatching, which may improve attending; and (c) the interval between viewing and imitation opportunity is brief and experimentally clean.
In-vivo modeling retains advantages when the behavior requires real-time responsiveness (e.g., a conversational back-and-forth that must respond to the learner's specific output), when the model needs to demonstrate physical guidance or graduated guidance in real time, or when the learner does not attend to screens. VM and in-vivo modeling are not mutually exclusive; combining a brief VM viewing with in-vivo prompting is common in ADL instruction.
Video modeling vs. video prompting
Video prompting — a related but distinct procedure — presents clips of individual steps as the learner works through the task, rather than showing the complete task before imitation. Thomas and colleagues directly compared VM and video prompting in an alternating-treatments design with four adolescents with ASD (ages 13–18) learning complex meal-preparation skills. Video modeling produced faster mastery and approximately 60% fewer errors than video prompting, with comparable on-task behavior source.
The practical decision point: use full-task VM when the learner can hold the complete task representation in memory and attend through the full clip; use video prompting when the task is long or complex and the learner benefits from step-by-step on-demand support during execution.
Variables that affect outcomes
Model fidelity. The clip must depict the target behavior accurately, at a pace the learner can track, with the relevant stimuli clearly visible. For staff training, adding nonexemplars alongside correct exemplars for the steps most commonly performed incorrectly substantially increases and sustains procedural integrity source. For learner programs, using a model whose physical characteristics and skill level are too discrepant from the learner can reduce imitation.
Learner attending. VM only works if the learner watches. Practitioners should verify visual attention to the screen before beginning sessions and use preferred characters, scenarios, or self as model when attention is low. For some learners, pointing to the screen or using a brief attending prompt before the clip improves subsequent imitation.
Voiceover and on-screen text. Adding narration and text to a VM clip consistently improves outcomes for both learners and staff trainees, likely because the audio channel supplements visual information and helps the learner track what to attend to. Day-Watkins and colleagues showed voiceover VM integrated with behavioral skills training produced durable gains in staff ability to implement social-skills instruction across 2–3 sessions with 100% accuracy maintained at follow-up source.
Viewing perspective. First-person (POV) and third-person cameras affect performance differently depending on the task and learner. Quinn and colleagues found that for one of four competitive dancers, the viewing perspective of the model determined whether improvement occurred source. When a learner fails to acquire from standard VM, assessing whether POV or a different model perspective produces better responding is a data-driven troubleshooting step.
Prompts during and after viewing. VM is typically implemented as a relatively clean antecedent: show the clip, then present the opportunity. Adding prompts during the imitation opportunity follows standard prompting hierarchy principles. Importantly, evidence indicates that providing prompts or hints during the viewing itself (not just after) can disrupt the observational-learning process by creating prompt dependency before the imitation trial. The exception is when the clip itself embeds prompts as an instructional design feature (interactive VM).
Number of repetitions. Multiple corpus studies show 40–60-second clips repeated 2–3 times per session, with counterbalanced order across modalities or exemplars, produces reliable acquisition without excessive session time source.
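A counterbalanced viewing schedule of the kind described above can be generated rather than written by hand. The sketch below is one simple way to do it, assuming a small set of clips (two or three) so that cycling through every ordering is practical; the function name and the clip labels are illustrative only.

```python
from itertools import cycle, permutations


def counterbalanced_schedule(clips, n_sessions, repetitions=3):
    """Return one viewing list per session, rotating through all clip orderings.

    Within a session, that session's ordering is repeated `repetitions` times,
    so every clip is viewed `repetitions` times per session.
    """
    if not clips:
        raise ValueError("need at least one clip")
    orderings = cycle(permutations(clips))
    schedule = []
    for _ in range(n_sessions):
        order = next(orderings)
        schedule.append(list(order) * repetitions)
    return schedule


plan = counterbalanced_schedule(["human", "animated"], n_sessions=4)
for number, views in enumerate(plan, start=1):
    print(number, views)
```

With two clips, each one leads in half of the sessions, which keeps order effects from being confounded with clip type. The number of orderings grows factorially with the number of clips, so this approach suits only short clip lists.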
Video modeling with added feedback
For complex motor or social skills, VM alone may produce sub-proficient performance, particularly when the skill involves precise motor timing or fine-grained feedback. Quinn and colleagues found VM alone improved dance-skill performance modestly for competitive dancers, while adding video feedback — the learner viewing a recording of their own performance — produced large additional gains source. Similarly, Capalbo and colleagues found VM alone was insufficient for achieving proficient goalkeeper skills in youth soccer; pairing VM with video feedback of the learner's own attempts was necessary for robust gains source.
For practitioners: when VM alone is not moving performance data, adding video feedback of the learner's own performance is the logical next step before adding more intrusive prompting.
Self-instructional video modeling for adolescent and adult learners
Adolescent and adult learners can be taught to use VM independently — a self-instructional variant in which the learner accesses the clip on demand, selects the relevant exemplar, and uses it as a job aid without ongoing staff involvement. Athorp and colleagues demonstrated this with portable VM on a personal device for a young adult with intellectual disability in a university transition program: the learner carried a smartphone with VM clips for four daily living skills and completed tasks independently across the campus setting source.
For employment settings, McLucas and colleagues' VM-plus-feedback protocol for vocational social skills showed 1–2 minute researcher-made clips on standard phones were sufficient as stimuli, with generalization to novel work tasks and materials source. The self-instructional design matters for transition goals because it builds independence rather than prompt dependency on a staff member's demonstration.
Mobile and tablet implementation
Modern VM practice is predominantly mobile. Smartphones and tablets allow: (a) storing clips in a learner-accessible library; (b) presenting VM in natural contexts immediately before task performance; (c) recording learner performance for video feedback; (d) sharing clips with caregivers and school staff for cross-setting consistency. The portable VM literature consistently shows equivalent or superior outcomes compared to computer-based presentation, with the added benefit of ecological validity — the learner practices in the real setting where the skill will be used.
Technical production no longer requires professional equipment. Many published studies use consumer smartphones, basic video editing apps, and non-professional actors. The key production standards are: sufficient lighting to see the target behaviors clearly; camera angle that displays the relevant stimuli (first-person for manual tasks, third-person for social interactions); duration short enough to maintain learner attention (40–90 seconds in most published protocols); and audio quality sufficient for narration if included.
Caregiver and parent delivery
VM is uniquely well-suited to caregiver delivery because it does not require the caregiver to perform a live flawless demonstration — the clip delivers the consistent exemplar, and the caregiver manages the opportunity-to-imitate, prompting, and reinforcement. This separation reduces caregiver skill requirements compared to in-vivo modeling.
Yarzebski and Dickson demonstrated that a single exemplar video with voiceover narration and on-screen text trained three caregivers of children with ASD to implement graduated guidance in under 26 minutes, with generalization to untrained activities and high social validity ratings source. The training required no additional coaching, verbal feedback, or rehearsal beyond the video itself.
Vascelli and Berardo showed VSM delivered via telehealth, with a parent trained via Skype to reach 90% fidelity in role-play before running sessions, produced skill acquisition maintained at two months for a child with Dravet syndrome source. The parent-facilitated telehealth VM format is particularly relevant for families in underserved regions where in-person BCBA access is limited.
Sibling-mediated VM adds another home-delivery pathway. Neff and colleagues showed VM alone — without direct coaching — taught typically developing siblings (ages 7–11) to prompt and reinforce appropriate play with their brothers or sisters with autism; two of three dyads met mastery criteria, and on-task play for the child with autism increased source.
Culturally matched videos matter. Vargas Londono and colleagues demonstrated that session-by-session videos created in the caregiver's native language (Spanish or English) maintained BST training fidelity during telehealth delivery for Latino caregivers of autistic children, with no significant outcome difference by language version, supporting the practice of producing VM stimuli that match the family's language and cultural context source.
Staff training applications
VM has become a standard component of behavioral skills training for ABA staff. Documented staff training targets include: DTT implementation, preference assessment administration, graduated guidance, performance feedback delivery, functional communication training, assent-withdrawal support, and tact training. Across this literature, the common finding is that VM alone reaches mastery for approximately 40–55% of trainees; the remainder require added rehearsal and feedback.
Bovi and colleagues showed VM with voiceover instruction trained public school staff to implement a paired-stimulus preference assessment to 90% procedural integrity, maintained at eight weeks post-training source. Shuler and Carroll showed VM with voiceover trained BCBAs to deliver accurate performance feedback in 1–2 exposures, with three of four supervisors reaching mastery source. Taber, Lambright, and Luiselli showed a single 2-minute VM session reliably increased two educational care-providers' use of student-preferred attention functions, with generalized gains in untrained attention forms source.
Adding nonexemplars — clips that show the specific errors staff make — is the highest-leverage modification for organizations seeking durable procedural integrity. Bartle, Ruby, and DiGennaro Reed found that VM with exemplars plus nonexemplars for high-error steps significantly raised and sustained staff procedural integrity for DTT and MSWO relative to exemplar-only VM, with no added trainer time required beyond editing the existing video source. Identify the steps staff most often get wrong from post-training data, splice in brief nonexemplar clips showing those errors, and staff performance improves without additional supervision.
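Identifying the steps staff most often get wrong is a small tally exercise. The following is a minimal sketch, assuming post-training observations are available as (step, performed correctly) pairs pooled across staff. The step names and the 25% flagging threshold are hypothetical and are not drawn from Bartle and colleagues' procedure.

```python
from collections import defaultdict


def high_error_steps(observations, threshold=0.25):
    """Return [(step, error_rate)] for steps at or above threshold, worst first.

    observations: iterable of (step_name, performed_correctly) pairs.
    threshold: assumed cutoff for flagging a step as a nonexemplar candidate.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for step, performed_correctly in observations:
        totals[step] += 1
        if not performed_correctly:
            errors[step] += 1
    rates = {step: errors[step] / totals[step] for step in totals}
    flagged = [(step, rate) for step, rate in rates.items() if rate >= threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical integrity data: four observations per step.
data = (
    [("present instruction", True)] * 4
    + [("deliver prompt", True)] * 2 + [("deliver prompt", False)] * 2
    + [("record data", True)] * 3 + [("record data", False)] * 1
)
print(high_error_steps(data))
# [('deliver prompt', 0.5), ('record data', 0.25)]
```

The flagged steps are the candidates for brief nonexemplar clips; steps performed correctly across the board need no added footage.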
Despite these efficiencies, VM remains underused in ABA fieldwork. A 2025 survey of 137 ABA trainees found approximately 34% reported rarely or never receiving VM during supervision, even though trainees valued modeling practices source. The gap between available evidence and routine practice is a system-level problem, not a learner-selection problem.
02 Evidence Tier Breakdown
The VM evidence base is dense at the single-subject level, supported by systematic reviews and meta-analyses, with a small head-to-head comparative literature. Large RCTs are absent, which is typical for behavior-analytic procedures, where single-subject demonstration and replication across participants are the methodological standard.
Systematic reviews and meta-analyses. Multiple published meta-analyses (Bellini & Akullian, 2007; Mason et al., 2013) established VM as an evidence-based practice for autistic learners across communication, social, and daily living targets. Corpus papers reference these foundational reviews without reprinting their data. Marano and colleagues' review of 18 staff-training VM experiments found positive outcomes in 44% and mixed outcomes in 55%, with variability linked to whether supplemental characteristics (voiceover, on-screen text, rehearsal) were included source.
Comparative experimental work. Thomas and colleagues' alternating-treatments comparison of VM versus video prompting (n=4, meal preparation in adolescents with ASD) is the strongest within-corpus head-to-head, favoring VM on mastery speed and error rate source. Quinn and colleagues compared VM alone vs. VM-plus-video-feedback for complex motor skills in dancers (n=4), with the combined package producing larger gains source. Capalbo and colleagues replicated this pattern for soccer goalkeeping skills (n=2) source. Bloh and colleagues compared human vs. animated VM (n=8, intraverbal/imitation), finding no consistent superiority for either modality source.
Single-subject experimental designs. The main evidence layer.
- Vocational: McLucas et al. (n=4) source.
- Daily living: Aldi et al. POV VM (n=2) source; Athorp et al. portable VM (n=1) source.
- Social/play: Ezzeddine et al. play comments (n=6) source.
- VSM: Vascelli & Berardo parent-telehealth VSM (n=1) source.
- Sibling-mediated VM: Neff et al. (n=6 in 3 dyads) source.
- Staff training: Bartle et al. exemplar+nonexemplar VM (n=6) source; Taber et al. attention delivery (n=2) source; Bovi et al. preference assessment VM (n not specified) source; Shuler & Carroll supervisor feedback VM (n=4) source; Day-Watkins et al. VMVO+BST social skills (n=3) source.
- Caregiver training: Yarzebski & Dickson graduated guidance VM (n=3) source.
Survey and descriptive. Čolić and colleagues' trainee survey (n=137) establishes the underuse of VM in ABA fieldwork supervision source.
Bottom line. The evidence strongly supports VM as an efficient, broadly effective teaching procedure for learners with ASD and for staff and caregiver training. The corpus is clear on two boundary conditions: VM alone is insufficient for a subset of learners (who need added prompting, reinforcement, or feedback); and maintenance requires planning — gains at mastery do not automatically persist without post-training procedures.
03 Across Settings
Clinic and outpatient skill acquisition
Clinic sessions are where VM is easiest to implement with procedural fidelity: controlled viewing conditions, consistent session structure, and immediate staff support for the imitation opportunity. VM is well-suited as an antecedent for discrete trial teaching (DTT) in the clinic when the target behavior is observationally learned rather than purely discrimination-based. DTT and VM are complementary, not competing: VM establishes the target response class; DTT provides massed practice trials with programmed consequences.
For communication and verbal behavior targets in the clinic, short VM clips showing vocal intraverbals, conversational exchanges, or social routines are among the most replicated applications. Presenting clips 2–3 times per session with counterbalanced ordering across exemplars is a standard format source.
Classroom: life skills and social skills
In special-education classrooms, VM addresses two overlapping targets: (a) life skills, for which POV VM on a tablet is particularly effective because the learner accesses clips during task execution in the natural classroom or kitchen environment; and (b) social skills, including initiating interactions, commenting during play, and greeting routines, where third-person VM shows a peer or adult model performing the behavior in a recognizable social context.
For social skills, the evidence indicates VM alone is often sufficient for learners with moderate to high social learning readiness, while learners with lower social-observation skills benefit from pairing VM with explicit reinforcement and prompting following each viewing source. Emotion recognition is a consistent VM target in classrooms; animated video stimuli depicting diverse everyday scenarios for fear, anger, and social situations can be produced with accessible tools and can systematically vary facial and contextual cues source.
Home delivery
VM's key advantage at home is that it replaces the need for the parent to perform a perfect live demonstration. A parent who plays a VM clip for the learner is not being asked to model the skill themselves; they are managing the opportunity-to-imitate and the consequence. This substantially lowers the skill demand on the caregiver while maintaining consistency in the antecedent. Caregiver-delivered VM trained via telehealth is well-documented: Yarzebski and Dickson showed a single video was sufficient; Vascelli and Berardo showed telehealth-trained parent delivery of VSM produced durable gains source source. Sibling-mediated VM adds another home-based pathway without requiring ongoing BCBA or caregiver mediation of each session once the sibling has been trained source.
Vocational training
VM is particularly efficient for vocational settings because: (a) job tasks are discrete and observable, lending themselves to task analysis and step-by-step or whole-task VM; (b) adults with ASD and intellectual disability often prefer models over abstract verbal instruction; and (c) smartphones allow portable, on-demand access to job-aid videos at the worksite. McLucas and colleagues showed 1–2 minute VM clips for vocational social behaviors — requesting missing materials, problem-solving communication — generalized across novel tasks and materials source. Athorp and colleagues demonstrated the self-instructional model: a learner carries their own device, accesses their own clips, and performs tasks independently source.
04 Common Pitfalls
- Low video quality or obscured target behaviors. If the relevant stimulus is off-camera, in shadow, or too brief, the learner cannot form the discriminative control needed for imitation. Record clips at consistent lighting, appropriate angle, and with the target response fully visible.
- No structured imitation opportunity immediately after viewing. VM is an antecedent; it must be followed immediately by a clear, unambiguous opportunity to perform the target behavior. Showing the clip and then waiting several minutes or shifting to a different task removes the functional connection between observation and imitation.
- Fading VM before stimulus generalization is established. Practitioners sometimes remove VM access once mastery is reached in the training context, only to find the skill does not transfer to novel settings, people, or materials. Plan generalization probes before fading, and consider keeping VM as a portable self-instruction tool in novel settings.
- Model selection that does not match the learner's characteristics. A model who is too different from the learner in age, gender, or observable characteristics can reduce imitation rates, particularly for social skills. When possible, use same-age, same-gender peers as models for social-skills VM.
- VM-only staff training for high-stakes procedures. VM alone produces mixed outcomes for staff — roughly half reach mastery without additional support. For procedures requiring high integrity (functional analysis, preference assessment, behavior-reduction procedures), plan to include rehearsal and feedback after VM viewing rather than assuming mastery from video alone source.
- Exemplar-only video content for staff training. Failing to include nonexemplar clips for the steps staff most commonly err on leaves procedural integrity drift unaddressed. Scan post-training data, identify high-error steps, and splice nonexemplars targeting those specific steps source.
- No maintenance planning after mastery. Aldi and colleagues found POV VM produced ADL mastery that fell below criterion at one-month follow-up when no maintenance procedures were in place source. Plan brief post-mastery probes and booster sessions before discharge.
- Ignoring viewing perspective for individual learners. Most programs default to third-person VM, but some learners perform better with POV and vice versa. A brief assessment comparing imitation rates across perspectives can identify the more effective format before committing to a production approach.
- Passive viewing without confirmed attending. A learner who is physically present but not attending to the screen will not acquire from VM. Confirm visual attention before starting the clip; use attending prompts only as needed, not routinely, to preserve the clean antecedent function of the video.
05 Decision Logic: VM vs. In-vivo Modeling vs. DTT
When to prioritize video modeling:
- Target behavior is a complex, multi-step, or dynamic skill that a video can depict more consistently than a live model.
- The same clip will be delivered across settings, contexts, or caregivers — consistency across implementers is a program goal.
- The learner has a history of observational learning from video (TV, YouTube, screen time) suggesting screen-based modeling will be attended to.
- Caregiver or staff skill in live modeling is variable — VM standardizes the exemplar regardless of who facilitates the session.
- Self-instructional access is a goal (transition-age or adult learner who will carry a device and access VM independently on the job or at home) source.
When to prioritize in-vivo modeling:
- The behavior requires real-time responsiveness to the learner's specific output (fluid conversation, graduated physical guidance).
- The learner does not attend to screens or does not imitate from video despite adequate attending.
- The target involves subtle real-world physical guidance that a 2D clip cannot capture (e.g., hand shaping during writing instruction).
When to prioritize DTT without VM as the primary antecedent:
- The target is a discrimination or matching task, not an observational-learning target — there is no model for the learner to imitate.
- The learner's current repertoire is below the threshold needed to observe and replicate even simple modeled behaviors.
- High trial frequency is the primary therapeutic variable — DTT's massed-trial structure is more efficient than VM for most prerequisite discrimination skills.
VM + DTT combined: For communication targets (intraverbals, conversational routines), VM as the antecedent for DTT trials is a common and well-supported format. Show the clip, then run a trial immediately; here VM functions as an antecedent stimulus that increases the salience of the target response class before trials begin source.
VM + in-vivo prompting combined: For ADL and daily living targets, POV VM followed immediately by least-to-most prompting during the task is a documented and effective combination. The VM provides the complete task model; the prompts address individual steps where the learner needs support source.
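The decision points in this section can be condensed into a small rule function. This is a simplified sketch of the logic as written here, not a validated clinical algorithm: the `Case` fields are hypothetical yes/no summaries of the bullet points above, and real cases will often call for the combined formats just described.

```python
from dataclasses import dataclass


@dataclass
class Case:
    target_is_imitable: bool             # a model exists to imitate (not pure discrimination or matching)
    learner_imitates_models: bool        # can observe and replicate simple modeled behavior
    attends_to_screens: bool             # watches and imitates from video
    needs_realtime_responsiveness: bool  # e.g., fluid conversation, physical guidance


def primary_antecedent(case):
    """Suggest a primary teaching format from the section's decision points."""
    # No model to imitate, or imitation repertoire not yet in place: DTT first.
    if not case.target_is_imitable or not case.learner_imitates_models:
        return "DTT"
    # A live model is needed when the demonstration must adapt to the learner,
    # or when the learner does not attend to or imitate from screens.
    if case.needs_realtime_responsiveness or not case.attends_to_screens:
        return "in-vivo modeling"
    return "video modeling"


meal_prep = Case(
    target_is_imitable=True,
    learner_imitates_models=True,
    attends_to_screens=True,
    needs_realtime_responsiveness=False,
)
print(primary_antecedent(meal_prep))  # video modeling
```

The ordering of the checks mirrors the text: prerequisite and task-type questions come first, then delivery constraints, with video modeling as the result when neither applies.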
06 Practitioner Takeaways
VM is an antecedent, not a standalone intervention. Its function depends entirely on the opportunity-to-imitate and reinforcement that immediately follow. The clip must be followed by a clear, unambiguous trial source.
Match variant to target and learner. Standard VM for novel skills with a competent available model; VSM for skills where the learner already has component behaviors but does not chain them; POV VM for daily living and vocational tasks with spatial orientation demands; interactive VM (with voiceover and on-screen text) for staff and caregiver training source.
Add voiceover narration and on-screen text for training adults. Across the staff and caregiver training literature, adding narration and text consistently improves outcomes without adding trainer time source source.
For staff training, scan post-training data and add nonexemplar clips for high-error steps. Exemplar-only VM does not prevent common errors from recurring; nonexemplars targeting those specific steps do, without requiring additional live supervision time source.
Animated and human models are comparably effective for most learners. If producing a live-action clip is difficult, animated stimuli created with accessible tools are a valid alternative source.
VM alone produces mastery for roughly half of learners; plan for the other half. For learners who do not acquire from VM alone, add explicit reinforcement, modeling plus prompting, or video feedback of the learner's own performance before escalating to more intensive procedures source source.
Adding video feedback of the learner's own performance accelerates gains for complex motor and social skills. When VM alone is not producing proficiency, the next modification to try is showing the learner a recording of their own performance immediately after the model clip source source.
Plan maintenance before fading VM access. Run maintenance probes at one month post-mastery; design brief booster exposure or portable self-access before discharge rather than assuming maintenance will be automatic source.
Deliver VM on mobile devices in natural contexts. Smartphone-based VM presented immediately before the task, in the setting where the skill will be performed, improves generalization and supports self-instructional independence for adolescent and adult learners source source.
Caregiver delivery is feasible with a single well-designed clip plus brief telehealth coaching. The video handles the modeling demand; the caregiver manages the trial structure. This is a sustainable home-based format that does not require high caregiver modeling skill source source.
Match model to learner demographics for social skills targets. Same-age, same-gender models improve social validity and imitation rates; this matters most for peer-interaction and communication targets.
VM is underused in ABA supervision despite a strong evidence base. About 34% of ABA trainees report VM is rarely or never used in their fieldwork supervision; supervisors should deliberately include VM demonstrations as a BST component for direct-care and assessment procedures source.
07 FAQ
What is the difference between video modeling and video prompting? Video modeling presents the complete target behavior as a full demonstration before the learner attempts the skill. Video prompting presents individual steps of a task one at a time as the learner works through the sequence. Thomas and colleagues compared the two directly for meal-preparation skills in adolescents with ASD; VM produced faster mastery and approximately 60% fewer errors, while video prompting produced comparable on-task behavior. Use VM when the learner can represent and reproduce the full task; use video prompting when the task is long or the learner requires step-by-step on-demand support source.
Can animated models be used instead of human models? Yes. Bloh and colleagues' comparison of human and animated VM for eight children with ASD found both produced gains in intraverbal responding and motor imitation, with no consistent superiority for either modality. Practitioners can use animated clips created with tools like Vyond when live-action recording is difficult, without expecting reduced effectiveness source.
Does video self-modeling (VSM) require the learner to perform the skill perfectly for recording? Not necessarily. In feedforward VSM, the learner can be prompted during recording, and the prompts and errors are then edited out so that the final clip shows the learner performing the skill at a level above their current independent baseline. The resulting clip shows a competent model who happens to be the learner. For simpler targets, recording an existing skill and feeding it back in context is sufficient. In Vascelli and Berardo's telehealth VSM example, the parent was trained to record five-trial sessions and relay the clips to the experimenter for selection, and the mother then presented the chosen clips at each subsequent session source.
When should VM be combined with video feedback? Add video feedback — showing the learner a recording of their own recent performance — when VM alone is not producing proficiency, particularly for complex motor skills, athletic training, or performance arts. Quinn and colleagues and Capalbo and colleagues both found VM alone produced modest improvement while VM plus video feedback of the learner's own attempts produced large gains source source. Conduct a brief assessment: if 3–4 VM sessions do not move performance data, add video feedback before escalating to other procedures.
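The "3–4 sessions without movement" check can be written as a small data rule. This is a minimal sketch under stated assumptions: the function name, the use of percentage-correct scores, and especially the `min_gain` threshold (10 percentage points over baseline) are illustrative choices, since the text does not define how much change counts as "moving" the data.

```python
from typing import Sequence


def should_add_video_feedback(
    baseline_pct: float,
    vm_session_pcts: Sequence[float],
    min_sessions: int = 3,    # lower bound of the 3-4 session window in the text
    min_gain: float = 10.0,   # ASSUMPTION: points above baseline that count as movement
) -> bool:
    """Return True when recent VM-only sessions show no meaningful gain."""
    if len(vm_session_pcts) < min_sessions:
        return False  # not enough VM sessions yet to judge
    recent = vm_session_pcts[-min_sessions:]
    best_gain = max(recent) - baseline_pct
    return best_gain < min_gain


flat = should_add_video_feedback(20.0, [20.0, 25.0, 22.0])
rising = should_add_video_feedback(20.0, [20.0, 40.0, 60.0])
```

Here `flat` is True (the best recent session is only 5 points above baseline) and `rising` is False. Visual analysis of the graphed data remains the primary decision tool; a rule like this only flags cases for review.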
How many times should the clip be shown per session? Two to three repetitions of a 40–60 second clip per session is the most common format in published protocols, with counterbalanced ordering when multiple exemplars are used. Showing the clip once may be insufficient for complex behaviors; showing it more than three times without an imitation opportunity in between may produce passive attending without active observational learning source.
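When several exemplar clips are used, one simple way to counterbalance their order is to rotate the starting clip each session. The sketch below assumes rotation is an acceptable counterbalancing scheme for the protocol at hand; the function name and clip labels are hypothetical.

```python
from collections import deque
from typing import List, Sequence


def counterbalanced_orders(exemplars: Sequence[str], n_sessions: int) -> List[List[str]]:
    """Rotate the clip order by one position per session.

    Over any run of len(exemplars) consecutive sessions, each clip
    appears once in every serial position.
    """
    queue = deque(exemplars)
    orders = []
    for _ in range(n_sessions):
        orders.append(list(queue))
        queue.rotate(-1)  # move the first clip to the end for the next session
    return orders


schedule = counterbalanced_orders(["clip_A", "clip_B", "clip_C"], 3)
# [['clip_A', 'clip_B', 'clip_C'],
#  ['clip_B', 'clip_C', 'clip_A'],
#  ['clip_C', 'clip_A', 'clip_B']]
```

Each session's list gives the viewing order only; the imitation opportunity after viewing is still run as described above.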
Is VM sufficient for staff training without added rehearsal and feedback? For approximately 40–45% of staff, VM alone produces mastery. For the remainder, added rehearsal or feedback is needed. The most efficient upgrade is adding nonexemplar clips for high-error steps based on post-training data source. For complex behavioral procedures like functional analysis, VM as a standalone training method is unlikely to produce clinically adequate fidelity; combine VM with behavioral rehearsal source.
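Finding the high-error steps in post-training data is a simple tally. The sketch below assumes each staff member's post-training probe is scored step by step as correct or incorrect; the data layout, the step names, and the 50% error-rate cutoff are illustrative assumptions, not values taken from the studies cited.

```python
from collections import Counter
from typing import Dict, List


def high_error_steps(records: List[Dict[str, bool]], threshold: float = 0.5) -> List[str]:
    """List steps whose error rate meets the threshold, worst first.

    records: one dict per staff probe, mapping step name -> True if the
    step was implemented correctly. threshold is an ASSUMED cutoff.
    """
    errors: Counter = Counter()
    totals: Counter = Counter()
    for record in records:
        for step, correct in record.items():
            totals[step] += 1
            if not correct:
                errors[step] += 1
    rates = {step: errors[step] / totals[step] for step in totals}
    flagged = [step for step, rate in rates.items() if rate >= threshold]
    return sorted(flagged, key=lambda step: rates[step], reverse=True)


probes = [
    {"present items": True, "record choice": False},
    {"present items": True, "record choice": False},
    {"present items": False, "record choice": True},
]
targets = high_error_steps(probes)  # ['record choice']
```

The steps returned are candidates for nonexemplar clips; which errors to depict in those clips still comes from reviewing what staff actually did wrong.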
Can parents deliver VM at home without in-person BCBA support? Yes, when trained via telehealth. The critical components are: (1) training the parent to reach acceptable fidelity in managing the opportunity-to-imitate, prompting, and reinforcement — not in performing a live model; (2) providing a single well-constructed VM clip that handles the modeling demand consistently; (3) brief coaching via video call for initial sessions to correct procedural drift. Yarzebski and Dickson demonstrated this format required less than 26 minutes of training per caregiver with no additional support beyond the video source; Vascelli and Berardo extended it to telehealth VSM during a lockdown period with durable results source.
What maintenance procedures are needed after VM-based skill acquisition? At minimum, plan probes at one month post-mastery and build in a brief booster exposure if performance falls below criterion. For ADL and vocational targets in natural settings, converting VM from a training tool to a portable self-instructional job aid — accessible on a personal device on demand — is the most functional maintenance strategy because it keeps the discriminative stimulus available without ongoing staff involvement. Aldi and colleagues found that without planned maintenance procedures, ADL performance fell below mastery criterion at one month while remaining above baseline source.
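The minimum maintenance plan described above can be sketched as a date calculation plus a criterion check. The function name, the 30-day approximation of "one month", and the 90% default criterion are assumptions for illustration; substitute the mastery criterion written into the learner's program.

```python
from datetime import date, timedelta
from typing import Optional, Tuple


def maintenance_plan(
    mastery_date: date,
    probe_pct: Optional[float] = None,
    criterion_pct: float = 90.0,   # ASSUMPTION: use the program's own criterion
    probe_after_days: int = 30,    # ASSUMPTION: "one month" approximated as 30 days
) -> Tuple[date, str]:
    """Return the probe due date and the next action."""
    probe_date = mastery_date + timedelta(days=probe_after_days)
    if probe_pct is None:
        action = "probe pending"
    elif probe_pct < criterion_pct:
        action = "schedule booster exposure"
    else:
        action = "continue maintenance"
    return probe_date, action


due, action = maintenance_plan(date(2024, 1, 1))
# due is 2024-01-31 and action is "probe pending"
```

Calling it again with the probe score, for example `maintenance_plan(date(2024, 1, 1), probe_pct=70.0)`, returns the booster action because 70 is below the assumed criterion.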
08 References
Aldi, C., Crigler, A., Kates-McElrath, K., Long, B., Smith, H., Rehak, K., & Wilkinson, L. (2016). Examining the Effects of Video Modeling and Prompts to Teach Activities of Daily Living Skills. Behavior Analysis in Practice, 9(4), 384–388. https://doi.org/10.1007/s40617-016-0127-y
Athorp, S. M., Stuart, S. K., & Collins, J. C. (2022). Building Daily Living Skills Through Portable Video Modeling. Education and Treatment of Children, 45(3), 293–297. https://doi.org/10.1007/s43494-022-00077-3
Bartle, G. E., Ruby, S. A., & DiGennaro Reed, F. D. (2025). The effects of video modeling containing different exemplar types on procedural integrity. Journal of Organizational Behavior Management. https://doi.org/10.1080/01608061.2025.2476425
Bloh, C., Bacon, L., Begel, B., Madara, K., & Koller, B. (2025). Comparing human video modeling to animated video modeling for learners with autism. The Analysis of Verbal Behavior, 41, 262–279. https://doi.org/10.1007/s40616-025-00224-y
Bovi, G. M. D., Vladescu, J. C., DeBar, R. M., Carroll, R. A., & Sarokoff, R. A. (2017). Using Video Modeling with Voice-over Instruction to Train Public School Staff to Implement a Preference Assessment. Behavior Analysis in Practice, 10(1), 72–76. https://doi.org/10.1007/s40617-016-0135-y
Capalbo, A., Miltenberger, R. G., & Cook, J. L. (2022). Training soccer goalkeeping skills: Is video modeling enough? Journal of Applied Behavior Analysis, 55(3), 958–970. https://doi.org/10.1002/jaba.937
Čolić, M., Ninci, J., Huntington, R. N., Bristol, R. M., Taylor, G., & Araiba, S. (2025). An investigation of trainees' supervision experiences in applied behavior analysis fieldwork. Behavior Analysis in Practice. https://doi.org/10.1007/s40617-025-01132-2
Day-Watkins, J., Pallathra, A. A., Connell, J. E., & Brodkin, E. S. (2018). Behavior Skills Training with Voice-Over Video Modeling. Journal of Organizational Behavior Management, 38(2–3), 258–273. https://doi.org/10.1080/01608061.2018.1454871
Ezzeddine, E. W., DeBar, R. M., Reeve, S. A., & Townsend, D. B. (2020). Using video modeling to teach play comments to dyads with ASD. Journal of Applied Behavior Analysis, 53(2), 767–781. https://doi.org/10.1002/jaba.621
LaMarca, V. J., & LaMarca, J. M. (2024). Using the ADDIE model of instructional design to create programming for comprehensive ABA treatment. Behavior Analysis in Practice, 17, 371–388. https://doi.org/10.1007/s40617-024-00908-2
Linnehan, A. M. (2025). Teaching autistic adolescents to identify fear and anger: a preliminary study. Behavior Analysis in Practice. https://doi.org/10.1007/s40617-025-01129-x
McLucas, A. S., Som, S., Fleming, J., Ingvarsson, E., & Therrien, W. J. (2024). Using Video Modeling Plus Feedback to Teach Vocational Social Skills to Employment-Aged Autistic Youth. Journal of Behavioral Education. https://doi.org/10.1007/s10864-024-09561-9
Neff, E. R., Betz, A. M., Saini, V., & Henry, E. (2017). Using video modeling to teach siblings of children with autism how to prompt and reinforce appropriate play. Behavioral Interventions, 32(3), 193–205. https://doi.org/10.1002/bin.1479
Ólafsdóttir, A. B., Sveinbjörnsdóttir, B., & Gunnarsson, K. F. (2025). Expanding the pyramidal staff training approach. Journal of Organizational Behavior Management. https://doi.org/10.1080/01608061.2025.2499447
Quinn, M., Narozanick, T., Miltenberger, R., Greenberg, L., & Schenk, M. (2020). Evaluating video modeling and video modeling with video feedback to enhance the performance of competitive dancers. Behavioral Interventions, 35(1), 76–83. https://doi.org/10.1002/bin.1691
Shuler, N., & Carroll, R. A. (2019). Training Supervisors to Provide Performance Feedback Using Video Modeling with Voiceover Instructions. Behavior Analysis in Practice, 12(3), 576–591. https://doi.org/10.1007/s40617-018-00314-5
Taber, T. A., Lambright, N., & Luiselli, J. K. (2017). Video Modeling Training Effects on Types of Attention Delivered by Educational Care-Providers. Behavior Analysis in Practice, 10(2), 189–194. https://doi.org/10.1007/s40617-017-0182-z
Thomas, E. M., DeBar, R. M., Vladescu, J. C., & Townsend, D. B. (2020). A Comparison of Video Modeling and Video Prompting by Adolescents with ASD. Behavior Analysis in Practice, 13(1), 40–52. https://doi.org/10.1007/s40617-019-00402-0
Togashi, K. (2025). Training in trial-based functional analysis via computer-based instruction and behavioral skills training. Behavior Analysis in Practice. https://doi.org/10.1007/s40617-025-01136-y
Vargas Londono, F., Falcomata, T. S., Lim, N., Ramirez-Cristoforo, A., Paez, Y., & Garza, A. (2024). Do cultural adaptations matter? Comparing caregiver training in different language for Latino caregivers of autistic children: A telehealth-based evaluation. Behavior Analysis in Practice, 17, 1113–1133. https://doi.org/10.1007/s40617-024-00930-4
Vascelli, L., & Berardo, F. (2022). Video Self-Modeling for a Student with Dravet Syndrome: An Intervention Involving Parents during COVID-19 Pandemic in Italy. Education and Treatment of Children, 45(1), 129–133. https://doi.org/10.1007/s43494-021-00063-1
Yarzebski, V., & Dickson, C. (2024). Teaching caregivers to use graduated guidance using video modeling. Behavior Analysis in Practice, 17, 1198–1203. https://doi.org/10.1007/s40617-024-00969-3