Every AI storyboard tool faces the same problem: generate panel 1 and panel 2 separately, and the character's hair changes, the room rearranges itself, and a prop that was stolen three scenes ago reappears on the table. This is the visual continuity problem, and a recent paper from the University of Maryland and Google — called CANVAS — offers the most rigorous solution we have seen to date.
This article breaks down what CANVAS does, why its approach matters, and what practical lessons storyboard tool builders (including us) can take from it.

The Three Ways AI Storyboards Break
Before looking at solutions, it helps to name the specific failure modes. They are not all the same problem, even though they look similar.

1. Character Drift
The character's face, hair, clothing, or body changes between panels. This happens because each image is generated independently — the model has no memory of what the character looked like in the previous panel. "A man with brown curly hair and a blue jacket" generates a slightly different person every time.
2. Scene Reconstruction Failure
When the story returns to a location it showed earlier (say, the museum gallery in shot 1 appears again in shot 5), the AI rebuilds the room from scratch. The brick pattern changes, the furniture moves, the windows are in different positions. The audience loses spatial orientation.
3. Prop State Errors
A golden artifact sits in a glass case in shot 1. It gets stolen in shot 4. But in shot 5, when the detective returns to the gallery, the artifact is back in the case, because the AI does not track that the object's state changed. This is the subtlest failure, and the hardest for most systems to catch.
What CANVAS Does Differently
CANVAS (Mondal et al., 2026) introduces three interlocking mechanisms that address each failure mode. The key insight is that these are not three separate problems — they are all manifestations of missing world-state tracking.
Global Story Planning: State Tables for Everything
Before generating any image, CANVAS builds a global continuity plan. This plan defines three formal mappings:
```
C(character, shot) → appearance_state
L(shot)           → location_identity
O(object, shot)   → object_state
```
In plain language:
- Character mapping tracks what each character looks like at each point in the story. If Lena wears a blue dress in shots 1-2, switches to a security jacket in shots 3-4, and changes back in shot 5, the plan encodes this explicitly.
- Location clustering groups shots by physical location. Shots 1, 2, 4, and 5 all happen in the museum gallery, even though shot 3 interrupts the sequence in a different room. The system knows they are the same place.
- Object state tracking records how props evolve. The golden artifact transitions from "present" (shots 1-3) to "stolen" (shot 4) to "absent" (shot 5). This prevents ghost objects.

The plan is generated by an LLM processing the full shot descriptions before any image generation begins. It is essentially a structured checklist that says "in this shot, this character should look like this, this room should look like that, and this object should be in this state."
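One way to picture this plan is as a small set of lookup tables keyed by entity and shot. The sketch below uses the article's running museum example; the class and field names are illustrative, not taken from the CANVAS paper.

```python
from dataclasses import dataclass, field

@dataclass
class ContinuityPlan:
    """Toy world-state plan built before any image is generated."""
    # C(character, shot) -> appearance_state
    character_state: dict = field(default_factory=dict)
    # L(shot) -> location_identity
    shot_location: dict = field(default_factory=dict)
    # O(object, shot) -> object_state
    object_state: dict = field(default_factory=dict)

plan = ContinuityPlan()

# Lena's costume changes across the five-shot story.
for shot in (1, 2):
    plan.character_state[("lena", shot)] = "blue_dress"
for shot in (3, 4):
    plan.character_state[("lena", shot)] = "security_jacket"
plan.character_state[("lena", 5)] = "blue_dress"

# Shots 1, 2, 4, 5 share one location; shot 3 happens elsewhere.
plan.shot_location = {1: "gallery", 2: "gallery", 3: "vault",
                      4: "gallery", 5: "gallery"}

# The artifact is present, then stolen, then absent.
plan.object_state = {("artifact", s): st for s, st in
                     [(1, "present"), (2, "present"), (3, "present"),
                      (4, "stolen"), (5, "absent")]}
```

A downstream generator can then ask the plan a concrete question ("what should Lena wear in shot 5?") instead of re-deriving it from free text.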
Visual Memory Bank: Anchors That Persist
Planning alone is not enough — the image generator needs visual references, not just text descriptions. CANVAS maintains a persistent visual memory with three components:

- Character anchors (Mc): A reference image for each character in each appearance state. When Lena is in her blue dress, the system stores one clear image of her. When she changes to the security jacket, a new anchor is stored.
- Location anchors (Ml): A background reference for each physical location. The museum gallery gets one anchor image. When the story returns to this location, the system retrieves this anchor instead of generating the room from scratch.
- Frame history (Mf): Previously generated frames, used to maintain spatial continuity in consecutive shots within the same scene.
The retrieval strategy depends on the relationship between the current shot and what came before:
| Situation | What Gets Retrieved | Why |
|---|---|---|
| Same scene, next shot | Previous frame + character anchors | Preserves spatial layout and camera continuity |
| Returning to a location after a gap | Location background anchor + character anchors | Rebuilds the room from stored reference, not from scratch |
| Entirely new location | Character anchors only | No location reference exists yet — one will be created |
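The three retrieval rules in the table reduce to a short branch. The function below is a sketch with hypothetical names and string paths standing in for actual images; it is not the paper's code.

```python
def select_references(location, prev_location, location_anchors,
                      character_anchors, frame_history, characters):
    """Choose visual references for the next shot, per the table above."""
    # Character anchors are retrieved in every situation.
    refs = [character_anchors[c] for c in characters]
    if location == prev_location and frame_history:
        # Same scene, next shot: previous frame preserves spatial layout.
        refs.append(frame_history[-1])
    elif location in location_anchors:
        # Returning after a gap: rebuild the room from the stored anchor.
        refs.append(location_anchors[location])
    # Entirely new location: no location reference yet; one will be
    # created from the generated frame afterwards.
    return refs

anchors_c = {"lena": "lena_blue_dress.png"}     # hypothetical anchor images
anchors_l = {"gallery": "gallery_bg.png"}
history = ["frame_03.png"]

# Shot returns to the gallery after a detour elsewhere.
refs = select_references("gallery", "vault", anchors_l, anchors_c,
                         history, ["lena"])
```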
This is the part that makes the biggest measurable difference. In CANVAS's ablation experiments, removing background reuse caused a 57% drop in background consistency. Removing character anchors caused a 39% drop in facial consistency.
Scene Revisit: Why "Going Back to the Same Room" Is Hard
This deserves special attention because it is the failure mode that most tools ignore entirely.

When a story cuts away from a location and returns later, standard AI generators have no memory of what the room looked like. They generate a new room that roughly matches the text description but differs in every specific detail — wall color, furniture placement, window positions.
CANVAS solves this with location clustering. The system pre-groups shots by location identity, so when shot 5 returns to the museum gallery, it knows to retrieve the gallery's stored background anchor instead of generating a new room. The result: architectural details stay consistent even after multiple intervening scenes.
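Once location labels exist, the grouping itself is simple, and revisit detection falls out of it: a shot is a revisit when the same location appeared earlier but not in the immediately preceding shot. A minimal sketch, assuming the labels have already been resolved (CANVAS derives them with an LLM during global planning):

```python
from collections import defaultdict

def cluster_shots(shot_locations):
    """Group shot ids by location identity so revisits reuse one anchor."""
    clusters = defaultdict(list)
    for shot_id in sorted(shot_locations):
        clusters[shot_locations[shot_id]].append(shot_id)
    return dict(clusters)

def is_revisit(shot_id, loc, clusters):
    """True when the location appeared before, but not in the previous shot."""
    earlier = [s for s in clusters.get(loc, []) if s < shot_id]
    return bool(earlier) and (shot_id - 1) not in earlier

clusters = cluster_shots({1: "gallery", 2: "gallery", 3: "vault",
                          4: "gallery", 5: "gallery"})
# Shot 4 returns to the gallery after the vault interlude: a revisit.
# Shot 5 continues directly from shot 4: not a revisit.
```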
In the ablation experiments, removing location grouping reduced background consistency from 4.94 to 2.45 (a 50% drop) for consecutive shots, and from 4.73 to 3.34 for non-consecutive scene reappearances. This was the single highest-impact component in the entire system.
Candidate Selection: Generate Multiple, Pick the Best
For each shot, CANVAS generates multiple candidate frames (typically 3) and scores them against the global plan using a VLM-based evaluator. The scoring checks:
- Does the character's appearance match the plan's expected state?
- Does the background match the location assignment?
- Are props in the correct state?
- Does the image align with the shot description?
The best-scoring candidate is selected. The ablation shows diminishing returns beyond N=3 candidates — character consistency improves meaningfully from N=1 to N=3, but backgrounds and props are already well-handled by the anchor system alone.
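Selection itself is a straightforward argmax over summed check scores. In the sketch below, `score(frame, check)` stands in for the VLM evaluator; here it just reads pre-computed per-check scores from toy dicts, since the point is the selection logic, not the scoring model.

```python
def pick_best(candidates, checks, score):
    """Return the candidate whose summed per-check score is highest."""
    return max(candidates,
               key=lambda frame: sum(score(frame, c) for c in checks))

checks = ["character_state", "location", "prop_state", "shot_description"]

# Toy candidates: dicts of hypothetical per-check scores on a 1-5 scale.
frames = [
    {"character_state": 5, "location": 3, "prop_state": 5, "shot_description": 4},
    {"character_state": 5, "location": 5, "prop_state": 5, "shot_description": 5},
    {"character_state": 2, "location": 5, "prop_state": 4, "shot_description": 5},
]
best = pick_best(frames, checks, lambda f, c: f[c])  # second frame wins
```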
Character Consistency: The Anchor Is Everything
For character identity, the highest-impact component is the canonical character anchor: a reference image that defines what each character looks like. Without it, all three character consistency metrics (face, clothing, body) drop by approximately 39%.


The two panels above illustrate the goal: the same character in different compositions and angles, but unmistakably the same person. The short black hair, round glasses, white lab coat, and ID badge all persist. The laboratory environment with its blue holographic displays stays consistent in the background.
This aligns with what practitioners already know — reference images matter more than prompt descriptions for maintaining identity. But CANVAS adds an important nuance: the anchor should be updated over time. After each frame is generated, the system extracts updated character and location anchors from the result. This means the visual memory evolves as the story progresses, capturing style drift that accumulates during generation.
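The update step can be sketched as a small function run after each generation. Here `extract(frame, entity)` is a stand-in for whatever crop/segmentation step produces an updated reference image from the result; all names are illustrative.

```python
def update_memory(memory, frame, shot_characters, location, extract):
    """Refresh the visual memory bank from a newly generated frame."""
    memory["frames"].append(frame)                                   # Mf
    for character in shot_characters:
        memory["characters"][character] = extract(frame, character)  # Mc
    memory["locations"][location] = extract(frame, "background")     # Ml
    return memory

memory = {"frames": [], "characters": {}, "locations": {}}
extract = lambda frame, entity: f"{frame}:{entity}"   # dummy extractor
update_memory(memory, "shot_01.png", ["lena"], "gallery", extract)
```

Because anchors are overwritten with freshly extracted references, later shots inherit any stylistic drift instead of being pulled back toward a stale first impression.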
The Numbers: What the Experiments Show
CANVAS was evaluated on three benchmarks using ContinuityEval, a VLM-based scoring framework (Likert 1-5 scale, κ=0.74 agreement with human judges):
| Metric | CANVAS | Best Baseline | Improvement |
|---|---|---|---|
| Background (non-consecutive) | 4.83 | 4.00 | +21% |
| Background (consecutive) | 4.83 | 4.56 | +6% |
| Prop consistency | 4.88 | 4.56 | +7% |
| Character average | 4.30 | 4.12 | +4% |
On HardContinuityBench (designed specifically for difficult continuity scenarios like long-range scene reappearances and frequent costume changes), the improvements are larger: +27% for non-consecutive backgrounds, +15% for character consistency.
Human preference study (240 pairwise comparisons, 15 annotators): CANVAS achieves a 90.9% win rate for background consistency and an 86.7% win rate for character consistency.
What Matters Most: Ablation Rankings
The ablation experiments reveal which components contribute the most:
| Component Removed | Impact |
|---|---|
| Location grouping + background reuse | -50 to -57% background consistency |
| Canonical character anchors | -39% character consistency |
| Prop state updates | -20% prop consistency |
| Character memory updates | -2 to -7% character consistency |
| Candidate selection (QA scoring) | Moderate improvement mainly for characters |
The takeaway is clear: the anchor system is more important than the scoring system. Getting the right reference images into the generation process matters more than selecting among multiple candidates after the fact.
What This Means for AI Storyboard Tools
CANVAS is a research paper, not a product. It runs as a fully automated pipeline with no human interaction. But the techniques it validates have direct implications for interactive storyboard tools.
1. Props Need State Timelines, Not Just Existence Tracking
Most storyboard tools track which props appear in which shots. That is not enough. A prop can be "present," "held," "damaged," "stolen," or "destroyed" — and the visual representation must change accordingly. Tracking state transitions prevents the most jarring continuity errors.
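A state timeline can be linted mechanically once legal transitions are declared. The transition table below is illustrative (a real tool would define its own); the validator flags exactly the ghost-artifact bug described earlier.

```python
# Illustrative transition rules for prop states.
ALLOWED = {
    "present":   {"held", "stolen", "damaged", "absent"},
    "held":      {"present", "stolen", "damaged"},
    "stolen":    {"absent"},
    "damaged":   {"destroyed", "present"},
    "absent":    set(),
    "destroyed": set(),
}

def validate_timeline(timeline):
    """Return (shot, transition) pairs where a prop's state jumps illegally.

    timeline is a list of (shot_id, state) pairs in story order.
    """
    errors = []
    for (_, prev), (shot, curr) in zip(timeline, timeline[1:]):
        if curr != prev and curr not in ALLOWED[prev]:
            errors.append((shot, f"{prev} -> {curr}"))
    return errors

# The ghost-artifact bug: stolen in one shot, silently back later.
errors = validate_timeline([(1, "present"), (3, "stolen"), (5, "present")])
```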
2. Scene Revisit Detection Should Be Automatic
When a user designs a storyboard that returns to a previously shown location, the tool should automatically detect this and pull in the visual reference from the earlier appearance. This is the highest-impact single feature based on CANVAS's ablation data.
3. Characters Have Multiple Appearances
A character is not one static reference image. They change clothes, get injured, age across the story. The system needs to support "appearance states" — the same character with different visual configurations — and know which state applies at each point in the narrative.
4. Visual Consistency Needs Quantitative Evaluation
Checking storyboard consistency by eye is unreliable and slow. CANVAS demonstrates that VLM-based evaluation (scoring face/clothing/body consistency on a 1-5 scale) correlates strongly with human judgment. This can be automated as a quality check before presenting results to the creator.
5. The Human-in-the-Loop Advantage
CANVAS's own limitations section notes that the system "assumes reliable narrative entity extraction," which is "challenging for ambiguous or highly abstract stories." Their future work section explicitly calls for "interactive human-in-the-loop visual storytelling."
This is the gap between a research pipeline and a creative tool. A fully automated system must guess what the director intends. An interactive system can ask. When the director says "I want the room to feel different when the character returns — colder, more empty," a dialogue-driven tool can distinguish intentional changes from continuity errors. An automated pipeline cannot.
Limitations of the CANVAS Approach
The paper is transparent about its constraints:
- Computational cost: Generating 3 candidates per shot and evaluating each with a VLM means roughly 1 + C + S(1 + 3K) API calls per story (C = characters, S = shots, K = candidates). For a 20-shot story with 3 characters and K = 3, that is 1 + 3 + 20 × (1 + 9) = 204 API calls.
- No motion modeling: The system handles static frames, not motion dynamics or complex physical interactions between characters.
- Small benchmark: HardContinuityBench contains only 5 stories. The authors plan to expand it.
References
Mondal, I., Song, Y., Parmar, M., Goyal, P., Boyd-Graber, J., Pfister, T., & Song, Y. (2026). CANVAS: Continuity-Aware Narratives via Visual Agentic Storyboarding. arXiv preprint arXiv:2604.13452. https://arxiv.org/abs/2604.13452
All illustrations in this article were generated using Story2Board's image generation models.