Every AI storyboard tool faces the same problem: generate panel 1 and panel 2 separately, and the character's hair changes, the room rearranges itself, and a prop that was stolen three scenes ago reappears on the table. This is the visual continuity problem, and a recent paper from the University of Maryland and Google — called CANVAS — offers the most rigorous solution we have seen to date.
This article breaks down what CANVAS does, why its approach matters, and what practical lessons storyboard tool builders (including us) can take from it.

The Three Ways AI Storyboards Break
Before looking at solutions, it helps to name the specific failure modes. They are not all the same problem, even though they look similar.

1. Character Drift
The character's face, hair, clothing, or body changes between panels. This happens because each image is generated independently — the model has no memory of what the character looked like in the previous panel. "A man with brown curly hair and a blue jacket" generates a slightly different person every time.
2. Scene Reconstruction Failure
When the story returns to a location it showed earlier (say, the museum gallery in shot 1 appears again in shot 5), the AI rebuilds the room from scratch. The brick pattern changes, the furniture moves, the windows are in different positions. The audience loses spatial orientation.
3. Prop State Errors
A golden artifact sits in a glass case in shot 1. It gets stolen in shot 4. But in shot 5, when the detective returns to the gallery, the artifact is back in the case, because the AI does not track that the object's state changed. This is the subtlest failure, and the hardest for most systems to catch.
What CANVAS Does Differently
CANVAS (Mondal et al., 2026) introduces three interlocking mechanisms that address each failure mode. The key insight is that these are not three separate problems — they are all manifestations of missing world-state tracking.
Global Story Planning: State Tables for Everything
Before generating any image, CANVAS builds a global continuity plan. This plan defines three formal mappings:
```
C(character, shot) → appearance_state
L(shot)           → location_identity
O(object, shot)   → object_state
```
In plain language:
- Character mapping tracks what each character looks like at each point in the story. If Lena wears a blue dress in shots 1-2, switches to a security jacket in shots 3-4, and changes back in shot 5, the plan encodes this explicitly.
- Location clustering groups shots by physical location. Shots 1, 2, 4, and 5 all happen in the museum gallery, even though shot 3 interrupts the sequence in a different room. The system knows they are the same place.
- Object state tracking records how props evolve. The golden artifact transitions from "present" (shots 1-3) to "stolen" (shot 4) to "absent" (shot 5). This prevents ghost objects.

The plan is generated by an LLM processing the full shot descriptions before any image generation begins. It is essentially a structured checklist that says "in this shot, this character should look like this, this room should look like that, and this object should be in this state."
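One way to picture this plan is as a small set of lookup tables keyed by entity and shot. The sketch below uses the article's running museum example; the class and field names are illustrative, not taken from the CANVAS paper.

```python
from dataclasses import dataclass, field

@dataclass
class ContinuityPlan:
    """Toy world-state plan built before any image is generated."""
    # C(character, shot) -> appearance_state
    character_state: dict = field(default_factory=dict)
    # L(shot) -> location_identity
    shot_location: dict = field(default_factory=dict)
    # O(object, shot) -> object_state
    object_state: dict = field(default_factory=dict)

plan = ContinuityPlan()

# Lena's costume changes across the five-shot story.
for shot in (1, 2):
    plan.character_state[("lena", shot)] = "blue_dress"
for shot in (3, 4):
    plan.character_state[("lena", shot)] = "security_jacket"
plan.character_state[("lena", 5)] = "blue_dress"

# Shots 1, 2, 4, 5 share one location; shot 3 happens elsewhere.
plan.shot_location = {1: "gallery", 2: "gallery", 3: "vault",
                      4: "gallery", 5: "gallery"}

# The artifact is present, then stolen, then absent.
plan.object_state = {("artifact", s): st for s, st in
                     [(1, "present"), (2, "present"), (3, "present"),
                      (4, "stolen"), (5, "absent")]}
```

A downstream generator can then ask the plan a concrete question ("what should Lena wear in shot 5?") instead of re-deriving it from free text.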
Visual Memory Bank: Anchors That Persist
Planning alone is not enough — the image generator needs visual references, not just text descriptions. CANVAS maintains a persistent visual memory with three components:

- Character anchors (Mc): A reference image for each character in each appearance state. When Lena is in her blue dress, the system stores one clear image of her. When she changes to the security jacket, a new anchor is stored.
- Location anchors (Ml): A background reference for each physical location. The museum gallery gets one anchor image. When the story returns to this location, the system retrieves this anchor instead of generating the room from scratch.
- Frame history (Mf): Previously generated frames, used to maintain spatial continuity in consecutive shots within the same scene.
The retrieval strategy depends on the relationship between the current shot and what came before:
| Situation | What Gets Retrieved | Why |
|---|---|---|
| Same scene, next shot | Previous frame + character anchors | Preserves spatial layout and camera continuity |
| Returning to a location after a gap | Location background anchor + character anchors | Rebuilds the room from stored reference, not from scratch |
| Entirely new location | Character anchors only | No location reference exists yet — one will be created |
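The three retrieval rules in the table reduce to a short branch. The function below is a sketch with hypothetical names and string paths standing in for actual images; it is not the paper's code.

```python
def select_references(location, prev_location, location_anchors,
                      character_anchors, frame_history, characters):
    """Choose visual references for the next shot, per the table above."""
    # Character anchors are retrieved in every situation.
    refs = [character_anchors[c] for c in characters]
    if location == prev_location and frame_history:
        # Same scene, next shot: previous frame preserves spatial layout.
        refs.append(frame_history[-1])
    elif location in location_anchors:
        # Returning after a gap: rebuild the room from the stored anchor.
        refs.append(location_anchors[location])
    # Entirely new location: no location reference yet; one will be
    # created from the generated frame afterwards.
    return refs

anchors_c = {"lena": "lena_blue_dress.png"}     # hypothetical anchor images
anchors_l = {"gallery": "gallery_bg.png"}
history = ["frame_03.png"]

# Shot returns to the gallery after a detour elsewhere.
refs = select_references("gallery", "vault", anchors_l, anchors_c,
                         history, ["lena"])
```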
This is the part that makes the biggest measurable difference. In CANVAS's ablation experiments, removing background reuse caused a 57% drop in background consistency. Removing character anchors caused a 39% drop in facial consistency.
Scene Revisit: Why "Going Back to the Same Room" Is Hard
This deserves special attention because it is the failure mode that most tools ignore entirely.

When a story cuts away from a location and returns later, standard AI generators have no memory of what the room looked like. They generate a new room that roughly matches the text description but differs in every specific detail — wall color, furniture placement, window positions.
CANVAS solves this with location clustering. The system pre-groups shots by location identity, so when shot 5 returns to the museum gallery, it knows to retrieve the gallery's stored background anchor instead of generating a new room. The result: architectural details stay consistent even after multiple intervening scenes.
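Once location labels exist, the grouping itself is simple, and revisit detection falls out of it: a shot is a revisit when the same location appeared earlier but not in the immediately preceding shot. A minimal sketch, assuming the labels have already been resolved (CANVAS derives them with an LLM during global planning):

```python
from collections import defaultdict

def cluster_shots(shot_locations):
    """Group shot ids by location identity so revisits reuse one anchor."""
    clusters = defaultdict(list)
    for shot_id in sorted(shot_locations):
        clusters[shot_locations[shot_id]].append(shot_id)
    return dict(clusters)

def is_revisit(shot_id, loc, clusters):
    """True when the location appeared before, but not in the previous shot."""
    earlier = [s for s in clusters.get(loc, []) if s < shot_id]
    return bool(earlier) and (shot_id - 1) not in earlier

clusters = cluster_shots({1: "gallery", 2: "gallery", 3: "vault",
                          4: "gallery", 5: "gallery"})
# Shot 4 returns to the gallery after the vault interlude: a revisit.
# Shot 5 continues directly from shot 4: not a revisit.
```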
In the ablation experiments, removing location grouping reduced background consistency from 4.94 to 2.45 (a 50% drop) for consecutive shots, and from 4.73 to 3.34 for non-consecutive scene reappearances. This was the single highest-impact component in the entire system.
Candidate Selection: Generate Multiple, Pick the Best
For each shot, CANVAS generates multiple candidate frames (typically 3) and scores them against the global plan using a VLM-based evaluator. The scoring checks:
- Does the character's appearance match the plan's expected state?
- Does the background match the location assignment?
- Are props in the correct state?
- Does the image align with the shot description?
The best-scoring candidate is selected. The ablation shows diminishing returns beyond N=3 candidates — character consistency improves meaningfully from N=1 to N=3, but backgrounds and props are already well-handled by the anchor system alone.
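Selection itself is a straightforward argmax over summed check scores. In the sketch below, `score(frame, check)` stands in for the VLM evaluator; here it just reads pre-computed per-check scores from toy dicts, since the point is the selection logic, not the scoring model.

```python
def pick_best(candidates, checks, score):
    """Return the candidate whose summed per-check score is highest."""
    return max(candidates,
               key=lambda frame: sum(score(frame, c) for c in checks))

checks = ["character_state", "location", "prop_state", "shot_description"]

# Toy candidates: dicts of hypothetical per-check scores on a 1-5 scale.
frames = [
    {"character_state": 5, "location": 3, "prop_state": 5, "shot_description": 4},
    {"character_state": 5, "location": 5, "prop_state": 5, "shot_description": 5},
    {"character_state": 2, "location": 5, "prop_state": 4, "shot_description": 5},
]
best = pick_best(frames, checks, lambda f, c: f[c])  # second frame wins
```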
Character Consistency: The Anchor Is Everything
For character identity, the highest-impact component is the canonical character anchor: a reference image that defines what each character looks like. Without it, all three character consistency metrics (face, clothing, body) drop by approximately 39%.


The two panels above illustrate the goal: the same character in different compositions and angles, but unmistakably the same person. The short black hair, round glasses, white lab coat, and ID badge all persist. The laboratory environment with its blue holographic displays stays consistent in the background.
This aligns with what practitioners already know — reference images matter more than prompt descriptions for maintaining identity. But CANVAS adds an important nuance: the anchor should be updated over time. After each frame is generated, the system extracts updated character and location anchors from the result. This means the visual memory evolves as the story progresses, capturing style drift that accumulates during generation.
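The update step can be sketched as a small function run after each generation. Here `extract(frame, entity)` is a stand-in for whatever crop/segmentation step produces an updated reference image from the result; all names are illustrative.

```python
def update_memory(memory, frame, shot_characters, location, extract):
    """Refresh the visual memory bank from a newly generated frame."""
    memory["frames"].append(frame)                                   # Mf
    for character in shot_characters:
        memory["characters"][character] = extract(frame, character)  # Mc
    memory["locations"][location] = extract(frame, "background")     # Ml
    return memory

memory = {"frames": [], "characters": {}, "locations": {}}
extract = lambda frame, entity: f"{frame}:{entity}"   # dummy extractor
update_memory(memory, "shot_01.png", ["lena"], "gallery", extract)
```

Because anchors are overwritten with freshly extracted references, later shots inherit any stylistic drift instead of being pulled back toward a stale first impression.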
The Numbers: What the Experiments Show
CANVAS was evaluated on three benchmarks using ContinuityEval, a VLM-based scoring framework (Likert 1-5 scale, κ=0.74 agreement with human judges):
| Metric | CANVAS | Best Baseline | Improvement |
|---|---|---|---|
| Background (non-consecutive) | 4.83 | 4.00 | +21% |
| Background (consecutive) | 4.83 | 4.56 | +6% |
| Prop consistency | 4.88 | 4.56 | +7% |
| Character average | 4.30 | 4.12 | +4% |
On HardContinuityBench (designed specifically for difficult continuity scenarios like long-range scene reappearances and frequent costume changes), the improvements are larger: +27% for non-consecutive backgrounds, +15% for character consistency.
Human preference study (240 pairwise comparisons, 15 annotators): CANVAS achieves a 90.9% win rate for background consistency and an 86.7% win rate for character consistency.
What Matters Most: Ablation Rankings
The ablation experiments reveal which components contribute the most:
| Component Removed | Impact |
|---|---|
| Location grouping + background reuse | -50 to -57% background consistency |
| Canonical character anchors | -39% character consistency |
| Prop state updates | -20% prop consistency |
| Character memory updates | -2 to -7% character consistency |
| Candidate selection (QA scoring) | Moderate improvement mainly for characters |
The takeaway is clear: the anchor system is more important than the scoring system. Getting the right reference images into the generation process matters more than selecting among multiple candidates after the fact.
What This Means for AI Storyboard Tools
CANVAS is a research paper, not a product. It runs as a fully automated pipeline with no human interaction. But the techniques it validates have direct implications for interactive storyboard tools.
1. Props Need State Timelines, Not Just Existence Tracking
Most storyboard tools track which props appear in which shots. That is not enough. A prop can be "present," "held," "damaged," "stolen," or "destroyed" — and the visual representation must change accordingly. Tracking state transitions prevents the most jarring continuity errors.
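A state timeline can be linted mechanically once legal transitions are declared. The transition table below is illustrative (a real tool would define its own); the validator flags exactly the ghost-artifact bug described earlier.

```python
# Illustrative transition rules for prop states.
ALLOWED = {
    "present":   {"held", "stolen", "damaged", "absent"},
    "held":      {"present", "stolen", "damaged"},
    "stolen":    {"absent"},
    "damaged":   {"destroyed", "present"},
    "absent":    set(),
    "destroyed": set(),
}

def validate_timeline(timeline):
    """Return (shot, transition) pairs where a prop's state jumps illegally.

    timeline is a list of (shot_id, state) pairs in story order.
    """
    errors = []
    for (_, prev), (shot, curr) in zip(timeline, timeline[1:]):
        if curr != prev and curr not in ALLOWED[prev]:
            errors.append((shot, f"{prev} -> {curr}"))
    return errors

# The ghost-artifact bug: stolen in one shot, silently back later.
errors = validate_timeline([(1, "present"), (3, "stolen"), (5, "present")])
```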
2. Scene Revisit Detection Should Be Automatic
When a user designs a storyboard that returns to a previously shown location, the tool should automatically detect this and pull in the visual reference from the earlier appearance. This is the highest-impact single feature based on CANVAS's ablation data.
3. Characters Have Multiple Appearances
A character is not one static reference image. They change clothes, get injured, age across the story. The system needs to support "appearance states" — the same character with different visual configurations — and know which state applies at each point in the narrative.
4. Visual Consistency Needs Quantitative Evaluation
Checking storyboard consistency by eye is unreliable and slow. CANVAS demonstrates that VLM-based evaluation (scoring face/clothing/body consistency on a 1-5 scale) correlates strongly with human judgment. This can be automated as a quality check before presenting results to the creator.
5. The Human-in-the-Loop Advantage
CANVAS's own limitations section notes that the system "assumes reliable narrative entity extraction," which is "challenging for ambiguous or highly abstract stories." Their future work section explicitly calls for "interactive human-in-the-loop visual storytelling."
This is the gap between a research pipeline and a creative tool. A fully automated system must guess what the director intends. An interactive system can ask. When the director says "I want the room to feel different when the character returns — colder, more empty," a dialogue-driven tool can distinguish intentional changes from continuity errors. An automated pipeline cannot.
Limitations of the CANVAS Approach
The paper is transparent about its constraints:
- Computational cost: Generating 3 candidates per shot and evaluating each with a VLM means roughly 1 + C + S(1 + 3K) API calls per story (C = characters, S = shots, K = candidates). For a 20-shot story with 3 characters and K = 3, that is 1 + 3 + 20 × (1 + 9) = 204 API calls.
- No motion modeling: The system handles static frames, not motion dynamics or complex physical interactions between characters.
- Small benchmark: HardContinuityBench contains only 5 stories. The authors plan to expand it.
References
Mondal, I., Song, Y., Parmar, M., Goyal, P., Boyd-Graber, J., Pfister, T., & Song, Y. (2026). CANVAS: Continuity-Aware Narratives via Visual Agentic Storyboarding. arXiv preprint arXiv:2604.13452. https://arxiv.org/abs/2604.13452
All illustrations in this article were generated using Story2Board's image generation models.