
Story2Board: How Attention Manipulation Keeps Characters Consistent Across Storyboard Panels

Genkee Team · 11 min read

Last month we generated a 7-panel storyboard for a desert adventure story. The nomad character had red robes, a lantern, and a distinctive scar across his left cheek. By panel 4, the robes were blue. By panel 6, the scar had switched sides. Panel 7 looked like a completely different person.

We've been building a storyboard tool for over a year now, and this problem — the same character looking different in every panel — is the one that keeps us up at night. So when a paper from Hebrew University crossed our desk claiming to fix it without any training or fine-tuning, we were skeptical. Then we read the actual method. It's clever enough that we think it deserves a proper breakdown.

The paper is called Story2Board (Dinkevich et al., 2025).

[Illustration: a cinematic 4-panel storyboard showing the same red-haired girl character in four different dramatic scenes — standing on a cliff, exploring a cave, running through rain, and reading by candlelight — with consistent appearance across all panels.]

Why characters drift

Every diffusion model starts from pure noise and refines it into an image. Each generation is independent. There's no memory of what came before. When you ask for "a red-haired girl on a cliff" and then "the same red-haired girl in a cave," the model doesn't know what "same" means. It interprets the text from scratch and produces a different person who happens to match the description.

[Illustration: a split comparison of inconsistent versus consistent character appearance across panels — the same boy in different scenes, with and without consistency mechanisms.]

The existing solutions all have tradeoffs we don't love:

Fine-tuning (LoRA, DreamBooth) works well but you need training data and GPU time for every new character. That's fine for a hero character you'll use in 50 panels. It's overkill for a side character who appears in three.

Reference image conditioning (IP-Adapter, OmniControl) feeds a reference photo as input. The problem is that the model tends to over-copy the reference — you get the same character, but also the same pose, the same framing, sometimes even the same background. The panels end up looking like minor variations of one image.

Shared self-attention (StoryDiffusion) modifies the attention mechanism so panels share features. It helps, but it can collapse diversity. Every panel starts looking the same in ways beyond just the character.

Story2Board does something different: it reaches into the model's attention layers during inference and manipulates them. No weight changes. No architecture changes. Just inference-time intervention.

The two-panel trick (Latent Panel Anchoring)

The core idea is almost absurdly simple. Instead of generating each panel independently, Story2Board generates two-panel images. The top half is always the same reference panel — the character in a neutral pose. The bottom half is the actual scene.

[Illustration: how Latent Panel Anchoring works — a reference panel at the top with a knight character, connected by arrows to three scene panels below showing the same knight in different environments.]

During denoising, at every step, the top panel's latent representation gets reset to the cached reference latent. Think of it as a reset pulse: the reference stays pinned while the scene in the bottom panel evolves freely — background, composition, lighting all change — and, because self-attention spans both panels, the character features keep getting pulled back toward the reference. Like an elastic band attached to the character's face.

For each denoising step t:
  1. Reset the top (reference) panel's latent to the cached reference latent
  2. Run the denoising step (self-attention spans both panels)
  3. Continue to the next step

After denoising finishes, you crop out the bottom panel. That's your storyboard frame.
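The loop above can be sketched in a few lines of numpy. This is a toy sketch, not Flux code: `denoise_step` is a stand-in for the real denoiser, and `ref_latents` is assumed to hold cached per-step latents from a reference-only generation.

```python
import numpy as np

def denoise_step(latent: np.ndarray, t: int) -> np.ndarray:
    """Stand-in for one diffusion denoising step (hypothetical)."""
    rng = np.random.default_rng(t)
    return latent * 0.9 + rng.normal(0, 0.01, latent.shape)

def generate_with_anchoring(ref_latents: list, steps: int, h: int, w: int) -> np.ndarray:
    """Latent Panel Anchoring sketch: the top half of a two-panel latent is
    overwritten with the cached reference latent at every step, so only the
    bottom (scene) half evolves freely."""
    rng = np.random.default_rng(0)
    latent = rng.normal(0, 1, (2 * h, w))   # two stacked panels
    for t in range(steps):
        latent[:h] = ref_latents[t]         # re-anchor the reference panel
        latent = denoise_step(latent, t)
    return latent[h:]                       # crop out the bottom panel
```

The only Story2Board-specific lines are the re-anchoring assignment and the final crop; everything else is a vanilla denoising loop.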

It's the kind of trick that makes you think "why didn't anyone try this sooner?" And to be fair, it doesn't fully work on its own. The broad strokes are right — face shape, hair color, body proportions — but fine details like clothing texture, accessories, and the exact shade of the character's jacket can still drift between panels.

Attention mixing for the fine details (RAVM)

This is where Story2Board gets more interesting. Reciprocal Attention Value Mixing (RAVM) finds pairs of tokens that "look at each other" across panels, and blends their features.

[Illustration: attention-based feature mixing between two panels of a wizard character — threads connecting matching regions, face-to-face, hat-to-hat, and robe-to-robe.]

The logic goes like this. In a transformer-based diffusion model, each token attends to every other token. Some of those attention connections cross the boundary between the reference panel and the scene panel. RAVM looks for reciprocal pairs — cases where token A in the reference attends strongly to token B in the scene, AND token B attends strongly back to token A. These mutual connections tend to land on semantically matching regions: face-to-face, hat-to-hat, jacket-to-jacket.

For each reciprocal pair, RAVM blends the value vectors:

V'_v = λ · V_v + (1 − λ) · V_u

where v is a token in the scene panel and u is its reciprocal partner in the reference panel.

λ is set to 0.5 in the paper. The key design choice is that RAVM only touches values — not keys or queries. Keys and queries control where the model looks (spatial layout, composition). Values control what it sees (texture, color, appearance). By leaving keys and queries alone, RAVM preserves the scene's layout while transferring the character's visual features from the reference.
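As a sketch, here's what reciprocal-pair detection and value mixing might look like in numpy. The token layout and the top-k reciprocity test are our simplifications — the paper operates inside Flux's attention layers, not on a standalone matrix — but the value-only blend matches the formula above.

```python
import numpy as np

def ravm_mix(attn: np.ndarray, values: np.ndarray, n_ref: int,
             lam: float = 0.5, top_k: int = 2) -> np.ndarray:
    """Reciprocal Attention Value Mixing sketch.

    attn:   (N, N) row-stochastic attention map; tokens [0, n_ref) belong
            to the reference panel, tokens [n_ref, N) to the scene panel.
    values: (N, d) value vectors.

    For each scene token v whose strongest cross-panel target u also
    attends back to v among its top-k scene targets (a reciprocal pair),
    blend the value vectors: V'_v = lam * V_v + (1 - lam) * V_u.
    Keys and queries are untouched, so spatial layout is preserved.
    """
    mixed = values.copy()
    for v in range(n_ref, attn.shape[0]):
        u = int(np.argmax(attn[v, :n_ref]))              # strongest ref target
        back = np.argsort(attn[u, n_ref:])[::-1][:top_k] + n_ref
        if v in back:                                    # reciprocal pair
            mixed[v] = lam * values[v] + (1 - lam) * values[u]
    return mixed
```

Note that only rows of `values` belonging to scene tokens ever change; reference tokens pass through untouched.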

We think this is the most transferable idea in the paper. The two-panel trick (LPA) is specific to Flux's architecture. But the idea of finding reciprocal attention pairs and mixing their values could work in any transformer-based diffusion model. Whether API providers will ever expose that level of control is a different question.

A benchmark that actually measures what matters

Most storyboard benchmarks only measure one thing: does the character look the same across panels? But that's not enough. A model that copies the reference panel pixel-for-pixel would score perfectly on identity preservation — and produce a terrible storyboard. Real storyboards need the character in different poses, at different scales, in different environments.

Story2Board introduces the Rich Storyboard Benchmark: 100 stories, 7 panels each, generated by GPT-4o with instructions to emphasize visual diversity. And alongside it, a Scene Diversity metric that measures how much the character's position, size, and pose actually change across panels. It uses Grounding DINO for bounding box detection and ViTPose for pose estimation.
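To make the idea concrete, here's a toy stand-in for such a metric. The paper derives its score from Grounding DINO boxes and ViTPose keypoints; this sketch assumes per-panel character bounding boxes are already given and measures the spread of normalized position and size — the exact formula is our assumption, not the paper's.

```python
import numpy as np

def scene_diversity(boxes: list, img_w: int, img_h: int) -> float:
    """Toy scene-diversity score from character bounding boxes.

    boxes: one (x0, y0, x1, y1) box per panel. Returns the mean standard
    deviation of normalized center position and area across panels:
    0.0 when the character sits identically in every panel, larger when
    its placement and scale vary."""
    feats = []
    for x0, y0, x1, y1 in boxes:
        cx = (x0 + x1) / 2 / img_w                        # normalized center x
        cy = (y0 + y1) / 2 / img_h                        # normalized center y
        area = (x1 - x0) * (y1 - y0) / (img_w * img_h)    # normalized size
        feats.append([cx, cy, area])
    return float(np.mean(np.std(np.asarray(feats), axis=0)))
```

A pixel-for-pixel copy of the reference panel would score exactly zero here, which is the failure mode the real metric is designed to expose.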

[Illustration: a 6-panel storyboard grid showing the same fox character in dramatically different scenes and compositions — from extreme close-up to bird's-eye view, from waterfall to library — illustrating scene diversity with a stable character identity.]

This metric formalizes a tension we feel every day: consistency and diversity pull in opposite directions. Push too hard on consistency and your panels look like the same image with different backgrounds pasted in. Push too hard on diversity and your character turns into a different person every frame. The score you want isn't maximum consistency — it's the best ratio of consistency to diversity.

The numbers

Story2Board was tested against StoryDiffusion, IC-LoRA (two variants), OmniControl, and StoryGen. The results:

| Metric | Story2Board | Best baseline |
| --- | --- | --- |
| Character consistency (DreamSim) | 0.70 | 0.65 (Flux base) |
| Prompt alignment (VQAScore) | 0.55 | 0.60 (Flux base) |
| Scene diversity | 0.36 | 0.28 (OmniControl) |

One thing jumped out at us: vanilla Flux without any consistency mechanism scores 0.65 on character consistency. That's high. But it achieves this by generating nearly identical images — the characters look the same because everything looks the same. Story2Board hits 0.70 while maintaining much higher scene diversity (0.36 vs 0.28).

The Amazon Mechanical Turk study (500 comparisons, 3 raters each) tells a more nuanced story. Story2Board won on overall preference, but lost to OmniControl on background richness and to IC-LoRA on raw character consistency. Its advantage is holistic — the storyboards feel like coherent stories, even if individual dimensions aren't always the best.

Where it falls short

We want to be direct about the limitations, because some of them are deal-breakers depending on your use case.

It only handles one character. Every experiment in the paper has a single protagonist. Two characters who both need to stay consistent? Not addressed. This is a real problem — most stories have more than one person in them.

Flux only. The method relies on Flux's dual-stream transformer architecture. The authors don't test it on Stable Diffusion 3, DALL-E, or anything else. We can't use it with our current model stack (NanoBanana, Kling, Seedream) without significant adaptation work.

It amplifies bad attention patterns. This one's subtle but important. When Flux's internal attention already confuses two visual elements — say, a raccoon's tail and a fairy's wings getting entangled — RAVM makes it worse by propagating the confusion. It reinforces whatever patterns exist, good or bad.

No story logic. Story2Board doesn't track props, locations, or narrative state. If a character picks up a sword in panel 2, there's nothing ensuring the sword appears in panels 3 through 7. For that kind of continuity you need something like CANVAS, which operates at the story level rather than the pixel level.

Two different problems wearing the same hat

Here's something we've come to believe after working on this for a while: character consistency isn't one problem. It's two.

[Illustration: a layered architectural diagram showing the semantic layer (story logic, character cards, prop states) and the pixel layer (diffusion attention, latent spaces, feature mixing), with a bridge connecting the two.]

There's a semantic layer: knowing what should be true. Which characters are present, what they're wearing, where the props are, what the emotional beat is. CANVAS handles this with state tables. Our storyboard engine handles it with an LLM agent that plans scenes and tracks character states.
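A semantic-layer state table can be as simple as a few dictionaries. The sketch below is hypothetical — it is not CANVAS's actual schema or our production engine — but it shows the shape of the bookkeeping: record prop events per panel, then emit continuity hints for each panel's generation prompt.

```python
from dataclasses import dataclass, field

@dataclass
class StoryState:
    """Minimal semantic-layer state table (hypothetical schema):
    tracks which props each character currently holds, plus an event log."""
    props: dict = field(default_factory=dict)   # character -> set of props
    log: list = field(default_factory=list)     # (panel, character, event, prop)

    def pick_up(self, panel: int, character: str, prop: str) -> None:
        self.props.setdefault(character, set()).add(prop)
        self.log.append((panel, character, "picked up", prop))

    def drop(self, panel: int, character: str, prop: str) -> None:
        self.props.get(character, set()).discard(prop)
        self.log.append((panel, character, "dropped", prop))

    def prompt_hints(self, character: str) -> str:
        """Continuity hint to append to a panel's generation prompt."""
        held = sorted(self.props.get(character, set()))
        return f"{character} is carrying: {', '.join(held)}" if held else ""
```

With this in place, the sword picked up in panel 2 keeps appearing in every later panel's prompt until the state says otherwise — exactly the continuity Story2Board's pixel layer cannot provide on its own.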

And there's a pixel layer: making sure what gets generated matches. The character in panel 5 needs to visually look like the character in panel 1 — same hair texture, same jacket pattern, same face. Story2Board's LPA and RAVM operate here.

You need both. Story2Board without semantic tracking can produce panels where the character looks identical but a stolen artifact magically reappears. CANVAS without pixel-level consistency can produce panels where the story is perfectly coherent but the main character looks like five different people.

We don't think any single system solves both layers today. But the fact that Story2Board can handle the pixel layer without any training — just inference-time attention manipulation — is a meaningful step. It means consistency doesn't have to be baked into the model. It can be applied at generation time, on any transformer-based architecture, in principle.

Whether that principle translates to practice, with real API constraints and models we don't control the internals of, is the question we're still working through.


References

Dinkevich, D., Levy, M., Avrahami, O., Samuel, D., & Lischinski, D. (2025). Story2Board: A Training-Free Approach for Expressive Storyboard Generation. arXiv preprint. arXiv:2508.09983

Mondal, A., Shin, D., Tran, H. P., Bao, Y., Bansal, M., & Essa, I. (2026). CANVAS: Coherent and Aligned Visual Narratives from Automated Story Synthesis. arXiv preprint. arXiv:2604.13452

Zhou, Y., Zhou, D., Cheng, M.-M., Feng, J., & Hou, Q. (2024). StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation. arXiv preprint. arXiv:2405.01434


All illustrations generated using Genkee AI.
