Stop Feeding Pose Skeletons to AI: How SCAIL-2 Animates Characters End-to-End

Last week one of our users built a chase-scene storyboard. Five shots, two characters running through a market, one of them clutching a wrapped package. Static images looked great. Then they hit the "image to video" button on shot three.

The runner's right hand fused with the package. By shot four his accomplice had a different jacket. By shot five they were two different actors entirely. The motion was technically there, but the identity, the prop, and the spatial logic had all drifted.

We've watched this happen with every major image-to-video model in the stack — Kling, Sora, Veo, Wan. They each fail in slightly different ways, but the underlying pattern is the same: the model gets a reference image and a target motion, and somewhere between them, fingers melt and characters morph.

So when a paper from Tsinghua and Z.ai dropped in early June, we read it carefully. Its argument: the problem isn't that the models are too small, it's that we're throwing away most of the input before they even see it. The paper is called SCAIL-2 (Yan et al., 2026), and the core claim is provocative: stop converting driving videos into pose skeletons. Feed the whole video in.

[Illustration: a wushu practitioner mid-motion in a sunlit rehearsal hall, with a film camera recording at the right edge of the frame.]

What pose-driven animation actually loses

Almost every modern character animation model — Animate Anyone, Wan-Animate, Champ, the original SCAIL — follows the same recipe. Step one: take the driving video, run a pose estimator (ViTPose, SDPose, MHR) over it, and extract a sequence of 2D or 3D skeleton frames. Step two: feed that skeleton sequence plus a reference character image into a diffusion model. Step three: render.

The skeleton is the bottleneck. It's compact, it's clean, it strips away all the visual noise of the source video so the model only has to worry about pose. That's the theory. In practice, skeletons drop a lot of information you can't get back:

Depth ambiguity. When two arms cross in front of the body, their 2D projections overlap and the depth ordering is lost. Skeletons collapse 3D occlusion into a tangle of overlapping lines, and the model has to guess which limb is in front.

Environment context. If the source video shows someone walking through a forest brushing branches aside, the skeleton just shows two arms moving. The branches, the friction, the spatial relationship between body and environment — gone.

Object interaction. That wrapped package the runner is carrying? It's not part of the skeleton. The pose estimator doesn't track it. When the model regenerates the scene, it has no obligation to keep it in the right hand.

Non-human subjects. Pose estimators are trained on human anatomy. Try driving an animation with a dog, a robot, or a fantasy creature and the skeleton output collapses or produces nonsense.

The result: pose-driven models look impressive in clean dance demos and absolutely collapse in real cinematic footage with occlusion, props, and crowds.
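The depth-ambiguity point is easy to see in a toy case. The sketch below is ours, not the paper's: it assumes a simple orthographic projection (drop the z axis), and the joint coordinates are made up. Two different 3D wrist configurations end up as the same 2D skeleton.

```python
# Toy illustration: two different 3D arm configurations that produce
# the same 2D skeleton under orthographic projection (drop the z axis).
# The coordinates are invented for the example.

def project_2d(joints_3d):
    """Orthographic projection: keep (x, y), discard depth z."""
    return [(x, y) for (x, y, _z) in joints_3d]

# Wrist positions as (x, y, z); larger z means closer to the camera.
left_arm_in_front = [(-0.1, 1.2, 0.3), (0.1, 1.2, 0.1)]   # left wrist, right wrist
right_arm_in_front = [(-0.1, 1.2, 0.1), (0.1, 1.2, 0.3)]  # same x, y; depth swapped

same_skeleton = project_2d(left_arm_in_front) == project_2d(right_arm_in_front)
print(same_skeleton)  # True: the 2D skeleton cannot tell which arm is in front
```

A 3D pose estimator recovers some of this, but only as well as its own depth guess, which is the same information the generator would otherwise have to guess.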

The end-to-end trick: feed the whole video back in

SCAIL-2's first move is to throw out the pose estimator entirely.

In a latent video diffusion model, the input is a noise tensor z_t that gets denoised over T steps into a clean video latent z_0. Conditioning is added as auxiliary input c — text prompts, reference images, and (in pose-driven methods) pose embeddings. The denoising network learns:

L = E[ || ε - ε_θ(z_t, t, c) ||² ]

In pose-driven methods, c = encoder(pose_skeleton(y)), where y is the driving video. That's the lossy bottleneck.

In SCAIL-2, c = VAE.encode(y). The driving video itself, in latent form, gets concatenated to the noise sequence and handed to the transformer along with the reference character image. No pose estimator. No intermediate representation. The model sees everything — the actor, the environment, the props, the occlusions, the lighting — and learns to extract whatever it needs.

This is what the paper calls end-to-end in-context conditioning. The "in-context" part matters: there's no separate cross-attention pathway for the driving video, no specialized adapter, no LoRA on top. The video just gets appended to the input sequence, the same way you'd append a few extra tokens to a language model prompt.
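A minimal sketch of what "appended to the input sequence" means, under our own assumptions: tokens are represented as labeled tuples, the sizes are hypothetical, and the concatenation order (reference, then driving, then noise) is a guess rather than something the article specifies.

```python
# Sketch of in-context conditioning as plain sequence concatenation.
# Token counts and the ordering of the three streams are assumptions.

def build_in_context_sequence(noise_tokens, driving_tokens, reference_tokens):
    """Concatenate all streams along the sequence axis; one transformer sees them all."""
    return reference_tokens + driving_tokens + noise_tokens

# Hypothetical sizes: 4 latent frames of 6 tokens each, plus 1 reference frame.
frames, tokens_per_frame = 4, 6
noise = [("noise", f, i) for f in range(frames) for i in range(tokens_per_frame)]
driving = [("driving", f, i) for f in range(frames) for i in range(tokens_per_frame)]
reference = [("reference", 0, i) for i in range(tokens_per_frame)]

sequence = build_in_context_sequence(noise, driving, reference)
print(len(sequence))  # 54 tokens: 6 reference + 24 driving + 24 noise
```

The point of the sketch is what is absent: there is no second code path for the driving video. Self-attention over one longer sequence does the work a dedicated adapter would otherwise do.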

The reason nobody tried this earlier isn't that it's hard to code. It's that it's hard to train. To teach a model "given video A featuring character X, produce video A' featuring character Y doing the same motion," you need paired examples. Same motion, different characters, real footage. That data does not exist in any quantity that matters.

SCAIL-2 spends most of its engineering budget solving that data problem. We'll get there.

The seven-channel mask: one model, two modes, six characters

Before the data, there's a model design problem to solve. SCAIL-2 wants to unify three tasks that the field has been treating separately:

Task	What it does
Animation Mode	Reference character + driving video → new video where the character does the driving motion (background follows the reference)
Replacement Mode	Reference character + driving video → new video where the driving video's character is replaced (background follows the driving video)
Multi-character	Multiple reference characters interacting in one scene without their identities bleeding into each other

A naive end-to-end model would confuse these. If the driving video has two people dancing, which reference goes to which dancer? If the reference is a single character but the driving video has a crowd, which body should be replaced?

SCAIL-2's answer is a 7-channel mask layer stacked into the latent. One channel — call it Ch_0 — is an environment switch: when it's set on the reference image's region, the model uses the reference as background; when it's set on the driving video's region, the model uses the driving video as background. That single switch flips Animation Mode and Replacement Mode in the same network weights.

The other six channels are binding slots. Each slot can be assigned to a character. If the driving video shows two dancers and the reference image contains two characters, slot one binds dancer-1 to character-1 and slot two binds dancer-2 to character-2. The motion in each slot only flows to its bound character. You can't accidentally hand the knight's sword swing to the monk.

The masks come from SAM3 (Segment Anything with Concepts), Meta's promptable segmentation model. They're stacked along the channel dimension, downsampled to match the latent grid, and concatenated to the input. The model still gets the complete visual context; the masks supplement the raw video rather than replacing it.
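Here is a rough sketch of how such a mask stack could be assembled for one frame. Everything concrete in it is an assumption on our part: the function names, the nearest-neighbour downsampling, the factor of 4, and the choice to leave unused slots as all-zero channels.

```python
# Sketch: one environment channel + six binding-slot channels, brought
# down to latent resolution. Masks are plain 2D lists of 0/1 values.

def downsample(mask, factor):
    """Nearest-neighbour downsample of a 2D 0/1 mask by an integer factor."""
    return [row[::factor] for row in mask[::factor]]

def build_mask_stack(env_mask, character_masks, factor, num_slots=6):
    """Stack 1 environment channel and num_slots binding channels."""
    if len(character_masks) > num_slots:
        raise ValueError("more characters than binding slots")
    height, width = len(env_mask), len(env_mask[0])
    empty = [[0] * width for _ in range(height)]
    # Unused slots stay all-zero (an assumption, not stated in the article).
    slots = list(character_masks) + [empty] * (num_slots - len(character_masks))
    return [downsample(m, factor) for m in [env_mask] + slots]

H = W = 8
env = [[1] * W for _ in range(H)]  # environment switch set everywhere
dancer_1 = [[1 if x < W // 2 else 0 for x in range(W)] for _ in range(H)]
dancer_2 = [[1 if x >= W // 2 else 0 for x in range(W)] for _ in range(H)]

stack = build_mask_stack(env, [dancer_1, dancer_2], factor=4)
print(len(stack), len(stack[0]), len(stack[0][0]))  # 7 2 2
```

In this toy case channel 1 covers only the left half of the latent grid and channel 2 only the right half, which is the spatial cue that keeps each dancer's motion tied to its own reference character.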

There's one more piece. Animation Mode and Replacement Mode look almost identical in the input, but the relationship between the reference image and the first frame of the driving video is different. In Animation Mode, the reference and the first frame share nothing — the model has to regenerate everything based on the reference. In Replacement Mode, the reference and the first frame share the entire background — only the character changes.

SCAIL-2 disambiguates by applying a mode-specific RoPE shift to the reference latent in Replacement Mode, giving it a small spatial offset that the model learns to read as "this reference is the background, not the foreground." Same weights, two clean modes.
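Why would a positional offset be readable at all? With rotary embeddings, the attention logit between two tokens depends on their relative position, so shifting the reference tokens changes how every other token "sees" them. The sketch below shows this on a single 2D feature pair. The shift size, the frequency theta, and the one-dimensional position are all hypothetical; the paper's actual shift scheme is not described in enough detail here to reproduce.

```python
import math

def rope(vec, position, theta=0.1):
    """Rotate a 2D feature vector by an angle proportional to its position."""
    x, y = vec
    a = position * theta
    return (x * math.cos(a) - y * math.sin(a), x * math.sin(a) + y * math.cos(a))

def score(q, q_pos, k, k_pos):
    """Attention logit between a rotated query and a rotated key."""
    qr, kr = rope(q, q_pos), rope(k, k_pos)
    return qr[0] * kr[0] + qr[1] * kr[1]

q = k = (1.0, 0.0)
SHIFT = 5  # hypothetical offset applied to reference positions in Replacement Mode

animation_score = score(q, 3, k, 3)            # reference token at its natural position
replacement_score = score(q, 3, k, 3 + SHIFT)  # same token content, shifted position
print(round(animation_score, 4), round(replacement_score, 4))  # 1.0 0.8776
```

The token content is identical in both cases; only the position differs, and the logit changes anyway. That is the kind of signal a network can learn to associate with a mode.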

The data problem (and the reverse trick that solves it)

End-to-end training needs (driving_video, target_video) pairs where the motion is identical and the character is different. Real-world footage doesn't include "the same dance performed by an actor and an animated fox in the same studio." So the authors build a synthetic pipeline.

The pipeline is itself an AI agent system, with four roles operating in a loop around a character-editing canvas:

Candidate Selector — a VLM scans a pool of stock character images and picks the one whose pose, body shape, and framing are closest to the source video's first frame
Prompt Weaver — a VLM writes a detailed editing instruction: "take this character, put them in this exact posture, dress them like this, add this background"
Pose Editor — NanoBanana Pro (Google's image editor) executes the instruction and produces an edited reference image
Quality Checker — another VLM checks the edited image for pose accuracy, environment leakage, and overall quality, looping the editor back for fixes if needed
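The control flow of that loop can be sketched in a few lines. This is our reading of the four roles, not the authors' code: the function name, the string-based "images", the retry limit, and the way a checker complaint is folded back into the instruction are all assumptions, and the roles are stubbed out instead of calling a real VLM or image editor.

```python
from typing import Callable, Optional

def build_reference(
    first_frame: str,
    candidate_pool: list[str],
    select: Callable[[str, list[str]], str],       # Candidate Selector
    weave: Callable[[str, str], str],              # Prompt Weaver
    edit: Callable[[str, str], str],               # Pose Editor
    check: Callable[[str, str], Optional[str]],    # Quality Checker: None means pass
    max_rounds: int = 3,
) -> Optional[str]:
    """Run select -> weave -> (edit -> check)* and return an accepted image or None."""
    candidate = select(first_frame, candidate_pool)
    instruction = weave(first_frame, candidate)
    for _ in range(max_rounds):
        edited = edit(candidate, instruction)
        problem = check(first_frame, edited)
        if problem is None:
            return edited
        # Feed the checker's complaint back into the next edit.
        instruction = f"{instruction}; fix: {problem}"
    return None

# Stub roles standing in for the VLM and image-editor calls.
attempts = []

def stub_edit(candidate, instruction):
    attempts.append(instruction)
    return f"edited({candidate})#{len(attempts)}"

def stub_check(first_frame, edited):
    return "pose mismatch" if edited.endswith("#1") else None

result = build_reference(
    "frame0.png", ["a.png", "b.png"],
    select=lambda frame, pool: pool[0],
    weave=lambda frame, cand: f"match pose of {frame}",
    edit=stub_edit, check=stub_check,
)
print(result)  # edited(a.png)#2
```

Passing the roles in as callables keeps the loop itself independent of which VLM or editor sits behind each one.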

The output of this loop is a clean reference image. The pipeline then uses an existing pose-driven generator (the original SCAIL, or Wan-Animate) to produce a synthetic video ỹ = G(pose(y), reference) — a video of the new character performing the original motion.

Here's the twist. The pipeline doesn't use ỹ as the target. It uses ỹ as the driving input during training, and the original real video y as the supervision target. They call this reverse driving.

Why backwards? Because the synthetic video ỹ carries all the imperfections of the generator — bad finger articulation, slightly off textures, the small physical implausibilities pose-driven models always have. If you used ỹ as the target, the model would learn those flaws. But if you use ỹ as the driving signal (which only needs to convey motion, not fidelity) and the real y as the supervision target, the model learns to produce real-quality motion from synthetic driving inputs. The generator's noise gets converted from a liability into a regularization signal.
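In data-structure terms, reverse driving is just a swap of two fields. The sketch below uses a hypothetical `TrainingPair` record and made-up file names. One thing the article does not spell out is which reference image accompanies the pair; we assume it is an image of the character in the real video, since that is the character the target shows.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingPair:
    driving: str    # what the model is conditioned on
    target: str     # what the loss supervises against
    reference: str  # character image the output should match

def naive_pair(real_video, synthetic_video, edited_reference):
    """The obvious setup: real video drives, synthetic video is the target."""
    return TrainingPair(driving=real_video, target=synthetic_video,
                        reference=edited_reference)

def reverse_driving_pair(real_video, synthetic_video, real_character_image):
    """Reverse driving: synthetic video drives, real video is the target.

    The reference is assumed to depict the real video's character.
    """
    return TrainingPair(driving=synthetic_video, target=real_video,
                        reference=real_character_image)

pair = reverse_driving_pair("y_real.mp4", "y_synth.mp4", "actor_frame.png")
print(pair.driving, "->", pair.target)  # y_synth.mp4 -> y_real.mp4
```

In the reversed pair, the generator's artifacts only ever appear on the input side, so the loss never rewards reproducing them.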

The pipeline yields MotionPair-60K — 59,376 motion-transfer pairs across Animation and Replacement modes, single-character and multi-character. The composition is roughly 60% animation pairs (from SCAIL + Wan-Animate) and 20% replacement pairs (from MoCha), with the remaining 20% being conventional pose-driven samples for diversity.

The bias problem (and how DPO surgically fixes the hands)

There's a stubborn problem the synthetic pipeline can't solve. Pose estimators are inaccurate in specific regions — fingers most of all. When the pose estimator outputs garbage skeletons for hands, the synthetic videos have garbled hands. The end-to-end model trained on those videos inherits the bias: clean bodies, fine faces, but melting fingers.

SCAIL-2's fix is Bias-Aware DPO — direct preference optimization, but applied surgically to just the regions where bias is worst.

The construction of preference pairs is clever. Take a motion video y, extract its pose with an accurate estimator (SDPose), and synthesize a positive sample r = G(P(y), R). To construct a negative, run one extra round of error propagation: take the positive r, re-extract its pose, regenerate with the same reference — r⁻ = G(P(r), R). The negative sample inherits one extra round of pose-estimation error compared to the positive. The gap between r and r⁻ is concentrated in fine-grained articulation, especially fingers.
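A toy model makes the error-propagation argument concrete. Below, poses are four numbers, the "estimator" P is exact on body joints and shrinks finger joints by 10%, and the "generator" G renders whatever pose it is given. All of that is invented for illustration; the real P and G are neural networks with far messier errors.

```python
# Toy error propagation: r = G(P(y), R) versus r_neg = G(P(r), R).
FINGER_JOINTS = {2, 3}  # hypothetical indices of finger joints in a 4-joint pose

def estimate_pose(video_pose):
    """Toy estimator P: exact on body joints, shrinks finger joints by 10%."""
    return [v * 0.9 if i in FINGER_JOINTS else v for i, v in enumerate(video_pose)]

def generate(pose, reference):
    """Toy generator G: renders the pose it was given exactly (reference unused)."""
    return list(pose)

def finger_error(sample, truth):
    """Total absolute deviation on the finger joints only."""
    return sum(abs(sample[i] - truth[i]) for i in FINGER_JOINTS)

y = [1.0, 2.0, 1.0, 2.0]   # ground-truth motion
R = "reference.png"

r_pos = generate(estimate_pose(y), R)      # one round of estimation error
r_neg = generate(estimate_pose(r_pos), R)  # two rounds of estimation error

print(finger_error(r_neg, y) > finger_error(r_pos, y))  # True
print(r_pos[:2] == r_neg[:2] == y[:2])                   # True: body joints untouched
```

The two samples agree everywhere the estimator is reliable and differ only where it is not, which is why the preference signal lands on the fingers.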

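The DPO objective is then standard, except that it's masked to the hand region. With x⁺ = r as the preferred sample and x⁻ = r⁻ as the rejected one:

L_DPO^M = -log σ( -β/2 · (Δ_M(x⁺, t) - Δ_M(x⁻, t)) )

Here Δ_M is the velocity prediction error masked by M, the bounding box of the hands in each frame. Minimizing the loss pushes the masked error on x⁺ below the masked error on x⁻, so optimization pressure concentrates exactly on the failure mode.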
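A small numeric sketch of a masked preference loss, assuming the usual -log σ form of DPO. The function names, the flat four-element "velocity field", and β = 10 are ours. Standard Diffusion-DPO also measures each error relative to a frozen reference model; that term is left out here to keep the example short.

```python
import math

def masked_error(pred, target, mask):
    """Mean squared velocity error over positions where mask == 1."""
    picked = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    return sum(picked) / len(picked) if picked else 0.0

def bias_aware_dpo_loss(delta_pos, delta_neg, beta):
    """-log sigmoid(-beta/2 * (delta_pos - delta_neg))."""
    z = -0.5 * beta * (delta_pos - delta_neg)
    return math.log1p(math.exp(-z))  # identical to -log(sigmoid(z))

target = [0.0, 0.0, 0.0, 0.0]
hand_mask = [0, 0, 1, 1]  # last two positions fall inside the hand bounding box

delta_pos = masked_error([0.0, 0.0, 0.1, 0.1], target, hand_mask)  # small hand error
delta_neg = masked_error([0.0, 0.0, 0.5, 0.5], target, hand_mask)  # large hand error

good = bias_aware_dpo_loss(delta_pos, delta_neg, beta=10.0)  # model fits x+ better
bad = bias_aware_dpo_loss(delta_neg, delta_pos, beta=10.0)   # model fits x- better
print(good < bad)  # True
```

Errors outside the mask never enter `masked_error`, so a prediction that is wildly wrong on the torso but right on the hands gets the same loss. The loss itself is blind there, even though the gradient step still moves shared weights.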

There's a side effect the authors flag in the discussion: even though the loss only fires inside the hand mask, the underlying flow-matching policy updates globally. So fixing the hands incidentally improves mouth detail and shoulder articulation. The mask doesn't constrain where the policy can move; it just concentrates which gradients dominate.

The numbers

The model is a 14B-parameter DiT built on the Wan2.1-14B-I2V backbone, trained for 3,500 steps with batch size 128, then DPO-tuned for another 400 steps. Hardware: 64× H100 for about a week. Expensive, in concrete terms.
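For a sense of scale, here is the back-of-envelope arithmetic behind those figures. The step count, batch size, and GPU count come from the setup above; treating "about a week" as exactly seven days and pricing an H100-hour at $5 are our assumptions, and real cloud prices vary widely.

```python
STEPS, BATCH_SIZE = 3_500, 128   # main training run, from the setup above
GPUS, DAYS = 64, 7               # "about a week" taken as exactly 7 days

samples_seen = STEPS * BATCH_SIZE  # assumes one training sample per batch element
gpu_hours = GPUS * DAYS * 24

HOURLY_RATE_USD = 5.0              # hypothetical price per H100-hour
estimated_cost = gpu_hours * HOURLY_RATE_USD

print(samples_seen, gpu_hours, estimated_cost)  # 448000 10752 53760.0
```

Halve or double the hourly rate and the total moves with it, but the order of magnitude stays in the tens of thousands of dollars for a single run.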

Results on Studio-Bench single-character animation:

Method	SSIM ↑	PSNR ↑	LPIPS ↓	FVD ↓
SCAIL-2 + NLF-Pose Skeleton	0.6370	18.76	0.2285	282.85
SCAIL-2 + SAM3D-Body Mesh	0.6453	19.09	0.2231	287.11
Wan-Animate	0.6340	18.62	0.2269	305.31
SteadyDancer	0.6386	18.40	0.2273	332.20
UniAnimate-DiT	0.6367	18.52	0.2747	480.15

On Video-Bench automatic evaluation (X-dance dataset), SCAIL-2 scores 4.43 Imaging Quality / 4.38 Appearance Consistency, beating Wan-Animate (3.80 / 4.23) and the original SCAIL (4.27 / 4.25).

The human evaluation is more striking. On cross-identity inference (different reference character than the source actor), SCAIL-2 wins 68.3% to 23.3% against the original SCAIL and 65% to 28.3% against Wan-Animate on motion consistency. Against Kling 3.0 Motion Control — a closed proprietary service — it's roughly even at 36.7% / 23.3% / 40% (win / tie / loss).

The multi-character results are where it really opens a gap. On identity isolation (making sure character A's clothes don't bleed onto character B during overlapping motion), SCAIL-2 beats Wan-Animate 95% to 5% and MultiAnimate 86.7% to 13%. Zero-shot multi-character animation has been a persistent weak point for the field, and this is the first method we've seen that handles it cleanly through architectural design rather than post-hoc cleanup.

Where it falls short

We want to be honest about the deal-breakers.

The training cost is brutal. 64× H100 for a week is somewhere north of $50,000 in raw compute, before engineering salary and iteration overhead. The model weights will be released, but reproducing or fine-tuning the training run is out of reach for anyone smaller than a well-funded lab.

Bias-Aware DPO is hand-only. Lip-sync, micro-expressions, and subtle facial articulation are explicitly listed as future work. If your application depends on dialogue lip-sync — and a lot of cinematic storyboarding does — this paper doesn't help yet.

Long videos are not in the eval. Every demo and every benchmark video is only a few seconds long. Whether the model maintains identity over a 30-second shot with continuous motion and scene transitions is unverified. Production storyboard work needs much longer shots.

The synthetic data ceiling is inherited. SCAIL-2 acknowledges that its training data quality is bounded by the pose-driven generators it uses to synthesize pairs (Wan-Animate, MoCha, original SCAIL). If those generators have systematic biases that the editing pipeline doesn't catch, SCAIL-2 inherits them. The Bias-Aware DPO papers over the worst cases but doesn't fix the source.

Closed-source generators in the loop. NanoBanana Pro for image editing, proprietary VLMs for the agent loop. The pipeline isn't independently reproducible without access to those services.

Why we're not training our own SCAIL-2 (and what we are taking from it)

Storyboard runs an API-router architecture. Every shot's image and every shot's motion comes from a commercial model — NanoBanana Pro for editing, Imagen and Seedream for generation, Sora and Veo and Kling and Wan for image-to-video. We don't train our own diffusion backbone. The trade-off is explicit: we permanently track the frontier of what commercial APIs can do, and we lose the ability to make architectural changes the way SCAIL-2 does.

Our character-consistency story lives entirely at the application layer. A character has a profile (appearance, behavioral signature, dramatic core), an image_history field with multiple reference views (front, side, three-quarter), and an LLM agent that decides which reference to pass to which generator for which shot. When the same character shows up in shot 1, shot 5, and shot 11, the agent passes the same primary reference image through, and the underlying model's reference-conditioning capability does the rest.
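As a simplified illustration of that structure (not our production schema), the sketch below models a profile with an `image_history` mapping and replaces the LLM agent with a trivial rule: prefer one view, fall back to whatever exists. Every field name other than `image_history`, along with the paths and the `pick_reference` helper, is hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class CharacterProfile:
    name: str
    appearance: str
    behavioral_signature: str
    dramatic_core: str
    # Maps a view name ("front", "side", ...) to a reference image path.
    image_history: dict[str, str] = field(default_factory=dict)

def pick_reference(profile: CharacterProfile, preferred_view: str = "front") -> str:
    """Rule-based stand-in for the agent: preferred view, else first stored view."""
    if not profile.image_history:
        raise ValueError(f"{profile.name} has no reference images")
    if preferred_view in profile.image_history:
        return profile.image_history[preferred_view]
    return next(iter(profile.image_history.values()))

runner = CharacterProfile(
    name="runner",
    appearance="grey jacket, wrapped package in right hand",
    behavioral_signature="glances over shoulder",
    dramatic_core="must deliver the package",
    image_history={"front": "runner_front.png", "side": "runner_side.png"},
)

# The same primary reference is reused for every shot the character appears in.
refs = {shot: pick_reference(runner) for shot in (1, 5, 11)}
print(refs)  # {1: 'runner_front.png', 5: 'runner_front.png', 11: 'runner_front.png'}
```

Nothing in this layer can tell a generator which motion belongs to which character once two references go into the same call, which is the limit described next.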

This works surprisingly well for single characters. It breaks down in exactly the place SCAIL-2 targets: multi-character interactions where two reference images and one driving signal get confused, and the model has no internal mechanism for binding which motion goes to which character. The application layer can describe "the knight swings the sword while the monk steps back" in prose, but the generator's attention can still tangle them.

There's no clean fix at our layer for that problem. SCAIL-2's binding slots are a real architectural answer, and the only way to use them is either to run SCAIL-2 directly (once weights are released) or to wait for closed-source services to incorporate similar mechanisms.

What we can steal directly is the reverse-driving pipeline. We have a film knowledge base with 275K shots, and we're constantly looking for ways to distill that data into training signal for downstream tasks (style tagging, shot composition, scene retrieval). The agentic edit loop — Selector + Weaver + Editor + Checker — is a clean template for building any kind of synthetic paired data, and the reverse-driving idea (use synthetic outputs as inputs, never as targets) generalizes far beyond character animation. We're already prototyping a version for style-consistency fine-tuning where the same synthetic-then-reversed structure applies.

The deeper lesson, the one that took us a few re-reads to internalize, is about where to put the lossy bottleneck. Every pipeline has one — a place where you compress raw signal down to something compact and structured. Pose-driven animation puts the bottleneck in front of the model: video gets compressed to skeleton, then handed in. SCAIL-2 moves the bottleneck inside the model: video goes in raw, and the transformer learns its own compressed representation. The position of that compression matters more than its sophistication. Move it earlier and you cap the model's ceiling. Move it later and you trust the model to find what it needs.

It's a useful frame for thinking about our own stack. Wherever we're doing a heuristic step in front of a generator — pre-formatting prompts, pre-cropping images, pre-selecting references — we should ask whether that step is genuinely helping the model or just bottlenecking what it could otherwise figure out for itself. The answer isn't always "move it inside." Sometimes the heuristic is doing real work. But it's the right question to be asking, and SCAIL-2 is a clean argument for asking it often.

References

Yan, W., Guo, F., Yang, Z., & Tang, J. (2026). SCAIL-2: Unifying Controlled Character Animation with End-to-end In-Context Conditioning. arXiv preprint arXiv:2606.10804.

Cheng, G., Gao, X., Hu, L., et al. (2025). Wan-Animate: Unified character animation and replacement with holistic replication. arXiv preprint arXiv:2511.14055.

Hu, L. (2024). Animate Anyone: Consistent and controllable image-to-video synthesis for character animation. In CVPR, 8153–8163.

Xu, Z., Ma, J., Wang, Z., et al. (2026). End-to-End Video Character Replacement without Structural Guidance (MoCha). arXiv preprint arXiv:2601.08587.

Wallace, B., Dang, M., Rafailov, R., et al. (2024). Diffusion Model Alignment Using Direct Preference Optimization. In CVPR, 8228–8238.

Yan, W., Ye, S., Yang, Z., et al. (2025). SCAIL: Towards Studio-Grade Character Animation via In-Context Learning of 3D-Consistent Pose Representations. arXiv preprint arXiv:2512.05905.

All illustrations generated using Genkee AI.
