Wan 2.2 Animate 14B: Animate Any Photo with Expressions

Introduction

Animating a character from a still photo inside an existing video is now straightforward. Given a reference image and a reference video, Wan2.2-Animate-14B maps the identity and expressions from the image onto the person in the video.

In this guide, I set up the model, run the local demo, and animate a character in a real clip. I also explain how the model works under the hood and share practical notes on quality, performance, and memory requirements.

What is Wan2.2-Animate-14B?

Wan2.2-Animate-14B is a large-scale video generation model built for controllable character animation in existing videos. It takes a single reference image (the identity you want) and a reference video (the motion you want) and produces a new video where the subject in the source clip follows the original motion while adopting the identity and expressions from the image.

It extends the Wan 2.2 foundation with a mixture-of-experts architecture that increases model capacity at similar compute, which helps with more complex motion and better control. The model targets high-quality, expression-aware character replacement and animation.

Overview at a Glance

Name: Wan2.2-Animate-14B
Purpose: Character animation and replacement in videos using a single reference image
Model Size: 14B parameters (generation backbone)
Training Objective: Flow matching with diffusion-style denoising in latent space
Latent Compression: Custom 3D causal VAE (~127M params), strictly causal over time
Causality: Future frames do not leak into the past, for stable motion
Attention: Spatio-temporal self-attention across frames and pixels
Conditioning: Cross-attention to text features via a T5-based text encoder
Inputs: Reference image + reference video (and optional text conditioning depending on the UI)
Output: Video with the subject animated to match the reference image identity and expressions
Interface: Local Gradio demo (app script in the repo)
Tested OS: Ubuntu
Tested GPU: NVIDIA H100 80GB
Memory Notes: ~75 GB VRAM recommended for long videos

Key Features

  • Identity-to-motion animation

    • Replace a person in a video with a reference face from a single image.
    • Transfer expressions and head dynamics aligned with the source motion.
  • Capacity without extra compute cost

    • Mixture-of-experts design increases capacity for richer motion and detail.
  • Stable motion from causal design

    • 3D causal VAE and temporal causality prevent future-to-past information leak.
  • Latent-space video generation

    • Videos are generated in a compact spatio-temporal latent grid, then decoded.
  • Practical local workflow

    • Run a Gradio demo, upload an image and a reference clip, and generate.

How to Use Wan2.2-Animate-14B

System and Hardware

  • OS: Ubuntu (tested)
  • GPU: NVIDIA H100 80GB (tested); high VRAM recommended for long outputs
  • VRAM: Around 75 GB for longer sequences

Software Prerequisites

  • Python 3.x
  • Virtual environment tool (venv or conda)
  • Git
  • CUDA-compatible PyTorch stack matching your GPU drivers
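Before installing the repository's requirements, it can help to confirm that your PyTorch build actually sees the GPU and how much VRAM it exposes. The snippet below is a minimal sketch that only assumes a standard CUDA-enabled PyTorch install.

```python
# Minimal environment check: confirm a CUDA-capable PyTorch build and report VRAM.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available: ", torch.cuda.is_available())

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / (1024 ** 3)
    print(f"GPU: {props.name}, VRAM: {vram_gb:.1f} GB")
    if vram_gb < 75:
        # The long-video guidance in this guide is ~75 GB; shorter clips need less.
        print("Note: long or high-resolution outputs may need around 75 GB of VRAM.")
else:
    print("No CUDA device detected; check your driver and PyTorch installation.")
```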

Reference Assets

  • One reference image: the identity you want to animate.
  • One reference video: the motion you want to inherit.
  • For better control, start with a video that has a single, clearly visible subject.

Step-by-Step: Install and Launch the Demo

1) Create a virtual environment

  • Create and activate a fresh environment to keep dependencies clean.
  • Update pip inside the environment before installing requirements.

2) Clone the repository and enter the folder

  • Clone the Wan repository that includes the Animate app.
  • Change directory into the cloned project root.

3) Install requirements from the project root

  • Install Python dependencies from the requirements file provided in the repo.
  • Ensure the CUDA-compatible packages match your system.

4) Launch the app script

  • Run the app script from the project root to start the Gradio demo.
  • Wait for the local URL to appear in the terminal.

5) Open the demo in your browser

  • Copy the local URL from the terminal into your browser.
  • You should see the demo interface ready for inputs.
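For reference, the steps above can be condensed into a single Python script. Treat this as a sketch: the repository URL, the requirements file name, and the `app.py` entry point are assumptions on my part, so substitute the actual values from the repo you are using.

```python
# Sketch of steps 1-4: venv creation, clone, dependency install, demo launch.
# Repo URL, requirements file name, and app script name are placeholders.
import subprocess
import sys
from pathlib import Path

REPO_URL = "https://github.com/Wan-Video/Wan2.2"   # adjust to the repo you are using
REPO_DIR = Path("Wan2.2")
VENV_DIR = Path(".venv")

def run(cmd, cwd=None):
    print("+", " ".join(str(c) for c in cmd))
    subprocess.run([str(c) for c in cmd], cwd=cwd, check=True)

# 1) Create a fresh virtual environment and upgrade pip inside it.
run([sys.executable, "-m", "venv", str(VENV_DIR)])
venv_python = (VENV_DIR / "bin" / "python").resolve()   # use Scripts\python.exe on Windows
run([venv_python, "-m", "pip", "install", "--upgrade", "pip"])

# 2) Clone the repository that includes the Animate app (skip if already cloned).
if not REPO_DIR.exists():
    run(["git", "clone", REPO_URL, str(REPO_DIR)])

# 3) Install the Python dependencies from the project root.
run([venv_python, "-m", "pip", "install", "-r", REPO_DIR / "requirements.txt"])

# 4) Launch the Gradio app script (name assumed), then open the printed local URL.
run([venv_python, "app.py"], cwd=REPO_DIR)
```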

Animate a Character from an Image and a Video

Prepare Inputs

  • Upload a reference image: choose a clear, front-facing photo when possible.
  • Upload a reference video: a clip with a single person and consistent motion improves results.

Generate

  • Select the animate mode in the interface.
  • Click Generate and wait for the process to complete. Longer clips will take more time, especially at higher settings.

Review the Output

  • Expect the model to preserve the background and scene structure.
  • Identity transfer may vary by facial hair, lighting, and pose differences.
  • Non-face objects can be imperfectly reconstructed; small props may change or distort.
  • Expression transfer generally follows the source motion and facial dynamics.

How Wan2.2-Animate-14B Works

Latent Video Generation Pipeline

Wan2.2-Animate-14B is a diffusion transformer video generator trained with flow matching. Rather than working directly in pixel space, it operates in a compact spatio-temporal latent space, then decodes latents to frames.

  • Raw videos are first compressed by a custom 3D causal VAE into a spatio-temporal latent grid.
  • The first frame is compressed only spatially (no temporal downsampling), so a still image is encoded cleanly; this helps preserve identity from a single reference image.
  • The VAE is lightweight (~127M parameters) and strictly causal over time so future frames cannot influence past ones.
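To make the causality point concrete, here is one common way to implement a strictly causal 3D convolution in PyTorch. It is a generic illustration under my own assumptions, not the model's actual code.

```python
# Causal 3D convolution sketch: temporal padding is applied only on the "past"
# side, so an output frame can never depend on frames that come after it.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.pad_t = kernel_size - 1          # all temporal padding goes to the past
        self.pad_s = kernel_size // 2         # symmetric spatial padding
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size)

    def forward(self, x):                      # x: (B, C, T, H, W)
        x = F.pad(x, (self.pad_s, self.pad_s,  # width
                      self.pad_s, self.pad_s,  # height
                      self.pad_t, 0))          # time: pad the past only
        return self.conv(x)

# Sanity check: perturbing the last frame must not change earlier outputs.
layer = CausalConv3d(3, 4)
x = torch.randn(1, 3, 8, 16, 16)
x_perturbed = x.clone()
x_perturbed[:, :, -1] += 1.0
assert torch.allclose(layer(x)[:, :, :-1], layer(x_perturbed)[:, :, :-1])
```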

Diffusion Transformer and Conditioning

On top of the latent representations, the model applies a stack of diffusion transformer blocks with spatio-temporal self-attention to model dynamics across frames and spatial positions. This structure helps the model learn consistent motion, expressions, and scene coherence over time.

  • Cross-attention to text features is available via a T5-based text encoder, which provides optional textual conditioning when the interface exposes it.
  • Trained with flow matching, the transformer predicts a velocity field across diffusion timesteps and iteratively denoises latents into a coherent video.
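As an illustration of the flow-matching objective mentioned above (a sketch under my own simplifying assumptions, not the actual training code), the loss reduces to a velocity regression: interpolate between clean latents and noise, and train the network to predict the difference.

```python
# Flow-matching training step (illustrative). With a straight-line path between
# data and noise, the target velocity is simply (noise - data).
import torch
import torch.nn.functional as F

def flow_matching_loss(model, latents, text_emb):
    """latents: clean video latents (B, C, T, H, W); text_emb: conditioning features."""
    b = latents.shape[0]
    noise = torch.randn_like(latents)
    t = torch.rand(b, device=latents.device)          # diffusion time in [0, 1]
    t_ = t.view(b, 1, 1, 1, 1)
    x_t = (1.0 - t_) * latents + t_ * noise           # point on the straight path
    target_v = noise - latents                        # velocity of that path
    pred_v = model(x_t, t, text_emb)                  # model signature is assumed
    return F.mse_loss(pred_v, target_v)
```

At inference time, the same velocity prediction is integrated step by step from pure noise back to a clean latent video, which the VAE then decodes into frames.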

Capacity and Quality

The Animate model scales the diffusion transformer to around 14 billion parameters for quality and motion complexity. This scaling, combined with the mixture-of-experts design, contributes to stronger identity mapping and temporal consistency; because only one expert is active at a time, compute per denoising step does not grow in proportion to the added capacity.
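The routing idea can be sketched as follows. This is my own illustration of a two-expert, timestep-routed denoiser consistent with the description above, not the actual Wan 2.2 implementation; the `switch_t` threshold and module interfaces are assumptions.

```python
# Two-expert denoiser routed by noise level (sketch). Only one expert runs per
# denoising step, so active compute stays close to a single dense model even
# though the total parameter count roughly doubles.
import torch
import torch.nn as nn

class TwoExpertDenoiser(nn.Module):
    def __init__(self, high_noise_expert: nn.Module, low_noise_expert: nn.Module,
                 switch_t: float = 0.5):
        super().__init__()
        self.high_noise_expert = high_noise_expert   # handles early, very noisy steps
        self.low_noise_expert = low_noise_expert     # refines detail at low noise levels
        self.switch_t = switch_t                     # boundary in diffusion time [0, 1]

    def forward(self, x_t, t, cond):
        # Route the whole step to one expert based on the current noise level.
        expert = self.high_noise_expert if float(t.mean()) > self.switch_t else self.low_noise_expert
        return expert(x_t, t, cond)
```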

Results: What to Expect

In practice, the model can keep the original background while replacing the subject’s face and head region to match the reference image. Hair and general clothing patterns can transfer, but fine elements can vary.

  • Expressions often follow the source video’s facial dynamics and head movement.
  • Some props can be misinterpreted, especially small objects or low-light details.
  • Faces can look unusual under certain combinations of pose, lighting, facial hair, or occlusions.
  • Single-subject videos tend to work better than multi-person scenes.

More Architecture Notes and Memory Considerations

Causality, Normalization, and Feature Caching

To preserve temporal causality and support efficient inference, the model replaces GroupNorm with RMSNorm. Because RMSNorm normalizes each position using only its own statistics, it does not mix information across time steps, which keeps the time order intact and enables a feature-cache mechanism that reduces redundant computation across frames.

  • RMSNorm helps stabilize temporal behavior.
  • Feature caching speeds up inference by reusing intermediate features when possible.
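For reference, RMSNorm itself is only a few lines; the sketch below shows the standard formulation. Because each feature vector is normalized with its own statistics, nothing is shared across time steps, which is what keeps cached per-frame features valid.

```python
# Standard RMSNorm: scale each feature vector by the inverse of its root-mean-square.
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):                  # x: (..., dim)
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight
```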

Chunk-wise Causal Convolutions for Long Videos

The 3D VAE employs chunk-wise causal convolutions with smart cache padding. This design allows the model to generate long videos without memory spikes.

  • The VAE was bootstrapped from a 2D image VAE, then inflated to 3D and fine-tuned with an L1 reconstruction loss.
  • Chunking and caching make it feasible to handle extended sequences at high resolution.
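The chunking pattern can be illustrated with a short sketch (my own simplified version of the general technique, not the repo's code): process the video a chunk at a time, carrying only the last few frames forward as a cache, so peak memory is bounded by the chunk size rather than the clip length.

```python
# Chunk-wise causal encoding with a small frame cache (sketch). Assumes the
# encoder is causal and preserves temporal length, and that its causal
# receptive field fits within `context` frames; real VAEs with temporal
# downsampling cache intermediate features instead.
import torch

def encode_in_chunks(encoder, video, chunk=16, context=2):
    """video: (B, C, T, H, W) tensor; encoder: causal module such as CausalConv3d above."""
    outputs, cache = [], None
    for start in range(0, video.shape[2], chunk):
        frames = video[:, :, start:start + chunk]
        if cache is not None:
            frames = torch.cat([cache, frames], dim=2)   # prepend cached past frames
        encoded = encoder(frames)
        if cache is not None:
            encoded = encoded[:, :, cache.shape[2]:]     # drop outputs for cached frames
        outputs.append(encoded)
        cache = frames[:, :, -context:]                  # keep only the most recent frames
    return torch.cat(outputs, dim=2)
```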

VRAM Requirements

Memory usage grows with resolution and duration. For long videos, plan for around 75 GB of VRAM. Shorter clips can run on less memory, but the exact requirement depends on your settings.

FAQs

What inputs do I need?

You need a single reference image of the identity to animate and a reference video that contains the motion you want. Clear faces and single-subject clips improve results.

Does it preserve the original scene?

Yes, the model generally preserves the background and scene layout. The focus is on replacing and animating the subject to match the reference identity and expressions.

How important is the reference video?

Very important. The motion, head turns, and facial dynamics in the output reflect the source video. Use clips with steady motion and adequate lighting.

Can I use text prompts?

The architecture supports cross-attention to text features via a T5-based text encoder. Whether text controls are exposed depends on the interface and app options in your environment.

How long does generation take?

Generation time depends on the clip length, resolution, and hardware. Expect longer waits for extended or higher-resolution outputs.

What GPU and memory do I need?

A high-memory GPU is recommended. For long videos, plan for around 75 GB of VRAM. Shorter clips may work with less, but memory needs vary with settings.

Why might the face look off?

Large differences between the reference image and the video (pose, lighting, facial hair, occlusions) can affect identity transfer. Try a clear, front-facing reference image and a video with a single subject.

Why do small objects sometimes change?

Fine details and small props can be harder to reproduce. Low light and motion blur can also lead to altered or ambiguous objects in the output.

Conclusion

Wan2.2-Animate-14B brings identity-driven animation to local workflows: choose a reference image, pick a reference video, and generate a new clip where the subject adopts the identity and expressions from the image while following the original motion. The model’s causal latent pipeline, diffusion transformer with spatio-temporal attention, and mixture-of-experts design contribute to stable motion, expression transfer, and scene preservation.

For best results, use a clear reference image and a video with a single subject, steady motion, and good lighting. Be prepared for high memory usage on long sequences, and expect occasional artifacts on small objects or under challenging conditions. With the local Gradio demo, you can iterate quickly on inputs and settings to reach the quality you need.
