HuMo AI: ByteDance's Human-Centric Video Generation

Table of Contents
- Introduction to HuMo AI
- What is HuMo AI?
- Key Features of HuMo
- How to Install HuMo AI?
- Model Preparation and Download
- How to Use HuMo for Video Generation
- Configuring HuMo's Settings
- Running the Generation Process
- FAQs About HuMo AI
- What is HuMo AI?
- Is HuMo AI open source?
- What are the main input modes for HuMo?
- What is the maximum video length HuMo can generate?
- What video resolutions does HuMo support?
- Do I need a powerful GPU to run HuMo?
- Can I use HuMo without installing it locally?
- How good is the lip-syncing?
- What are the `scale_a` and `scale_t` parameters?
- Where can I download the required models?
Introduction to HuMo AI
I recently came across something called HuMo, developed by ByteDance, and it immediately caught my attention. The project is described as human-centric video generation via collaborative multimodal conditioning, a very smart approach to getting more controllability out of AI video generation, and the results this model produces are impressive.
It can accept photo references, audio references, and text, combining them all into a single, cohesive output. The system appears to be highly steerable. It is also open-source under the Apache 2.0 license. It is built upon the foundation of many other open-source AI projects. Let's take a look.
These are all raw outputs from the model, of course, with references uploaded. It can handle basic text-to-video generation with multi-image inputs and combine them all together in the same scene. You can also add text and audio to that mix and control them. The audio samples uploaded to the model for testing are not very cinematic or high quality, so keep that in mind. However, pay close attention to the lip syncing.
The lip syncing is incredible. It seems to do a great job with character consistency and adding objects in, like the football you see here. Furthermore, you can actually prompt it to control the edits.
What is HuMo AI?
HuMo AI is a unified, human-centric video generation framework designed to produce high-quality, fine-grained, and controllable human videos from multimodal inputs, including text, images, and audio. HuMo supports strong text prompt following, consistent subject preservation, and synchronized audio-driven motion.
Key Features of HuMo
HuMo offers three distinct modes of operation, each providing a different level of creative control.
- Video Generation from Text and Image: Allows you to customize character appearance, clothing, makeup, props, and scenes, using text prompts combined with reference images.
- Video Generation from Text and Audio: Generates audio-synchronized videos solely from text and audio inputs. It removes the need for image references and enables greater creative freedom.
- Video Generation from Text, Image, and Audio: Achieves the highest level of customization and control by combining text, image, and audio guidance all at once.
How to Install HuMo AI?
Getting started with HuMo requires setting up a Python environment and installing the necessary dependencies. Here is a step-by-step guide.
Step 1: Create a Conda Environment
Open your terminal or command prompt and run the following command to create a new environment named humo with Python 3.11.
conda create -n humo python=3.11
Step 2: Activate the Environment
Activate the new environment you just created.
conda activate humo
Step 3: Install PyTorch
Install the specific versions of PyTorch, TorchVision, and TorchAudio compatible with CUDA 12.4.
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
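Before moving on, it is worth confirming that this CUDA build of PyTorch actually sees your GPU. A quick check (assuming the humo environment is active):

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"

If the last value prints False, the generation scripts will not be able to use the GPU.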
Step 4: Install Flash Attention
Install the Flash Attention library, which is required for the model.
pip install flash_attn==2.6.3
Step 5: Install Requirements
Install the remaining Python packages listed in the project's requirements.txt file.
pip install -r requirements.txt
Step 6: Install FFmpeg
Install FFmpeg, a crucial tool for handling video and audio files, using Conda.
conda install -c conda-forge ffmpeg
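With everything installed, two quick sanity checks (again with the humo environment active) confirm that FFmpeg is on the PATH and that Flash Attention imports cleanly:

ffmpeg -version

python -c "import flash_attn; print('flash_attn OK')"

Both commands should complete without errors before you move on to downloading the models.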
Model Preparation and Download
Before you can generate videos, you need to download the necessary model weights. The following table provides an overview of the required models and where to get them.
| Model | Download Link | Notes |
|---|---|---|
| HuMo-17B | 🤗 Hugging Face | Supports 480P & 720P |
| HuMo-1.7B | 🤗 Hugging Face | To be released soon |
| Wan-2.1 | 🤗 Hugging Face | VAE & text encoder |
| Whisper-large-v3 | 🤗 Hugging Face | Audio encoder |
| Audio separator | 🤗 Hugging Face | Removes background noise (optional) |
You can download these models using the huggingface-cli tool. Run the following commands in your terminal.
Download the Wan2.1 model:
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./weights/Wan2.1-T2V-1.3B
Download the main HuMo model:
huggingface-cli download bytedance-research/HuMo --local-dir ./weights/HuMo
Download the Whisper audio encoder:
huggingface-cli download openai/whisper-large-v3 --local-dir ./weights/whisper-large-v3
Download the optional audio separator:
huggingface-cli download huangjackson/Kim_Vocal_2 --local-dir ./weights/audio_separator
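Assuming the downloads completed into the --local-dir paths used above, a quick listing of the weights directory should show one subfolder per model:

ls ./weights

You should see HuMo, Wan2.1-T2V-1.3B, whisper-large-v3, and audio_separator before moving on to inference.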
How to Use HuMo for Video Generation
The model is compatible with both 480P and 720P resolutions; 720P inference achieves noticeably better quality.
Some tips:
- Please prepare your text, reference images, and audio as described in test_case.json (a quick audio sanity check is sketched after this list).
- Multi-GPU inference is supported using FSDP + Sequence Parallel.
- The model is trained on 97-frame videos at 25 FPS, so generating videos longer than 97 frames may degrade performance. The authors plan to provide a new checkpoint for longer generation.
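Because FFmpeg was installed in Step 6, ffprobe (which ships with it) offers a quick way to inspect a reference audio clip, its codec, sample rate, and duration, before wiring it into test_case.json. The filename below is only a placeholder:

ffprobe -hide_banner my_reference_audio.wav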
Configuring HuMo's Settings
HuMo's behavior and output can be customized by modifying the generate.yaml configuration file. The following parameters control generation length, video resolution, and how text, image, and audio inputs are balanced.
Generation Parameters:
- `frames: <int>` - the number of frames for the generated video.
- `scale_a: <float>` - the strength of audio guidance; a higher value means better audio-motion synchronization.
- `scale_t: <float>` - the strength of text guidance; a higher value means better adherence to text prompts.
- `mode: "TA"` - the input mode; use `"TA"` for text and audio, or `"TIA"` for text, image, and audio.
- `height: 720` - the video height (e.g., 720 or 480).
- `width: 1280` - the video width (e.g., 1280 or 832).
Diffusion Parameters:
- `steps: 50` (nested under `timesteps` > `sampling`) - the number of denoising steps. A lower number (30-40) allows for faster generation.
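Putting the settings above together, a minimal generate.yaml excerpt might look like the sketch below. The exact key nesting and the concrete guidance values are assumptions for illustration; edit the generate.yaml file shipped with HuMo rather than copying this verbatim.

```yaml
frames: 97        # the model is trained on 97 frames at 25 FPS
scale_a: 5.0      # audio guidance strength (placeholder value)
scale_t: 7.5      # text guidance strength (placeholder value)
mode: "TIA"       # "TA" = text + audio, "TIA" = text + image + audio
height: 720       # 720 or 480
width: 1280       # 1280 or 832

timesteps:
  sampling:
    steps: 50     # denoising steps; 30-40 is faster at some cost to quality
```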
Running the Generation Process
Depending on the input mode you want to use, you will run a different shell script.
- For Text and Audio Input: To generate a video using only text and audio inputs, run the following command in your terminal.
bash infer_ta.sh
- For Text, Image, and Audio Input: To generate a video using text, image, and audio inputs together, run this command.
bash infer_tia.sh
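If you switch between modes often, a small wrapper can keep the mode value in generate.yaml and the script you launch from drifting apart. This is only a convenience sketch around the two scripts above, on the assumption that the configured mode and the chosen script are meant to agree:

```bash
#!/usr/bin/env bash
# Hypothetical helper: pick the input mode, then call the matching HuMo script.
MODE="TIA"               # keep in sync with the mode set in generate.yaml
if [ "$MODE" = "TA" ]; then
  bash infer_ta.sh       # text + audio
else
  bash infer_tia.sh      # text + image + audio
fi
```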
FAQs About HuMo AI
What is HuMo AI?
HuMo is a unified, human-centric video generation framework from ByteDance. It is designed to create high-quality, controllable videos of humans using a combination of text, image, and audio inputs.
Is HuMo AI open source?
Yes, HuMo is released under the Apache 2.0 license. This means it is open source and can be used for both research and commercial purposes.
What are the main input modes for HuMo?
HuMo supports three primary modes:
- Text-Image: for appearance control
- Text-Audio: for audio-synchronized motion
- Text-Image-Audio: for combined, high-level control
What is the maximum video length HuMo can generate?
The current model is trained on 97-frame videos at 25 FPS (just under 4 seconds). Generating longer sequences may lead to a degradation in performance.
What video resolutions does HuMo support?
The model is compatible with both 480p and 720p resolutions. Using 720p will result in higher quality output.
Do I need a powerful GPU to run HuMo?
Yes. The model, especially the HuMo-17B version, requires a significant amount of VRAM. While specific requirements are not listed, you will need a capable GPU with ample memory to run it locally.
Can I use HuMo without installing it locally?
Currently, there are no known hosted versions of HuMo available. To use it, you must download the models and run the inference code on your own hardware.
How good is the lip-syncing?
The lip-syncing capability is noted as being incredible and is a standout feature of the model, even when using uploaded audio tracks.
What are the `scale_a` and `scale_t` parameters?
These are guidance scales in the configuration file:
- `scale_a` controls the strength of the audio guidance for motion sync.
- `scale_t` controls the strength of the text guidance for prompt adherence.
Where can I download the required models?
All models (HuMo, Wan-2.1, Whisper, and the audio separator) can be downloaded from Hugging Face using the huggingface-cli tool and the links provided in this guide.