
Humo AI: ByteDance's Human-Centric Video Generation


Introduction to Humo AI

I recently came across Humo, developed by ByteDance, and it immediately caught my attention. The project is described as human-centric video generation via collaborative multimodal conditioning. It is a very intelligent approach to getting more controllability out of AI video generation, and the results the model produces are impressive.

It accepts photo references, audio references, and text, combining them all into a single, cohesive output, and the system appears to be highly steerable. It is also open source under the Apache 2.0 license and is built on top of many other open-source AI projects. Let's take a look.


These are all raw outputs from the model, generated with reference material uploaded. It handles basic text-to-video generation with multiple image references and can combine them all in the same scene, and you can add text and audio to that mix and control them. The audio samples uploaded for testing are not particularly cinematic or high quality, so keep that in mind. What deserves close attention, however, is the lip syncing.

The lip syncing is incredible. It seems to do a great job with character consistency and adding objects in, like the football you see here. Furthermore, you can actually prompt it to control the edits.

What is Humo AI?

Humo AI is a unified, human-centric video generation framework. It is designed to produce high-quality, fine-grained, and controllable human videos from multimodal inputs, including text, images, and audio. Humo supports strong text prompt following, consistent subject preservation, and synchronized audio-driven motion.

Key Features of Humo

Humo offers three distinct modes of operation, each providing a different level of creative control.

Video Generation from Text and Image: Customize character appearance, clothing, makeup, props, and scenes using text prompts combined with reference images.

Video Generation from Text and Audio: Generate audio-synchronized videos solely from text and audio inputs, removing the need for image references and allowing greater creative freedom.

Video Generation from Text, Image, and Audio: Combine text, image, and audio guidance at once for the highest level of customization and control.

How to Install Humo AI?

Getting started with Humo requires setting up a Python environment and installing the necessary dependencies. Here is a step-by-step guide.

Step 1: Create a Conda Environment. Open your terminal or command prompt and run the following command to create a new environment named humo with Python 3.11.

conda create -n humo python=3.11

Step 2: Activate the Environment. Activate the new environment you just created.

conda activate humo

Step 3: Install PyTorch. Install the specific versions of PyTorch, TorchVision, and TorchAudio compatible with CUDA 12.4.

pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124

Step 4: Install Flash Attention. Install the Flash Attention library, which is required by the model.

pip install flash_attn==2.6.3

Step 5: Install Requirements. Install the remaining Python packages listed in the project's requirements.txt file.

pip install -r requirements.txt

Step 6: Install FFmpeg. Use Conda to install FFmpeg, which is needed for handling video and audio files.

conda install -c conda-forge ffmpeg

Model Preparation and Download

Before you can generate videos, you need to download the necessary model weights. The following table provides an overview of the required models and where to get them.

Model | Download Link | Notes
HuMo-17B | 🤗 Hugging Face | Supports 480P & 720P
HuMo-1.7B | 🤗 Hugging Face | To be released soon
Wan-2.1 | 🤗 Hugging Face | VAE & text encoder
Whisper-large-v3 | 🤗 Hugging Face | Audio encoder
Audio separator | 🤗 Hugging Face | Removes background noise (optional)

You can download these models using the huggingface-cli tool. Run the following commands in your terminal.

Download the Wan2.1 model:

huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./weights/Wan2.1-T2V-1.3B

Download the main HuMo model:

huggingface-cli download bytedance-research/HuMo --local-dir ./weights/HuMo

Download the Whisper audio encoder:

huggingface-cli download openai/whisper-large-v3 --local-dir ./weights/whisper-large-v3

Download the optional audio separator:

huggingface-cli download huangjackson/Kim_Vocal_2 --local-dir ./weights/audio_separator
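
If the downloads finish successfully, the local weights folder implied by the --local-dir flags above should contain four subdirectories, roughly as follows. This listing is only to confirm the layout; the inference configuration is expected to point at these paths.

weights/
  Wan2.1-T2V-1.3B/    (VAE and text encoder)
  HuMo/               (main generation model)
  whisper-large-v3/   (audio encoder)
  audio_separator/    (optional background-noise removal)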

How to Use Humo for Video Generation

The model is compatible with both 480P and 720P resolutions; 720P inference produces noticeably better quality.

Some tips:

  • Prepare your text, reference images, and audio as described in the project's test_case.json.
  • Multi-GPU inference is supported via FSDP + Sequence Parallel.
  • The model is trained on 97-frame videos at 25 FPS, so generating clips longer than 97 frames may degrade performance. The authors plan to provide a new checkpoint for longer generation.

Configuring HuMo's Settings

HuMo's behavior and output can be customized by modifying the generate.yaml configuration file. The following parameters control generation length, video resolution, and how text, image, and audio inputs are balanced; a sketch of how they might be laid out in the file follows the two lists below.

Generation Parameters:

  • frames: <int> - The number of frames for the generated video.
  • scale_a: <float> - The strength of audio guidance. A higher value means better audio-motion synchronization.
  • scale_t: <float> - The strength of text guidance. A higher value means better adherence to text prompts.
  • mode: "TA" - The input mode. Use "TA" for text and audio; use "TIA" for text, image, and audio.
  • height: 720 - The video height (e.g., 720 or 480).
  • width: 1280 - The video width (e.g., 1280 or 832).

Diffusion Parameters:

  • timesteps → sampling → steps: 50 - The number of denoising steps, nested under the timesteps and sampling keys. A lower number (30–40) allows for faster generation.
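
To make the nesting concrete, here is a minimal sketch of how these keys might be arranged in generate.yaml. The top-level grouping and the numeric values shown are assumptions for illustration; the parameter names themselves come from the lists above, and the actual file shipped with the repository may organize them differently.

# Illustrative generate.yaml fragment (grouping and values are placeholders)
generation:
  frames: 97       # number of frames in the output clip (model is trained on 97-frame videos)
  scale_a: 2.0     # audio guidance strength; higher means tighter audio-motion sync
  scale_t: 7.5     # text guidance strength; higher means closer prompt adherence
  mode: "TIA"      # "TA" for text + audio, "TIA" for text, image, and audio
  height: 720      # 720 or 480
  width: 1280      # 1280 or 832
diffusion:
  timesteps:
    sampling:
      steps: 50    # denoising steps; 30-40 gives faster generation

In practice, the two guidance scales can be balanced against each other: raise scale_t if the output drifts from the prompt, or scale_a if motion and lip-sync lag behind the audio.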

Running the Generation Process

Depending on the input mode you want to use, you will run a different shell script.

  1. For Text and Audio Input: To generate a video using only text and audio inputs, run the following command in your terminal.

bash infer_ta.sh

  2. For Text, Image, and Audio Input: To generate a video using text, image, and audio inputs together, run this command.

bash infer_tia.sh

FAQs About Humo AI

What is Humo AI?

HuMo is a unified, human-centric video generation framework from ByteDance. It is designed to create high-quality, controllable videos of humans using a combination of text, image, and audio inputs.

Is Humo AI open source?

Yes, Humo is released under the Apache 2.0 license. This means it is open source and can be used for both research and commercial purposes.

What are the main input modes for Humo?

Humo supports three primary modes:

  • Text-Image: for appearance control
  • Text-Audio: for audio-synchronized motion
  • Text-Image-Audio: for combined, high-level control

What is the maximum video length Humo can generate?

The current model is trained on 97-frame videos at 25 FPS (just under 4 seconds). Generating longer sequences may lead to a degradation in performance.

What video resolutions does Humo support?

The model is compatible with both 480p and 720p resolutions. Using 720p will result in higher quality output.

Do I need a powerful GPU to run Humo?

Yes. The model, especially the HuMo-17B version, requires a significant amount of VRAM. While specific requirements are not listed, you will need a capable GPU with ample memory to run it locally.

Can I use Humo without installing it locally?

Currently, there are no known hosted versions of Humo available. To use it, you must download the models and run the inference code on your own hardware.

How good is the lip-syncing?

Lip-syncing is a standout feature of the model. In the raw outputs reviewed above, it remained accurate even with the uploaded audio tracks.

What are the scale_a and scale_t parameters?

These are guidance scales in the configuration file:

  • scale_a controls the strength of the audio guidance for motion sync.
  • scale_t controls the strength of the text guidance for prompt adherence.

Where can I download the required models?

All models—including HuMo, Wan-2.1, Whisper, and the audio separator—can be downloaded from Hugging Face using the huggingface-cli tool and the links provided in this guide.
