What is HunyuanVideo 1.5 Text-to-Video?


After a brief pause, new Chinese video models are rolling out again. HunyuanVideo-1.5 is a lightweight video generation model that I installed locally and tested on standard hardware. My goal was simple: set it up on Ubuntu, check the actual VRAM usage, and generate videos from text and reference images to see how far it has come.

I ran everything on an NVIDIA RTX A6000 (48 GB VRAM). The model card specifies a minimum of 14 GB of VRAM, so I monitored memory closely during generation. The first run downloaded a large checkpoint—around 33 GB—so disk space matters. Once the UI launched, I tested several prompts and inputs to evaluate quality, speed, and prompt following.

What is HunyuanVideo-1.5?

HunyuanVideo-1.5 is a video generation model designed to create videos from text or image inputs on consumer GPUs. It targets professional-grade visual quality and uses a diffusion transformer architecture paired with a 3D variational autoencoder.

At its core, the model has 8.3 billion parameters and introduces selective and sliding tile attention to reduce redundant spatiotemporal computation. It supports bilingual text understanding (Chinese and English) and integrates multiple encoders to handle text, glyph-based text, and visual input. The generation process follows a two-stage strategy: base synthesis and super-resolution.
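
To make the attention saving concrete, the toy calculation below compares the number of token pairs a full global attention would score against attention restricted to small spatiotemporal tiles. It is a generic windowed-attention illustration with made-up latent and tile sizes, not the exact selective or sliding tile attention described for HunyuanVideo-1.5.

    # Toy comparison: global attention cost vs. attention inside small
    # spatiotemporal tiles. Latent and tile sizes are illustrative only,
    # not the values HunyuanVideo-1.5 actually uses.
    frames, height, width = 8, 16, 16        # latent grid (made-up numbers)
    tokens = frames * height * width         # 2,048 tokens

    global_pairs = tokens ** 2               # every token attends to every token

    tile = 4                                 # 4 x 4 x 4 spatiotemporal tile
    tokens_per_tile = tile ** 3              # 64 tokens per tile
    num_tiles = tokens // tokens_per_tile    # 32 tiles
    tiled_pairs = num_tiles * tokens_per_tile ** 2

    print(f"global attention pairs: {global_pairs:,}")
    print(f"tiled attention pairs:  {tiled_pairs:,} ({tiled_pairs / global_pairs:.1%} of global)")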

HunyuanVideo-1.5 Overview

Item | Details
Name | HunyuanVideo-1.5
Type | Text-to-video and image-to-video generation
Parameters | ~8.3B
Checkpoint Size | ~33 GB (first download)
GPU VRAM Requirement | Minimum ~14 GB (observed ~15 GB during runs)
Inference Interface | Gradio web UI (localhost:7860)
Supported Input Modalities | Text, text with reference image
Language Support | Chinese and English prompts
Architecture | Diffusion transformer + 3D VAE
Attention Methods | Selective and sliding tile attention
Positional Encoding | 3D rotary positional encoding
Generation Strategy | Two-stage: base generation + SR DiT for super-resolution

Key Features of HunyuanVideo-1.5

Architecture Highlights

  • Diffusion transformer backbone with ~8.3B parameters.
  • 3D variational autoencoder for spatiotemporal latent representation.
  • Selective and sliding tile attention to prune redundant computations across space and time.
  • 3D rotary positional encoding inside dual-stream transformer blocks.
  • Iterative denoising with a super-resolution stage (SR DiT) that refines and enhances features.

Multimodal Processing Pipeline

  • Multiple input encoders:
    • A multimodal large language model with a token refiner for richer text understanding.
    • Glyph ByT5 with a ByT5 projector for bilingual glyph-based text encoding.
    • SigLIP with a vision projector for visual feature extraction.
    • A video encoder with patch embeddings to convert frames into latent representations.
  • Dual-stream blocks mix self-attention and sparse attention, bridged by residual connections and MLPs to merge and refine multimodal information.
  • The design is modular, feeding all encoded features forward into a coherent synthesis process.

Two-Stage Generation

  • Base diffusion transformer produces a low-resolution latent video from pure noise guided by the prompt and encoders.
  • The base latent is upsampled and concatenated with higher-resolution features.
  • The SR DiT progressively refines the latent until the final video emerges (a simplified sketch of this flow follows below).
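
The list above can be condensed into a short sketch. Everything below is a simplified stand-in that assumes PyTorch is installed: the denoising update, the shapes, and the function names are placeholders that mirror the described base-then-super-resolution flow, not the repository’s actual API.

    # Conceptual sketch of the two-stage flow: a base stage denoises a
    # low-resolution latent from pure noise, the latent is upsampled, and an
    # SR stage refines it. The update rule and shapes are placeholders, not
    # the real HunyuanVideo-1.5 pipeline.
    import torch
    import torch.nn.functional as F

    def denoise(latent, steps, label):
        # Stand-in for iterative diffusion denoising guided by the prompt.
        for t in range(steps, 0, -1):
            predicted_noise = 0.1 * torch.randn_like(latent)  # a real DiT predicts this
            latent = latent - predicted_noise / t             # toy update rule
        print(f"{label}: latent shape {tuple(latent.shape)}")
        return latent

    # Stage 1: the base DiT produces a low-resolution latent video from noise.
    base_latent = torch.randn(1, 4, 16, 30, 53)               # (batch, ch, frames, H, W)
    base_latent = denoise(base_latent, steps=20, label="base stage")

    # Stage 2: upsample spatially and let the SR DiT refine the result.
    hi_latent = F.interpolate(base_latent, scale_factor=(1, 2, 2), mode="trilinear")
    hi_latent = denoise(hi_latent, steps=10, label="SR stage")
    # A 3D VAE decoder would then turn this latent into the output frames.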

Setup and Local Installation

I used Ubuntu with a single NVIDIA RTX A6000 (48 GB VRAM). The model card’s 14 GB VRAM guidance was accurate in practice. A clean Python environment is recommended.

System Used

  • OS: Ubuntu
  • GPU: NVIDIA RTX A6000 (48 GB VRAM)
  • VRAM usage observed: ~15 GB during generation
  • Storage: Ensure at least 40 GB free for checkpoints and caches

Step-by-Step Installation

  • Create and activate a virtual environment:
    • python3 -m venv venv
    • source venv/bin/activate
  • Clone the HunyuanVideo repository:
    • git clone <HunyuanVideo repository URL>
    • cd <repository directory>
  • Install dependencies:
    • pip install -r requirements.txt
  • Launch the Gradio app:
    • python sub.py

The first run will download the main checkpoint (around 33 GB). Pick a released checkpoint to start. Keep the terminal open while it initializes and applies any patches.
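
Because the first launch pulls roughly 33 GB of weights, it is worth confirming free disk space beforehand. The snippet below is just a convenience check using Python’s standard library; adjust the path to wherever your checkpoints and caches will live.

    # Pre-flight check: confirm there is enough free disk space for the
    # ~33 GB checkpoint plus caches. The 40 GB threshold mirrors the
    # storage guidance above.
    import shutil

    path = "."                          # change to your checkpoint/cache location
    free_gb = shutil.disk_usage(path).free / 1024 ** 3
    print(f"free space at {path!r}: {free_gb:.1f} GB")
    if free_gb < 40:
        print("Warning: less than 40 GB free; the first download may fail.")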

First Launch Notes

  • Access the UI at http://localhost:7860.
  • If you see a blank or loading UI, wait for the initial checkpoint download to complete.
  • Disk and network speed can make the first launch take time.
  • VRAM usage scales with active models and batch settings, but remained steady in my tests even when increasing steps and guidance (a quick way to check free VRAM from Python is shown below).
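
For the VRAM point above, a few lines of PyTorch (assuming torch with CUDA support is installed) report how much headroom the GPU has before and during a run; nvidia-smi gives the same information from the terminal.

    # Report free and total VRAM on the first CUDA device. Useful before
    # launching the app and while a generation is running.
    import torch

    if torch.cuda.is_available():
        free_bytes, total_bytes = torch.cuda.mem_get_info()
        gib = 1024 ** 3
        print(f"GPU: {torch.cuda.get_device_name(0)}")
        print(f"free VRAM:  {free_bytes / gib:.1f} GiB")
        print(f"total VRAM: {total_bytes / gib:.1f} GiB")
    else:
        print("No CUDA device visible to PyTorch.")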

Running the Gradio App

Interface Basics

  • Upload a reference image if you want image-to-video generation.
  • Enter a prompt describing the motion or action.
  • Adjust steps and guidance (CFG) to trade off speed and fidelity to the prompt.
  • Use default settings first, then only tweak what’s necessary; the minimal sketch after this list shows the same controls.
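
To give a feel for those controls, here is a minimal Gradio sketch with the same four inputs (reference image, prompt, steps, guidance). It is not the repository’s actual app; generate_video is a hypothetical placeholder standing in for the real pipeline call.

    # Minimal Gradio sketch mirroring the demo's controls (image, prompt,
    # steps, CFG). NOT the repository's actual app; generate_video is a
    # placeholder for the real video pipeline.
    import gradio as gr

    def generate_video(image, prompt, steps, guidance):
        # The real app would run the diffusion pipeline here and return a video path.
        has_image = "yes" if image is not None else "no"
        return f"Would generate: '{prompt}' | steps={steps}, cfg={guidance}, image={has_image}"

    demo = gr.Interface(
        fn=generate_video,
        inputs=[
            gr.Image(type="filepath", label="Reference image (optional)"),
            gr.Textbox(label="Prompt"),
            gr.Slider(1, 50, value=20, step=1, label="Steps"),
            gr.Slider(1.0, 10.0, value=5.0, label="Guidance (CFG)"),
        ],
        outputs=gr.Textbox(label="Status"),
    )

    if __name__ == "__main__":
        demo.launch(server_port=7860)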

VRAM Usage and Speed

  • VRAM usage stayed around 15 GB during generation, just above the stated 14 GB requirement.
  • With the model fully loaded, increasing steps and guidance did not cause large VRAM spikes.
  • Generation time scaled with steps:
    • Around 2 minutes for quicker runs.
    • 10–20+ minutes when steps and guidance were maxed.

Test 1: Image-to-Video — Talking Head

Settings

  • Input: A reference image (portrait).
  • Prompt: “A man talking into a microphone, looking at the camera.”
  • Steps: Default.
  • Guidance: Default.

Results

  • The video generated in under 2 minutes.
  • VRAM held steady near 15 GB.
  • Notable improvements compared to earlier iterations:
    • Eye movement felt more consistent.
    • Facial expressions were maintained better.

Test 2: Image-to-Video — Object Interaction

Settings

  • Input: An image of a model holding a banana.
  • Prompt: “A girl peeling a banana and smiling.”
  • Steps: Increased to 20.
  • Guidance (CFG): Increased to improve prompt following.
  • Other settings: Defaults left unchanged.

Results

  • Longer runtime than Test 1, as expected with higher steps and guidance.
  • VRAM stayed close to the same ~15 GB range.
  • Observations:
    • Facial expressions were solid.
    • Teeth rendering needs improvement but was not malformed.
    • The banana size changed slightly during motion.
    • Hands were acceptable; minor issues in finger visibility.
    • Lighting enhancements appeared on the face compared to the source image.

Overall, the model retained the original image identity and added motion with reasonable quality for this scenario.

Test 3: Text-Guided Precision — Clock Hands

Settings

  • Input: An image of a clock.
  • Prompt: “Move the hands to 3:45.”
  • Steps: 50 (maximum in this test).
  • Guidance (CFG): 5.
  • Expectation: Precise control over small object motion.

Results

  • Runtime: About 20 minutes.
  • VRAM: Around 15 GB.
  • Outcome:
    • The model’s internal prompt handling translated the instruction correctly (the Chinese translation matched 3:45).
    • The generated video did not follow the instruction accurately; only minor movement occurred.
    • High steps and guidance did not resolve the control issue for this case.

This scenario exposed limitations in precise, fine-grained control of object states and positioning.

Observations and Tips

  • VRAM and Runtime

    • Expect ~15 GB VRAM usage once loaded, even with higher steps and guidance.
    • Steps affect time more than memory; plan for longer runs at higher values.
  • Prompting

    • Short, clear prompts work best.
    • Raising guidance helps with adherence but has diminishing returns for complex spatial tasks.
  • Quality Controls

    • 20 steps provided a reasonable balance between time and quality in many cases.
    • Super-resolution refinement is built in; no need to add external upscalers during initial tests.
  • Input Images

    • Identity and facial consistency are maintained well.
    • Minor artifacts may appear in teeth and hands; these are common trouble spots for many models.
  • Languages

    • The model processes Chinese and English prompts.
    • Internal prompt handling may involve translation or refinement steps.

Limitations

  • Precise spatial control over small elements (e.g., clock hands) remains challenging, even with high steps and guidance.
  • Teeth and hand details still need improvement in some outputs.
  • First-time setup is heavy on storage due to the ~33 GB checkpoint.
  • Real-time iteration is not practical at high step counts; expect longer waits for complex prompts.

Practical Setup Checklist

  • Hardware

    • GPU with at least 14 GB VRAM.
    • Sufficient disk (40+ GB free for checkpoints and caches).
    • Stable network connection for the initial download.
  • Software

    • Ubuntu or a compatible Linux environment.
    • Python 3 with virtual environment setup.
    • Latest NVIDIA drivers and CUDA stack appropriate for your environment.
  • Project

    • Clone the HunyuanVideo repo.
    • Install requirements in a clean virtual environment.
    • Launch the Gradio UI with python sub.py.
    • Keep the terminal open for logs during the first run.

Basic Usage Guide

  • Text-to-Video

    • Enter a concise prompt describing the scene and motion.
    • Start with steps ~20 and default guidance.
    • Increase guidance gradually only if the output ignores key prompt elements.
  • Image-to-Video

    • Upload a reference image (portrait or object scene).
    • Describe the motion you want added to the image.
    • Keep steps around 20 for quick iterations, then adjust if needed.
  • Performance Tuning

    • If runs are too slow, reduce steps.
    • Keep batch size at 1.
    • Avoid unnecessary toggles in advanced settings for initial trials.

Troubleshooting Notes

  • Model Not Loading

    • Confirm disk space and re-run python sub.py.
    • Check that requirements installed successfully.
  • UI Not Accessible

    • Verify the port (default 7860); a quick port check is sketched after this list.
    • Ensure the process is running and no firewall is blocking local access.
  • Memory Errors

    • Close other GPU processes.
    • Restart the session and lower resolution or steps if configurable.
  • Outputs Ignore Prompt Details

    • Increase guidance in small increments.
    • Rephrase the prompt for clarity.
    • Test with fewer constraints first, then add details gradually.
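
For the port check mentioned above, a short standard-library snippet confirms whether anything is listening on 7860; curl http://localhost:7860 from another terminal works just as well.

    # Check whether something is listening on the Gradio port (default 7860).
    import socket

    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(2)
        result = sock.connect_ex(("127.0.0.1", 7860))

    if result == 0:
        print("Port 7860 is open; the Gradio app appears to be running.")
    else:
        print("Nothing is listening on port 7860; check the app logs.")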

Conclusion

HunyuanVideo-1.5 runs locally with about 14–15 GB of VRAM and produces solid results for text-to-video and image-to-video tasks, especially with faces and general motion. The installation is straightforward, the Gradio interface is easy to work with, and VRAM use remained stable across different settings in my tests.

Quality improves with more steps and guidance, but run times increase and certain control tasks—like precise clock-hand positioning—still fall short. For image-driven motion and expressive talking heads, it shows clear progress over earlier versions. For fine-grained object control, it has room to grow.

If you have a GPU in the 14–16 GB range, you can run it locally and expect workable generation times at moderate settings. Keep prompts clear, start with default parameters, and adjust steps and guidance only as needed.
