Install Qwen3-Omni Thinking with vLLM: Text, Image, Audio, Video

Introduction

Qwen has released three variants of its Qwen3-Omni family. I previously covered the Omni Instruct model. In this article, I install the “Thinking” variant locally with vLLM, explain what it is, outline its architecture, and share results from a multilingual, multimodal test combining text, image, audio, and video. I also highlight current limitations, resource needs, and practical tips for serving it on your own machine.

What Is Qwen3-Omni “Thinking”?

Qwen3-Omni “Thinking” is a natively multimodal foundation model designed for real-time, voice-forward interaction across:

  • Text
  • Images
  • Audio
  • Video

It was trained with early text-first pretraining and mixed multimodal data, aiming to retain its core NLP and vision strengths while adding strong audio and audiovisual abilities. The model emphasizes multilingual performance:

  • Around 119 text languages supported
  • Speech input in 19 languages (see the model card for the list)
  • Spoken output in 10 languages

In practice, that means you can prompt, converse, and transcribe across languages and even switch mid-dialogue. The model separates reasoning and speech generation, which helps it maintain complex cross-modal reasoning while producing fluent, low-latency audio responses.

Why I’m Installing the Thinking Variant

My goal here is to install and serve Qwen3-Omni Thinking locally with vLLM, then run a realistic test that combines multiple modalities at once. I focus on setup, resource usage, and the model’s multilingual behavior on a translation task supplied through audio, video, and image inputs. I’ll explore more of the architecture in future work, but this piece stays aligned with installation and initial results.

System and Setup

I used Ubuntu with an NVIDIA H100 (80 GB VRAM). For a smooth experience with large context lengths and multimodal inputs, you’ll want a high-memory GPU in this class. The model weights alone consumed about 60 GB of VRAM at load on my system, and usage climbed into the mid-70 GB range during inference as the KV cache grew (VRAM has to hold both the weights and the KV cache).

If you run out of memory, reduce the maximum context length when serving the model.
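
Before serving, it helps to check how much headroom you actually have. Here is a minimal sketch, assuming PyTorch is available in your environment (it is installed alongside vLLM):

```python
# One-shot check of free vs. total VRAM on the current CUDA device.
import torch

free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"Free:  {free_bytes / 1024**3:.1f} GiB")
print(f"Total: {total_bytes / 1024**3:.1f} GiB")
```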

Installing with vLLM

I installed and served the model using vLLM. The basic process:

  1. Clone vLLM
  2. Install requirements
  3. Install Transformers
  4. Install the Qwen Omni utilities listed on the model card
  5. Serve the model with your desired context length and precision

This sequence takes a few minutes on a typical setup.

Step-by-Step (High-Level)

  • Step 1: Clone vLLM and install dependencies

    • Clone the vLLM repository
    • Install vLLM and its requirements
    • Install a recent Transformers version
    • Install the Qwen Omni utilities referenced in the model card
  • Step 2: Prepare your environment

    • Confirm your GPU drivers and CUDA/cuDNN are working
    • Ensure sufficient VRAM (80 GB is a practical baseline for large contexts)
    • Confirm Python 3.10+ (or the version recommended by vLLM)
  • Step 3: Serve the model

    • Start vLLM with:
      • max context length around 32,000 (adjust based on VRAM)
      • bfloat16 precision (2-byte tensor storage)
      • permission for local media paths if you plan to load local files (images, audio, video)
  • Step 4: Connect a client

    • Use vLLM’s OpenAI-compatible endpoint
    • Send text, image, audio, and video inputs together as your task requires
    • If you hit an OOM error, lower the max context length and retry

I set context length to about 32,000 and dtype to bfloat16. If VRAM is tight, reduce the context length. Downloading the model will fetch multiple shards; just let vLLM complete the process.
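
For reference, the launch looks roughly like the sketch below, wrapped in Python so it can sit in a script. The checkpoint ID and the --allowed-local-media-path flag are assumptions based on the model card and recent vLLM releases; confirm both against the model card and `vllm serve --help`.

```python
# Sketch: start the vLLM OpenAI-compatible server with the settings used in this post.
# The checkpoint ID and flag names are assumptions; verify them before running.
import subprocess

cmd = [
    "vllm", "serve", "Qwen/Qwen3-Omni-30B-A3B-Thinking",  # checkpoint ID from the model card
    "--max-model-len", "32000",                           # lower this first if you hit OOM
    "--dtype", "bfloat16",
    "--allowed-local-media-path", "/data/media",          # lets requests reference local files
]
subprocess.run(cmd, check=True)
```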

Table Overview

Here’s a quick overview of the points covered so far:

Item | Details
Model | Qwen3-Omni “Thinking”
Modality inputs | Text, images, audio, video
Modality outputs | Text, speech (see current limitations below)
Multilingual text | ~119 languages
Speech input | 19 languages (see model card)
Speech output | 10 languages
Reasoning | Thinker–Talker split with Mixture-of-Experts routing
Speech generation | Streaming codec decoder with multi-token prediction (MTP)
Precision used here | bfloat16
Max context length (tested) | ~32,000 tokens (reduce if OOM)
GPU used in testing | NVIDIA H100 80 GB
VRAM observed | ~60 GB at load; mid-70 GB during inference
Serving stack | vLLM + Transformers + Qwen Omni utilities
Current limitation | Audio output not available via vLLM/Transformers at the time of testing

Key Features

  • Truly multimodal inputs: You can pass text, image(s), audio, and video together in one request.
  • Multilingual focus: Robust support for a large set of text languages, with speech input and output across a subset.
  • Thinker–Talker separation:
    • Thinker: handles multi-modal tokens (text embeddings, vision encoder outputs, audio codec embeddings) and long-context reasoning.
    • Talker: drives natural speech generation through a streaming codec decoder.
  • Mid-layer hidden extraction: Hidden states are tapped from a middle layer of the Thinker and fed to the Talker for speech generation.
  • MTP (multi-token prediction): Speeds up speech synthesis with low latency.
  • Unified input tokens: Modality-specific hidden tokens (text, vision, audio codec) get packed into one sequence; outputs can be text or speech with configurable voices.

Architecture: Thinker and Talker

Qwen3-Omni Thinking uses a Mixture-of-Experts (MoE) transformer split into two functional components:

Thinker

  • The Thinker performs multi-step reasoning over long contexts.
  • It ingests:
    • Text embeddings
    • Vision encoder outputs
    • Audio codec embeddings
  • It maintains the internal reasoning state across modalities.

Talker

  • The Talker turns internal states into natural speech via a streaming codec decoder.
  • It uses MTP to generate multiple tokens per step for faster speech synthesis and lower latency.
  • Outputs are configurable for voices and languages.

Mid-Layer Hidden Extraction

The “hidden extraction” step taps the Thinker’s internal states at a middle layer and routes them into the Talker. This keeps reasoning and speech generation synchronized while allowing the Thinker to focus on cross-modal context and the Talker on fluent audio.

Unified Inputs and Flexible Outputs

Inputs are packed as modality-specific hidden tokens into one sequence. The model can output text or speech. In multilingual settings, you can converse in one language and have output spoken in another. You can also switch languages mid-dialogue. In short:

  • Input types: text, image, audio, video
  • Output types: text, speech
  • Voice: configurable speakers for audio responses
  • Language: flexible prompting across many languages

Current Limitation: Audio Output in vLLM/Transformers

At the time of testing, the code released in vLLM and Transformers did not produce audio output locally. I tried for about an hour with no success. For now, I proceeded with text output only. Once the required changes land in vLLM or Transformers, audio output should become available. I’ll revisit this in future work as soon as the serving stack supports it.

Serving the Model with vLLM

I started the server with:

  • Context length around 32,000
  • bfloat16 precision
  • Local media path allowed for loading images, audio, and video from disk

The download included multiple shards. After the model loaded, VRAM usage sat around 60 GB. During inference, VRAM rose into the mid-70 GB range due to KV cache growth. If you only have 80 GB available and encounter OOM errors, reduce the context length and restart the server.

How I Tested It

Qwen3-Omni models are best evaluated with multiple modalities together. I combined:

  • An audio file with a translation prompt
  • A video file with imagery representing a multilingual theme
  • An abstract image also themed around multilinguality

I asked the model to translate a sentence into a range of languages in the same run, feeding the audio, image, and video together. I used the text output path for now due to the audio output limitation mentioned earlier.
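
For illustration, a request along these lines can be sent through the OpenAI-compatible endpoint. This is only a sketch: the file paths are placeholders for my test assets, and the audio_url/video_url content parts are vLLM’s multimodal extensions to the chat API, so check the vLLM docs for your version.

```python
# Sketch of one multimodal request: text instruction plus image, audio, and video.
# Paths are placeholders; audio_url/video_url are vLLM extensions to the chat API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-Omni-30B-A3B-Thinking",  # must match the served checkpoint
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Translate the sentence spoken in the audio literally into "
                     "the listed languages. End with a final formatted section, "
                     "one line per language."},
            {"type": "image_url", "image_url": {"url": "file:///data/media/multilingual.png"}},
            {"type": "audio_url", "audio_url": {"url": "file:///data/media/prompt.wav"}},
            {"type": "video_url", "video_url": {"url": "file:///data/media/theme.mp4"}},
        ],
    }],
    max_tokens=1024,
)
print(response.choices[0].message.content)
```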

Resource Checks During Inference

While running the test, VRAM usage hovered in the mid-70 GB range. That aligned with expectations for the chosen context length. If you get OOM on a similar GPU, lower the max context length.
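
If you prefer to log usage from Python rather than watching a second terminal, a small poll of nvidia-smi does the job:

```python
# Poll GPU memory every few seconds while requests are running
# (the same numbers you would see in nvidia-smi).
import subprocess
import time

for _ in range(10):
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    print(out.stdout.strip())
    time.sleep(5)
```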

Results: Multilingual Translation Behavior

The model returned a structured response. It identified the sentence from the audio and proceeded with translations into the requested languages. I reviewed the output and saw a mixed picture:

  • It did a reasonable job on several languages.
  • In some outputs (for example, Hindi, Korean, Bengali), it occasionally inserted English words within the translation. That degrades the quality of a strictly literal translation.
  • Chinese outputs looked stronger and more consistent.
  • The model sometimes described what it was doing in English instead of sticking to pure translation. That’s not helpful for a literal translation request.

An interesting detail: I intentionally listed one language twice. The model noticed the duplication and flagged it. The final answer section was much cleaner, with more literal translations and fewer insertions of English, and it correctly pointed out the duplicate language.

Overall, it handled the multilingual request adequately, with clear room for improvement on literal-only outputs for some languages. The Instruct variant did a bit better here in my earlier testing. The Thinking variant appears to need more refinement for strict translation tasks, although its final formatted output was solid.

Key Takeaways from the Test

  • Single multimodal request: You can send text, image, audio, and video together and ask for one integrated answer.
  • Multilingual capacity: It can translate across many languages, though some combinations showed mixed quality when strict literal output is expected.
  • Stronger on Chinese: The model tended to produce more consistent output for Chinese.
  • Final answers improved: The last formatted section was cleaner and more literal, and it flagged duplicated items correctly.
  • Audio output not yet available locally via vLLM/Transformers: For now, plan on text output for local runs. Audio output will come once serving support lands.

How to Use Qwen3-Omni Thinking Locally

Below is a practical, step-by-step outline aligned with what I did. Keep your model card open for any repository-specific details.

Prerequisites

  • Linux system (Ubuntu or similar)
  • NVIDIA GPU with large VRAM (80 GB recommended for large contexts)
  • Recent NVIDIA drivers, CUDA, and cuDNN
  • Python environment (virtualenv or conda recommended)

Installation Steps

  1. Clone vLLM
    • Get the vLLM repository.
  2. Install vLLM and dependencies
    • Install the package and its requirements.
  3. Install Transformers
    • Use a recent version that matches the model’s requirements.
  4. Install Qwen Omni utilities
    • Follow the model card for the exact package names and versions to install.
  5. Verify GPU access
    • Confirm CUDA is detected and vLLM can see your GPU (see the check after this list)
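
A minimal verification sketch, assuming PyTorch, vLLM, and Transformers are already installed:

```python
# Confirm CUDA is visible and the key packages import cleanly.
import torch
import transformers
import vllm

assert torch.cuda.is_available(), "CUDA not detected"
print("GPU:", torch.cuda.get_device_name(0))
print("VRAM (GiB):", round(torch.cuda.get_device_properties(0).total_memory / 1024**3, 1))
print("vLLM:", vllm.__version__, "| Transformers:", transformers.__version__)
```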

Serving the Model

  • Start vLLM with:
    • model set to the Qwen3-Omni Thinking checkpoint ID from the model card
    • max model length around 32,000 (reduce if OOM)
    • bfloat16
    • local media path allowed (if you plan to load local files)
  • Wait for all shards to download and load
  • Confirm the OpenAI-compatible endpoint is running

Example options to consider when launching:

  • --max-model-len 32000
  • --dtype bfloat16
  • Enable local media path in the server or client request if required by your setup

Inference Guidelines

  • Use the OpenAI-compatible endpoint to send requests
  • Compose a single request that includes:
    • Text instruction(s)
    • Paths or references to an image
    • Paths or references to an audio file
    • Paths or references to a video file
  • Make sure local media path access is enabled if you’re passing local file paths
  • Keep an eye on VRAM in a separate terminal (e.g., nvidia-smi)
  • If you see OOM:
    • Reduce the max context length
    • Try smaller inputs or fewer media items per request

Practical Tips

  • Keep requests focused: Pack your multimodal inputs into one coherent prompt instead of multiple fragmented calls.
  • Watch precision and context:
    • bfloat16 helps with memory efficiency
    • Large context lengths significantly increase KV cache usage
  • Use structured instructions for translation tasks (see the prompt sketch after this list):
    • Request literal translations if that’s what you need
    • Ask for a final formatted section to reduce mid-output commentary
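
To make those two tips concrete, here is the kind of instruction I mean. The wording is a hypothetical sketch, not the exact prompt from my test:

```python
# Hypothetical prompt skeleton for a strict translation request: literal output
# only, with a final formatted section to keep commentary out of the answer.
prompt = (
    "Listen to the audio and translate the spoken sentence literally into "
    "Hindi, Korean, Bengali, and Chinese. Do not describe what you are doing. "
    "End with a section titled 'Final Translations' containing one line per "
    "language in the form '<language>: <translation>'."
)
```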

FAQs

Do I need an H100 with 80 GB of VRAM?

For the configuration I used (bfloat16 and ~32k context length), the H100 80 GB was a good match. At load, VRAM usage was around 60 GB, and during inference it climbed into the mid-70 GB range. If your GPU has less memory, reduce the max context length or simplify your multimodal inputs.

Can it output audio locally right now?

During my tests, audio output was not working through vLLM/Transformers. I used text outputs only. Once the necessary updates are merged, audio output should become available in local runs.

What languages are supported?

  • Around 119 text languages
  • Speech input in 19 languages
  • Speech output in 10 languages

Check the model card for the exact language lists.

What’s the difference between the Thinker and the Talker?

  • Thinker handles reasoning across modalities and long contexts.
  • Talker turns mid-layer states into fluent speech with a streaming codec decoder and MTP.

What precision should I use?

bfloat16 worked well for me and reduces memory usage compared to fp32.

How can I avoid out-of-memory errors?

  • Reduce the max context length
  • Decrease the number or size of multimodal inputs
  • Restart the server after adjusting settings

Does it support real-time, voice-forward interaction?

The architecture is designed for that, pairing the Thinker’s reasoning with the Talker’s streaming speech. Local audio output support depends on your serving stack; at the time of testing, it wasn’t available via vLLM/Transformers.

How did it perform on multilingual translation?

Mixed. It handled many languages reasonably well, but some outputs inserted English words within translations. Chinese results looked stronger. The final formatted section was much cleaner and more literal, and it correctly flagged a duplicated language in my request.

Conclusion

Qwen3-Omni “Thinking” brings text, images, audio, and video into a single model with a design that separates reasoning from speech generation. Installing it locally with vLLM is straightforward if you have a high-VRAM GPU. In my tests, it handled a single multimodal request that included audio, video, and an image, returning multilingual translations with a mix of solid and imperfect results. The Chinese outputs stood out, while some other languages showed occasional English insertions in earlier parts of the response. The final formatted answer was much cleaner and even caught a duplicated language in the prompt.

Right now, audio output through vLLM/Transformers isn’t functioning locally, so I used text output. Once serving support lands, I expect to revisit speech responses and latency in practice. For teams working with multimodal inputs and multilingual text, the Thinking variant already offers a capable foundation for unified requests—just plan your VRAM headroom, watch context length, and keep your prompts explicit about output format.
