VibeVoice in ComfyUI: Single & Multi-Speaker Voice Cloning

Table of Contents
- Introduction
- What is Microsoft VibeVoice with ComfyUI?
- Overview
- Key Features
- How VibeVoice with ComfyUI Works
- Core idea
- Single-speaker workflow
- Multi-speaker workflow
- How to Use It
- Prerequisites
- Step 1: Install the custom nodes
- Step 2: Launch ComfyUI
- Step 3: Load the example workflows
- Step 4: Configure the Single Speaker workflow
- Step 5: Fix “model loading failed” (if it appears)
- Step 6: Confirm model download and generation
- Step 7: Test a second reference voice
- Multi-Speaker Voice Cloning
- Load the multi-speaker workflow
- Add speakers and dialogue
- Run and review
- Using a Text File for Input
- Inline vs. file-based input
- Steps to connect a text file
- Multilingual Cloning Results
- What I tested
- Observations
- Practical notes
- Detailed Setup and Run Guide
- Single-Speaker: Step-by-step
- Multi-Speaker: Step-by-step
- Multilingual: Step-by-step
- Model Selection and Resource Notes
- Choosing the right model
- CPU vs. GPU
- Common Troubleshooting
- “Model loading failed” due to transformers version
- File input not loading
- Voice assignment mix-ups in multi-speaker runs
- FAQs
- Can I run this on CPU?
- How much disk space do I need?
- How much VRAM does it use?
- What audio formats are supported for the reference voice?
- How do I fix a transformers-related error?
- Where do I put my dialogue text file?
- Can I add more than two speakers?
- Does multilingual cloning work?
- Do I need constant internet access?
- What affects expressiveness and timing?
- Conclusion
Introduction
This guide shows how to install and use the community ComfyUI custom nodes for Microsoft VibeVoice to clone voices locally. I’ll walk you through the single-speaker and multi-speaker workflows, show how to choose the right model and fix a common setup error, and finish with multilingual cloning. You’ll see what to expect on CPU vs. GPU and how to feed dialogue from either inline text or a file.
I’m working on Ubuntu with an NVIDIA RTX A6000 (48 GB VRAM), but you can run everything on CPU too. The steps below follow the exact order I used: setup, testing, troubleshooting, and results.
What is Microsoft VibeVoice with ComfyUI?
Microsoft VibeVoice is a voice cloning model that can mimic a speaker from a short audio sample. Through community-made custom nodes, VibeVoice plugs into ComfyUI so you can run it locally with visual workflows.
With these nodes, you can:
- Clone a single speaker from a reference audio file.
- Run multi-speaker dialogue with separate references for each voice.
- Input text directly in ComfyUI or from a text file node.
- Generate speech in multiple languages using the cloned voice.
Overview
| Component | Purpose | Notes |
| --- | --- | --- |
| ComfyUI | Node-based UI for local AI workflows | Assumed installed before starting |
| VibeVoice custom nodes | Adds VibeVoice voice cloning to ComfyUI | Clone into ComfyUI/custom_nodes |
| VibeVoice 1.5B model | Good for single-speaker cloning | Lower resource requirements |
| VibeVoice Large model | Best for multi-speaker work | ~17 GB disk, ~18 GB VRAM during generation |
| Input formats | Reference audio files | Supports WAV and MP3 |
| Text input | Content to speak | Inline text or connect a text-file node |
| Port | Access in browser | Default used here: http://localhost:8188 |
| OS used here | Ubuntu | CPU mode also works (slower) |
Key Features
- Local voice cloning for single and multi-speaker workflows in ComfyUI.
- Model selection inside the workflow (1.5B or Large).
- Text input inline or via a file node for longer scripts.
- Multilingual synthesis from a single reference voice.
- Works on GPU or CPU (GPU is faster; CPU is viable for testing).
- Example JSON workflows included in the repo.
- Practical troubleshooting for a known transformers version error.
- Supports MP3 and WAV reference audio files.
- Ability to add more speakers to the multi-speaker workflow.
How VibeVoice with ComfyUI Works
Core idea
- Provide a short reference audio clip of a speaker.
- Select a VibeVoice model.
- Enter the text you want the cloned voice to speak.
- Generate speech that follows the voice characteristics of the reference.
Single-speaker workflow
- One reference audio input.
- One text input.
- The model extracts voice characteristics and synthesizes speech in the same voice.
Multi-speaker workflow
- Multiple reference audio inputs (one per speaker).
- A structured dialogue text input (inline or from a file).
- The workflow routes lines to the correct speaker following the order you define.
How to Use It
Prerequisites
- ComfyUI installed locally.
- Python and Git available.
- A few short, clear reference audio files (WAV or MP3).
- Enough disk space for the model download (Large model ~17 GB).
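Before installing anything, it’s worth confirming these basics from a terminal (the ComfyUI path below is illustrative):
# Check that Python and Git are available
python3 --version
git --version
# Check free disk space where ComfyUI lives (the Large model needs ~17 GB)
df -h /path/to/ComfyUI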
Step 1: Install the custom nodes
- Open a terminal and go to your ComfyUI directory.
- Navigate to the custom_nodes folder.
- Clone the VibeVoice custom nodes repo.
Example:
cd /path/to/ComfyUI
cd custom_nodes
git clone <repo_url_for_vibevoice_custom_nodes>
If needed, install the node requirements:
cd <cloned_repo_directory>
pip install -r requirements.txt
Step 2: Launch ComfyUI
- From the main ComfyUI directory, start ComfyUI as you normally do.
- Open http://localhost:8188 in your browser.
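If you start ComfyUI from the terminal, a typical launch looks like this; exact flags depend on your install, and 8188 is already the default port:
cd /path/to/ComfyUI
python3 main.py --port 8188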
Step 3: Load the example workflows
- In ComfyUI, select File > Open.
- From the custom nodes repo, open the examples folder.
- Load the Single Speaker workflow JSON first.
Step 4: Configure the Single Speaker workflow
- Select an audio file for the reference voice (left panel file picker).
- In the model selector:
- Use VibeVoice 1.5B for basic single-speaker cloning.
- Use VibeVoice Large for best quality (also fine for multi-speaker).
- Keep sampler/guider defaults unless you know what you’re changing.
- Enter the text to speak.
- Click Run.
Step 5: Fix “model loading failed” (if it appears)
If you see a model loading error related to transformers:
- Copy the transformers version suggested in the error message.
- Install that version, then relaunch ComfyUI.
Example:
pip install transformers==<version_from_error>
If issues persist, install the custom node requirements:
cd ComfyUI/custom_nodes/<vibevoice_repo_folder>
pip install -r requirements.txt
Then relaunch ComfyUI and reload the workflow.
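One quick sanity check: confirm which transformers version is active in the same Python environment that ComfyUI runs from:
# Print the installed transformers version
python3 -c "import transformers; print(transformers.__version__)"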
Step 6: Confirm model download and generation
- On the first run, the model will download automatically.
- When generation starts, monitor your system resources if you want:
- The Large model can reach ~18 GB VRAM during synthesis.
- CPU mode is possible, just slower.
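If you’re on an NVIDIA GPU and want to watch VRAM usage while it generates, the driver’s monitoring tool works well:
# Refresh GPU stats every second; press Ctrl+C to stop
watch -n 1 nvidia-smi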
Step 7: Test a second reference voice
- Swap the reference audio to another speaker (e.g., a different gender or timbre).
- Generate again to compare results.
Multi-Speaker Voice Cloning
Load the multi-speaker workflow
- In ComfyUI, go to File > Open.
- Load the Multi Speaker workflow JSON from the examples folder.
Add speakers and dialogue
- Select a reference audio file for each speaker.
- Provide your dialogue in one of two ways:
- Inline text: type directly in the text node.
- Text file: connect a file input node to the dialogue input.
For a file-based script (see the sketch after these steps):
- Save the text file to a location ComfyUI can read (e.g., ComfyUI’s temp directory).
- Select it in the file node and connect it to the dialogue input.
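As a sketch, here is one way to create a dialogue file from the terminal. The speaker-tag syntax shown ([1]:, [2]:) is an assumption based on common VibeVoice node conventions, so check the example workflow in the repo for the exact format your version expects:
# Hypothetical dialogue format; verify the tags against the repo's example
cat > /path/to/ComfyUI/temp/dialogue.txt <<'EOF'
[1]: Hi, and thanks for joining me today.
[2]: Happy to be here. Shall we get started?
[1]: Absolutely. Let's dive in.
EOF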
Run and review
- Click Run to generate the conversation.
- Note:
- Voice assignment can occasionally swap on complex scripts.
- Higher-quality reference audio tends to produce more expressive results.
- You can extend the workflow to add more speakers if needed.
Using a Text File for Input
Inline vs. file-based input
- Inline text is quick for short scripts.
- A text file is better for long dialogue or multiple languages.
Steps to connect a text file
- Add or select the file node in the workflow.
- Choose your text file (e.g., long.txt).
- Connect the file node output to the dialogue/text input in the workflow.
- Click Run.
If needed, place your file in ComfyUI’s temp folder so it’s easy to select from the browser interface.
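For example, assuming a default install layout, staging a long script looks like:
# Copy the script where the file node can find it (path is illustrative)
cp long.txt /path/to/ComfyUI/temp/
# Sanity-check the first few lines
head -n 3 /path/to/ComfyUI/temp/long.txt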
Multilingual Cloning Results
What I tested
I switched back to the single-speaker workflow and tested cloned speech in Arabic, Indonesian, Spanish, Swahili, Norwegian, Hindi, and Urdu.
Observations
- Arabic, Indonesian, and Spanish produced strong results.
- Swahili improved after a second attempt.
- Norwegian missed a bit of timing in one run.
- Hindi was solid.
- Urdu had a noticeable accent, but the cloned voice identity was still recognizable.
Practical notes
- Quality of the reference audio matters. Clear, studio-grade recordings can produce more expressive results.
- Non-native language output may carry accent differences.
- A second run can help if timing or expression is slightly off.
Detailed Setup and Run Guide
Single-Speaker: Step-by-step
- Install custom nodes:
- Clone the repo into ComfyUI/custom_nodes.
- Optionally install the node requirements (pip install -r requirements.txt).
- Launch ComfyUI and open the Single Speaker example.
- Select a reference audio file (WAV or MP3).
- Choose the model:
- 1.5B for lighter runs.
- Large for best quality (requires more disk/VRAM).
- Enter the text to speak.
- Click Run.
- If you hit a transformers error:
- Install the version mentioned in the error and relaunch.
- Listen and evaluate.
Multi-Speaker: Step-by-step
- Open the Multi Speaker example.
- Select reference audio files for each speaker.
- Provide dialogue:
- Inline text, or
- Connect a text file node (upload a .txt file and link it).
- Click Run and review the output.
- Add more speakers if needed by extending the workflow.
Multilingual: Step-by-step
- Use the Single Speaker example.
- Select a clear reference audio.
- Enter text in the target language.
- Click Run.
- If expression or timing is off, try a second pass or adjust the reference audio.
Model Selection and Resource Notes
Choosing the right model
- VibeVoice 1.5B:
- Good starting point for single-speaker cloning.
- Lower resource requirements.
- VibeVoice Large:
- Strong for both single and multi-speaker scenarios.
- Around 17 GB disk space for the model.
- Peaks around 18 GB VRAM during generation.
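To see how much space the downloaded weights actually take, check where they landed. Depending on the node version this is a models subfolder inside ComfyUI or the Hugging Face cache; both paths below are assumptions, so adjust to your setup:
# Try whichever path exists on your machine
du -sh /path/to/ComfyUI/models/* 2>/dev/null
du -sh ~/.cache/huggingface/hub 2>/dev/null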
CPU vs. GPU
- GPU:
- Faster inference.
- Recommended for the Large model and multi-speaker work.
- CPU:
- Works for testing and shorter clips.
- Slower generation.
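To confirm ComfyUI will actually see your GPU, a quick PyTorch check from the same environment helps (ComfyUI itself runs on PyTorch):
# Prints True plus the device name if CUDA is usable
python3 -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU only')"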
Common Troubleshooting
“Model loading failed” due to transformers version
- Cause: Mismatch between installed transformers and what the node expects.
- Fix:
- Copy the transformers version from the error message.
- Install it:
pip install transformers==<version_from_error>
- Relaunch ComfyUI.
If still failing:
- Install the node requirements from the custom node folder:
pip install -r requirements.txt
- Relaunch ComfyUI and re-open the workflow.
File input not loading
- Ensure the file node points to a valid .txt file.
- Place the file in a folder ComfyUI can access (e.g., its temp directory).
- Reconnect the node and run again.
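A couple of terminal checks can rule out the usual causes (path is illustrative):
# Confirm the file exists and is readable
ls -l /path/to/ComfyUI/temp/long.txt
# Confirm it's plain text, not a rich-text or Word document
file /path/to/ComfyUI/temp/long.txt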
Voice assignment mix-ups in multi-speaker runs
- Keep each speaker’s reference audio clear and consistent.
- Shorten or simplify the dialogue to test routing.
- Re-run if lines appear misassigned.
FAQs
Can I run this on CPU?
Yes. It’s slower than GPU, but single-speaker cloning and shorter clips are feasible on CPU.
How much disk space do I need?
- VibeVoice Large is about 17 GB.
- Allow extra space for caches and temporary files.
How much VRAM does it use?
- The Large model can reach around 18 GB VRAM during generation.
- The 1.5B model uses less.
What audio formats are supported for the reference voice?
WAV and MP3 both work well.
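If your clip is in another format, ffmpeg converts it easily. The mono and sample-rate flags below are just sensible defaults for speech, not a documented requirement of the nodes:
# Convert to 24 kHz mono WAV; adjust if your node docs say otherwise
ffmpeg -i reference.m4a -ac 1 -ar 24000 reference.wav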
How do I fix a transformers-related error?
Install the exact transformers version shown in the error, then relaunch ComfyUI:
pip install transformers==<version_from_error>
Where do I put my dialogue text file?
Upload it where ComfyUI can access it (for example, the temp directory), then select it in the file node and connect it to the dialogue input.
Can I add more than two speakers?
Yes. The multi-speaker workflow can be extended to add more speakers by duplicating the speaker blocks and routing their lines.
Does multilingual cloning work?
Yes. You can input text in different languages and the model will attempt to speak in the same cloned voice. Results vary by language and reference quality.
Do I need constant internet access?
You’ll need internet to download the model on the first run. After that, local runs don’t require a connection.
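If your node version pulls weights through the Hugging Face hub (common, but worth verifying), you can force fully offline runs once the model is cached:
# Assumes the weights were already downloaded on a previous run
export HF_HUB_OFFLINE=1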
What affects expressiveness and timing?
Reference audio quality, clarity, and consistency make a noticeable difference. Studio-grade clips often yield better results.
Conclusion
Installing the VibeVoice custom nodes for ComfyUI enables reliable local voice cloning with both single and multi-speaker workflows. The included example JSON files make setup straightforward: select a reference audio, pick a model, input text, and run. If you hit a transformers version error, install the version suggested in the error and relaunch—this resolves the most common setup issue.
For multi-speaker projects, you can feed dialogue inline or from a text file, add more speakers, and refine references for clarity. Multilingual synthesis works well in many cases, and quality improves with better reference audio. GPU is ideal for speed and multi-speaker sessions, but CPU mode can handle shorter tasks. Overall, the workflow is practical, local, and effective for creating cloned voices with ComfyUI.