VibeVoice in ComfyUI: Single & Multi-Speaker Voice Cloning

Table of Contents
- Introduction
- What is Microsoft VibeVoice with ComfyUI?
- Overview
- Key Features
- How VibeVoice with ComfyUI Works
- Core idea
- Single-speaker workflow
- Multi-speaker workflow
- How to Use It
- Prerequisites
- Step 1: Install the custom nodes
- Step 2: Launch ComfyUI
- Step 3: Load the example workflows
- Step 4: Configure the Single Speaker workflow
- Step 5: Fix “model loading failed” (if it appears)
- Step 6: Confirm model download and generation
- Step 7: Test a second reference voice
- Multi-Speaker Voice Cloning
- Load the multi-speaker workflow
- Add speakers and dialogue
- Run and review
- Using a Text File for Input
- Inline vs. file-based input
- Steps to connect a text file
- Multilingual Cloning Results
- What I tested
- Observations
- Practical notes
- Detailed Setup and Run Guide
- Single-Speaker: Step-by-step
- Multi-Speaker: Step-by-step
- Multilingual: Step-by-step
- Model Selection and Resource Notes
- Choosing the right model
- CPU vs. GPU
- Common Troubleshooting
- “Model loading failed” due to transformers version
- File input not loading
- Voice assignment mix-ups in multi-speaker runs
- FAQs
- Can I run this on CPU?
- How much disk space do I need?
- How much VRAM does it use?
- What audio formats are supported for the reference voice?
- How do I fix a transformers-related error?
- Where do I put my dialogue text file?
- Can I add more than two speakers?
- Does multilingual cloning work?
- Do I need constant internet access?
- What affects expressiveness and timing?
- Conclusion
Introduction
This guide shows how to install and use the community ComfyUI custom nodes for Microsoft VibeVoice to clone voices locally. I’ll walk you through the single-speaker and multi-speaker workflows, show how to choose the right model and fix a common setup error, and finish with multilingual cloning. You’ll see what to expect on CPU vs. GPU and how to feed dialogue from either inline text or a file.
I’m working on Ubuntu with an NVIDIA RTX A6000 (48 GB VRAM), but you can run everything on CPU too. The steps below follow the exact order I used: setup, testing, troubleshooting, and results.
What is Microsoft VibeVoice with ComfyUI?
Microsoft VibeVoice is a voice cloning model that can mimic a speaker from a short audio sample. Through community-made custom nodes, VibeVoice plugs into ComfyUI so you can run it locally with visual workflows.
With these nodes, you can:
- Clone a single speaker from a reference audio file.
- Run multi-speaker dialogue with separate references for each voice.
- Input text directly in ComfyUI or from a text file node.
- Generate speech in multiple languages using the cloned voice.
Overview
| Component | Purpose | Notes |
| --- | --- | --- |
| ComfyUI | Node-based UI for local AI workflows | Assumed installed before starting |
| VibeVoice custom nodes | Adds VibeVoice voice cloning to ComfyUI | Clone into ComfyUI/custom_nodes |
| VibeVoice 1.5B model | Good for single-speaker cloning | Lower resource requirements |
| VibeVoice Large model | Best for multi-speaker work | ~17 GB disk, ~18 GB VRAM during generation |
| Input formats | Reference audio files | Supports WAV and MP3 |
| Text input | Content to speak | Inline text or connect a text-file node |
| Port | Access in browser | Default used here: http://localhost:8188 |
| OS used here | Ubuntu | CPU mode also works (slower) |
Key Features
- Local voice cloning for single and multi-speaker workflows in ComfyUI.
- Model selection inside the workflow (1.5B or Large).
- Text input inline or via a file node for longer scripts.
- Multilingual synthesis from a single reference voice.
- Works on GPU or CPU (GPU is faster; CPU is viable for testing).
- Example JSON workflows included in the repo.
- Practical troubleshooting for a known transformers version error.
- Supports MP3 and WAV reference audio files.
- Ability to add more speakers to the multi-speaker workflow.
How VibeVoice with ComfyUI Works
Core idea
- Provide a short reference audio clip of a speaker.
- Select a VibeVoice model.
- Enter the text you want the cloned voice to speak.
- Generate speech that follows the voice characteristics of the reference.
Single-speaker workflow
- One reference audio input.
- One text input.
- The model extracts voice characteristics and synthesizes speech in the same voice.
Multi-speaker workflow
- Multiple reference audio inputs (one per speaker).
- A structured dialogue text input (inline or from a file).
- The workflow routes lines to the correct speaker following the order you define.
How to Use It
Prerequisites
- ComfyUI installed locally.
- Python and Git available.
- A few short, clear reference audio files (WAV or MP3).
- Enough disk space for the model download (Large model ~17 GB).
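Before installing anything, it’s worth confirming these basics from a terminal (the ComfyUI path below is illustrative):
# Check that Python and Git are available
python3 --version
git --version
# Check free disk space where ComfyUI lives (the Large model needs ~17 GB)
df -h /path/to/ComfyUI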
Step 1: Install the custom nodes
- Open a terminal and go to your ComfyUI directory.
- Navigate to the custom_nodes folder.
- Clone the VibeVoice custom nodes repo.
Example:
cd /path/to/ComfyUI
cd custom_nodes
git clone <repo_url_for_vibevoice_custom_nodes>
If needed, install the node requirements:
cd <cloned_repo_directory>
pip install -r requirements.txt
Step 2: Launch ComfyUI
- From the main ComfyUI directory, start ComfyUI as you normally do.
- Open http://localhost:8188 in your browser.
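If you start ComfyUI from the terminal, a typical launch looks like this; exact flags depend on your install, and 8188 is already the default port:
cd /path/to/ComfyUI
python3 main.py --port 8188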
Step 3: Load the example workflows
- In ComfyUI, select File > Open.
- From the custom nodes repo, open the examples folder.
- Load the Single Speaker workflow JSON first.
Step 4: Configure the Single Speaker workflow
- Select an audio file for the reference voice (left panel file picker).
- In the model selector:
- Use VibeVoice 1.5B for basic single-speaker cloning.
- Use VibeVoice Large for best quality (also fine for multi-speaker).
- Keep sampler/guider defaults unless you know what you’re changing.
- Enter the text to speak.
- Click Run.
Step 5: Fix “model loading failed” (if it appears)
If you see a model loading error related to transformers:
- Copy the transformers version suggested in the error message.
- Install that version, then relaunch ComfyUI.
Example:
pip install transformers==<version_from_error>
If issues persist, install the custom node requirements:
cd ComfyUI/custom_nodes/<vibevoice_repo_folder>
pip install -r requirements.txt
Then relaunch ComfyUI and reload the workflow.
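One quick sanity check: confirm which transformers version is active in the same Python environment that ComfyUI runs from:
# Print the installed transformers version
python3 -c "import transformers; print(transformers.__version__)"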
Step 6: Confirm model download and generation
- On the first run, the model will download automatically.
- When generation starts, monitor your system resources if you want:
- The Large model can reach ~18 GB VRAM during synthesis.
- CPU mode is possible, just slower.
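If you’re on an NVIDIA GPU and want to watch VRAM usage while it generates, the driver’s monitoring tool works well:
# Refresh GPU stats every second; press Ctrl+C to stop
watch -n 1 nvidia-smi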
Step 7: Test a second reference voice
- Swap the reference audio to another speaker (e.g., a different gender or timbre).
- Generate again to compare results.
Multi-Speaker Voice Cloning
Load the multi-speaker workflow
- In ComfyUI, go to File > Open.
- Load the Multi Speaker workflow JSON from the examples folder.
Add speakers and dialogue
- Select a reference audio file for each speaker.
- Provide your dialogue in one of two ways:
- Inline text: type directly in the text node.
- Text file: connect a file input node to the dialogue input.
For a file-based script (see the sketch after these steps):
- Save the text file to a location ComfyUI can read (e.g., ComfyUI’s temp directory).
- Select it in the file node and connect it to the dialogue input.
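As a sketch, here is one way to create a dialogue file from the terminal. The speaker-tag syntax shown ([1]:, [2]:) is an assumption based on common VibeVoice node conventions, so check the example workflow in the repo for the exact format your version expects:
# Hypothetical dialogue format; verify the tags against the repo's example
cat > /path/to/ComfyUI/temp/dialogue.txt <<'EOF'
[1]: Hi, and thanks for joining me today.
[2]: Happy to be here. Shall we get started?
[1]: Absolutely. Let's dive in.
EOF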
Run and review
- Click Run to generate the conversation.
- Note:
- Voice assignment can occasionally swap on complex scripts.
- Higher-quality reference audio tends to produce more expressive results.
- You can extend the workflow to add more speakers if needed.
Using a Text File for Input
Inline vs. file-based input
- Inline text is quick for short scripts.
- A text file is better for long dialogue or multiple languages.
Steps to connect a text file
- Add or select the file node in the workflow.
- Choose your text file (e.g., long.txt).
- Connect the file node output to the dialogue/text input in the workflow.
- Click Run.
If needed, place your file in ComfyUI’s temp folder so it’s easy to select from the browser interface.
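For example, assuming a default install layout, staging a long script looks like:
# Copy the script where the file node can find it (path is illustrative)
cp long.txt /path/to/ComfyUI/temp/
# Sanity-check the first few lines
head -n 3 /path/to/ComfyUI/temp/long.txt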
Multilingual Cloning Results
What I tested
I switched back to the single-speaker workflow and tested cloned speech in Arabic, Indonesian, Spanish, Swahili, Norwegian, Hindi, and Urdu.
Observations
- Arabic, Indonesian, and Spanish produced strong results.
- Swahili improved after a second attempt.
- Norwegian missed a bit of timing in one run.
- Hindi was solid.
- Urdu had a noticeable accent, but the cloned voice identity was still recognizable.
Practical notes
- Quality of the reference audio matters. Clear, studio-grade recordings can produce more expressive results.
- Non-native language output may carry accent differences.
- A second run can help if timing or expression is slightly off.
Detailed Setup and Run Guide
Single-Speaker: Step-by-step
- Install custom nodes:
- Clone the repo into ComfyUI/custom_nodes.
- Optionally install the node requirements (pip install -r requirements.txt).
- Launch ComfyUI and open the Single Speaker example.
- Select a reference audio file (WAV or MP3).
- Choose the model:
- 1.5B for lighter runs.
- Large for best quality (requires more disk/VRAM).
- Enter the text to speak.
- Click Run.
- If you hit a transformers error:
- Install the version mentioned in the error and relaunch.
- Listen and evaluate.
Multi-Speaker: Step-by-step
- Open the Multi Speaker example.
- Select reference audio files for each speaker.
- Provide dialogue:
- Inline text, or
- Connect a text file node (upload a .txt file and link it).
- Click Run and review the output.
- Add more speakers if needed by extending the workflow.
Multilingual: Step-by-step
- Use the Single Speaker example.
- Select a clear reference audio.
- Enter text in the target language.
- Click Run.
- If expression or timing is off, try a second pass or adjust the reference audio.
Model Selection and Resource Notes
Choosing the right model
- VibeVoice 1.5B:
- Good starting point for single-speaker cloning.
- Lower resource requirements.
- VibeVoice Large:
- Strong for both single and multi-speaker scenarios.
- Around 17 GB disk space for the model.
- Peaks around 18 GB VRAM during generation.
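To see how much space the downloaded weights actually take, check where they landed. Depending on the node version this is a models subfolder inside ComfyUI or the Hugging Face cache; both paths below are assumptions, so adjust to your setup:
# Try whichever path exists on your machine
du -sh /path/to/ComfyUI/models/* 2>/dev/null
du -sh ~/.cache/huggingface/hub 2>/dev/null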
CPU vs. GPU
- GPU:
- Faster inference.
- Recommended for the Large model and multi-speaker work.
- CPU:
- Works for testing and shorter clips.
- Slower generation.
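To confirm ComfyUI will actually see your GPU, a quick PyTorch check from the same environment helps (ComfyUI itself runs on PyTorch):
# Prints True plus the device name if CUDA is usable
python3 -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU only')"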
Common Troubleshooting
“Model loading failed” due to transformers version
- Cause: Mismatch between installed transformers and what the node expects.
- Fix:
- Copy the transformers version from the error message.
- Install it:
pip install transformers==<version_from_error>
- Relaunch ComfyUI.
If still failing:
- Install the node requirements from the custom node folder:
pip install -r requirements.txt
- Relaunch ComfyUI and re-open the workflow.
File input not loading
- Ensure the file node points to a valid .txt file.
- Place the file in a folder ComfyUI can access (e.g., its temp directory).
- Reconnect the node and run again.
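A couple of terminal checks can rule out the usual causes (path is illustrative):
# Confirm the file exists and is readable
ls -l /path/to/ComfyUI/temp/long.txt
# Confirm it's plain text, not a rich-text or Word document
file /path/to/ComfyUI/temp/long.txt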
Voice assignment mix-ups in multi-speaker runs
- Keep each speaker’s reference audio clear and consistent.
- Shorten or simplify the dialogue to test routing.
- Re-run if lines appear misassigned.
FAQs
Can I run this on CPU?
Yes. It’s slower than GPU, but single-speaker cloning and shorter clips are feasible on CPU.
How much disk space do I need?
- VibeVoice Large is about 17 GB.
- Allow extra space for caches and temporary files.
How much VRAM does it use?
- The Large model can reach around 18 GB VRAM during generation.
- The 1.5B model uses less.
What audio formats are supported for the reference voice?
WAV and MP3 both work well.
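If your clip is in another format, ffmpeg converts it easily. The mono and sample-rate flags below are just sensible defaults for speech, not a documented requirement of the nodes:
# Convert to 24 kHz mono WAV; adjust if your node docs say otherwise
ffmpeg -i reference.m4a -ac 1 -ar 24000 reference.wav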
How do I fix a transformers-related error?
Install the exact transformers version shown in the error, then relaunch ComfyUI:
pip install transformers==<version_from_error>
Where do I put my dialogue text file?
Upload it where ComfyUI can access it (for example, the temp directory), then select it in the file node and connect it to the dialogue input.
Can I add more than two speakers?
Yes. The multi-speaker workflow can be extended to add more speakers by duplicating the speaker blocks and routing their lines.
Does multilingual cloning work?
Yes. You can input text in different languages and the model will attempt to speak in the same cloned voice. Results vary by language and reference quality.
Do I need constant internet access?
You’ll need internet to download the model on the first run. After that, local runs don’t require a connection.
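If your node version pulls weights through the Hugging Face hub (common, but worth verifying), you can force fully offline runs once the model is cached:
# Assumes the weights were already downloaded on a previous run
export HF_HUB_OFFLINE=1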
What affects expressiveness and timing?
Reference audio quality, clarity, and consistency make a noticeable difference. Studio-grade clips often yield better results.
Conclusion
Installing the VibeVoice custom nodes for ComfyUI enables reliable local voice cloning with both single and multi-speaker workflows. The included example JSON files make setup straightforward: select a reference audio, pick a model, input text, and run. If you hit a transformers version error, install the version suggested in the error and relaunch—this resolves the most common setup issue.
For multi-speaker projects, you can feed dialogue inline or from a text file, add more speakers, and refine references for clarity. Multilingual synthesis works well in many cases, and quality improves with better reference audio. GPU is ideal for speed and multi-speaker sessions, but CPU mode can handle shorter tasks. Overall, the workflow is practical, local, and effective for creating cloned voices with ComfyUI.