SoulX-Podcast Setup: Run Realistic Long-Form AI Podcasts Locally

In this guide, I walk through SoulX-Podcast, a model built for long-form, multi-turn, multi-speaker dialogue synthesis with rich paralinguistic diversity. I install it locally, run the demo, and evaluate how well it handles speaker transitions, emotion cues, and dialect support. I also share notes on GPU memory usage, interface behavior, and a concise overview of the architecture.

The focus here is practical: get SoulX-Podcast running, feed it reference voices, generate a podcast-style conversation, and assess quality. Along the way, I point out what works, what needs refinement, and how to format inputs for stronger results.

What is SoulX-Podcast?

SoulX-Podcast is designed to generate extended podcast-style conversations that feel natural in pacing and delivery. It supports multiple speakers in a single session, handles turn-taking, and blends linguistic fluency with paralinguistic signals such as laughter, sighs, and intonation changes.

The model targets authenticity in performance. It aims to reproduce the flow of real conversation, capturing personality cues and dialectal nuances within a coherent narrative. It supports English, Mandarin, and several major Chinese dialects, enabling podcast creation across language contexts.

Overview of SoulX-Podcast

  • Purpose: Long-form, multi-turn, multi-speaker dialogue synthesis for podcast-style content
  • Paralinguistic support: Laughter, sighs, intonation shifts, and other non-verbal cues via tokens/tags
  • Languages: English, Mandarin, and major Chinese dialects
  • Voice cloning: Zero-shot speaker conditioning from short reference samples
  • Model size: ~1.7B parameters (distributed in three shards)
  • Interface: Local Gradio app
  • Default port: localhost:7860
  • Typical VRAM usage: Around 5.6 GB during generation (observed)
  • OS used in test: Ubuntu
  • GPU used in test: NVIDIA H100 80 GB (not required; smaller GPUs work)

Key Features of SoulX-Podcast

  • Multi-turn, multi-speaker dialogue generation with natural pacing.
  • Zero-shot voice cloning from brief reference audio clips.
  • Paralinguistic tokens for non-verbal events like laughter and sighs.
  • Support for Mandarin and several Chinese dialects in addition to English.
  • Smooth speaker transitions and consistent vocal timbre across turns.
  • Local Gradio interface for quick testing and iteration.
  • Relatively low VRAM requirement for generation compared to larger systems.

Install and Run Locally

I installed and ran SoulX-Podcast on Ubuntu with a single NVIDIA GPU. The model also runs on smaller GPUs than the one used in this walkthrough.

System setup

  • OS: Ubuntu
  • GPU: NVIDIA H100 80 GB (you do not need this much VRAM)
  • Python environment: set up from the repository’s root directory

Step-by-step installation

  1. Clone the repository.
  2. From the repository root, install the Python requirements; a command sketch for steps 1 and 2 follows this list. Installation took a few minutes.
  3. Authenticate with Hugging Face to access model files.
  4. Download the SoulX-Podcast model.
  5. Launch the Gradio demo from the repository root.
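
For steps 1 and 2, a minimal command sketch looks like the lines below. The repository URL, directory name, and use of a virtual environment are my assumptions based on the project name; check the project README for the exact instructions.

  # Clone the repository and install dependencies (URL and paths assumed; verify against the README)
  git clone https://github.com/Soul-AILab/SoulX-Podcast.git
  cd SoulX-Podcast
  # A fresh virtual environment keeps the install isolated
  python3 -m venv .venv && source .venv/bin/activate
  pip install -r requirements.txt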

Hugging Face authentication

  • Obtain a free read token from your Hugging Face profile.
  • Paste the token when prompted and confirm to complete login; the stock login command is shown below.
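
This is the standard Hugging Face CLI login, nothing specific to SoulX-Podcast:

  # Log in with a read token from your Hugging Face account settings
  huggingface-cli login
  # Paste the token at the prompt; a read-scope token is sufficient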

Model download

  • Run the provided download command from the repository root (an example is sketched after this list).
  • The model size is about 1.7B parameters split across three shards.
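
A typical download call with the Hugging Face CLI is sketched below. The model repository id and target directory are placeholders I chose for illustration; use whatever the project README specifies.

  # Fetch the model weights (repo id and target directory are placeholders)
  huggingface-cli download Soul-AILab/SoulX-Podcast-1.7B --local-dir pretrained_models/SoulX-Podcast-1.7B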

Launch the demo
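
From the repository root, start the Gradio demo and open localhost:7860 in a browser. The entry-point script name below is an assumption for illustration; the repository README lists the actual command.

  # Launch the local Gradio demo (script name assumed; check the README)
  python app.py
  # Then open http://localhost:7860 in your browser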

Using the Interface

Language and layout

The interface defaults to Chinese on first launch. Switch the UI to English if needed. The layout provides fields for reference audio, text prompts for each speaker, and optional dialect selection.

Add reference speakers

  • Upload a short voice sample for each speaker you plan to include (e.g., one male, one female).
  • The reference clips can be short; the system conditions on these to match voice identity (a quick clip-preparation example follows this list).
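
If your source recording is long or noisy, a quick ffmpeg pass can cut a clean reference clip. The ~10-second duration, mono channel, and 16 kHz sample rate below are my assumptions rather than documented requirements; any short, clean clip should do.

  # Trim a longer recording down to a short mono WAV reference clip
  # (duration and sample rate are assumptions, not documented requirements)
  ffmpeg -i speaker_female.mp3 -t 10 -ac 1 -ar 16000 ref_female.wav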

Enter dialogue and generate

  • Paste the dialogue text in the correct text box (a sample two-speaker script follows this list).
  • Do not select a dialect for English content. Dialect selections apply to Chinese.
  • Click Generate Audio to synthesize the conversation.
  • During generation, I observed VRAM usage around 5.614 GB.
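
A minimal two-speaker script might look like the lines below. The S1/S2 labels follow the speaker numbering used later in this post; the exact label syntax the demo expects may differ, so treat this formatting as an assumption.

  [S1] Welcome back to the show. Today we are looking at running AI podcasts locally.
  [S2] Thanks for having me. I spent the week cloning voices on a single GPU.
  [S1] Then let's get straight into how that went.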

Dialect selection for Chinese

If you are generating Mandarin or a regional Chinese dialect (e.g., Cantonese), choose the appropriate dialect from the selector. For English, leave dialect unset.

GPU memory usage

On my system, generation hovered around 5.6 GB of VRAM, leaving plenty of headroom even on modest GPUs.
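
To watch usage while a generation runs, the standard NVIDIA tool is enough:

  # Refresh GPU utilization and memory every second during generation
  watch -n 1 nvidia-smi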

Results and Observations

Speaker transitions and cloning quality

The transition between speakers was smooth, with strong consistency in voice identity across turns. The cloned voices remained stable and distinct. The pacing felt natural, and turn-taking followed the provided script without drifting.

Paralinguistic cues and emotion handling

The model captured rhythm and intonation well. However, it did not always act on emotion cues written in plain parentheses. Voice cloning remained accurate, but emotional coloring within the words themselves needed refinement for English when relying purely on unformatted cues.

Longer dialogue and same-speaker transitions

I tested longer text with emotional hints and cases where the same speaker had consecutive turns (e.g., S2 followed by S2). Same-speaker continuity held up well. Emotion rendering improved compared to shorter tests, but still left room for stronger expressiveness in English without explicit tags.

Chinese Audio and Dialect Support

SoulX-Podcast includes examples in Chinese and supports dialect switching. In my tests, emotion flow in Chinese seemed stronger than in English when using the same formatting approach. The interface exposes tags for non-verbal sounds such as laughter, which appear to contribute to more expressive delivery in Mandarin and dialectal settings.

Built-in tags for laughter and other paralinguistic events are present, and overall emotional expression in Chinese felt more coherent with those tags in place. Dialect selection also helps align the style of speech for Mandarin variants.
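
For Mandarin or dialect output, the same idea applies: choose the dialect in the selector and keep any non-verbal tags in-line. The tag spelling below is an assumption for illustration; use the tag names the interface actually exposes.

  [S1] 今天我们聊聊用 AI 做播客的体验。<|laughter|> 说实话，效果比我想象的好。
  [S2] 是吗？那你一定要给我演示一下。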

Architecture Overview

While running a new generation pass, I reviewed the system design. SoulX-Podcast uses a multi-speaker, multi-turn generative framework that integrates paralinguistic conditioning, dialect embeddings, and zero-shot voice cloning.

Core design

  • Encoder–decoder structure with attention.
  • Contextual embeddings capture speaker identity, linguistic content, and emotional state together.
  • Paralinguistic tokens add non-verbal events to the synthesis process.

Components

  • Dialect and prosody embeddings support transitions between speech styles across languages and dialects.
  • Conditioning inputs control voice identity and keep timbre consistent through long sequences.

Zero-shot speaker encoding

The zero-shot capability suggests a pre-trained speaker encoder (akin to D-vector style approaches). This allows the system to infer speaker identity from short reference clips without per-speaker training.

A Second Pass with Tagged Cues

I prepared another prompt with explicit tags for non-verbal cues (e.g., laughter, sighs) and re-uploaded the female reference voice after refreshing the page to ensure correct association.

Input formatting

  • Place paralinguistic tags directly within the script at the exact points they should occur (see the tagged example after this list).
  • Keep the speaker labels consistent and make sure the dialogue goes into the correct text field.
  • Avoid mixing reference audio and text prompt fields.
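
The script below illustrates the kind of in-line tagging that this pass used; tag names and bracket style are assumptions for illustration, so substitute whatever tags the demo lists. Note the consecutive S2 turns, which mirror the same-speaker test described earlier.

  [S1] I finally got the long-form generation to hold together. <|laughter|> It only took three tries.
  [S2] <|sigh|> I know that feeling. Walk me through what changed.
  [S2] Actually, start with the reference audio, because that part confused me.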

Outcome

With clearer tagging and proper field placement, the generated conversation improved. Speaker separation remained clear, and the system respected turn order. Expressiveness increased when cues were embedded as tags in-line with the text.

Practical Tips

  • Reference audio: Use clean, noise-free clips for each speaker. Keep them short but distinct.
  • Input placement: Double-check that dialogue is pasted into the correct prompt box before generation.
  • Cues formatting: Insert paralinguistic tags in-line at the moment you want them to occur; bracketed cues outside the sentence may be ignored.
  • Dialects: Use the dialect selector for Chinese content; leave it unset for English.
  • Iteration: If a speaker is not picked correctly, refresh the interface, re-upload the reference clips, and regenerate.
  • Monitoring: Keep an eye on VRAM usage during generation. Around 5.6 GB was typical in my runs.

Performance Notes

  • Model size: ~1.7B parameters across three shards.
  • Local runtime: The Gradio demo starts quickly and responds well.
  • Generation speed: Reasonably fast for multi-turn outputs, with stable VRAM usage observed around 5.614 GB.
  • Stability: Long dialogues maintained consistent voice identity and clear turn-taking.

Step-by-Step Guide: From Setup to First Podcast

  1. Prepare your environment

    • Install system dependencies as required by the repository.
    • Set up a fresh Python environment.
  2. Get the code

    • Clone the repository and move to its root directory.
  3. Install requirements

    • Install the dependencies listed by the project. This took a few minutes.
  4. Authenticate with Hugging Face

    • Get your read token and log in so the model files can be downloaded.
  5. Download the model

    • Run the provided command from the project root to fetch the 1.7B-parameter model shards.
  6. Launch the Gradio app

    • Start the demo script from the project root and open localhost:7860 in your browser (a condensed command recap follows this guide).
  7. Configure the UI

    • Switch the interface to English if needed.
    • Upload reference audio for each speaker.
    • Paste the dialogue for each turn in the appropriate text area.
  8. Generate and review

    • Leave dialect unselected for English; pick the right dialect for Chinese.
    • Click Generate Audio and monitor GPU usage.
    • Play back the result and adjust your tags or formatting as needed.
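
For quick reference, the whole flow condenses to the sequence below. The repository URL, model id, and script name carry the same assumptions flagged earlier; verify each against the official README.

  # End-to-end sketch (repository URL, model id, and script name assumed)
  git clone https://github.com/Soul-AILab/SoulX-Podcast.git && cd SoulX-Podcast
  pip install -r requirements.txt
  huggingface-cli login
  huggingface-cli download Soul-AILab/SoulX-Podcast-1.7B --local-dir pretrained_models/SoulX-Podcast-1.7B
  python app.py   # then open http://localhost:7860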

Troubleshooting Checklist

  • Voice assignment issues

    • Refresh the page, re-upload reference audios, and regenerate.
  • Emotion not as expected

    • Add explicit tags (e.g., laughter, sighs) at the exact points in the text.
    • Keep sentences concise and place cues close to the words they modify.
  • Wrong field usage

    • Confirm that the dialogue is placed in the correct text prompt area, not in the reference audio section.
  • Dialect mismatch

    • For Chinese outputs, ensure the correct dialect is selected to improve style and prosody.
  • Resource limits

    • Close other GPU-intensive apps if VRAM is tight; typical usage was around 5.6 GB in my tests.

Conclusion

SoulX-Podcast delivers long-form, multi-speaker dialogue that holds together across turns with consistent voice identity and natural pacing. The system supports paralinguistic tokens and dialect embeddings, enabling a richer podcast experience, particularly in Mandarin and Chinese dialects.

On the practical side, setup is straightforward: install dependencies, authenticate with Hugging Face, download the model, and launch the Gradio demo. With clean reference audio, careful input formatting, and explicit inline cues, the generated output improves noticeably. Voice cloning is strong, transitions are smooth, and performance is solid on a single GPU with modest VRAM needs.
