KaniTTS Real‑Time Text‑to‑Speech

Table of Contents
- What is KaniTTS?
- Overview of KaniTTS
- Key Features of KaniTTS
- Test Setup and Goals
- Environment Preparation
- Repository and Web Demo
- Installation and Launch
- Step-by-Step Guide
- First Launch Notes
- First Run and Performance Observations
- Latency and Throughput
- Memory Footprint
- Multilingual Voices and Presets
- Language Coverage
- Voice Presets and Style Controls
- Expressivity and Artifacts
- Quality Notes from Testing
- Observed Strengths
- Observed Limitations
- Long-Form Synthesis
- Running the FastAPI Demo
- VRAM and Throughput Summary
- Fine-Tuning and Extensibility
- Practical Fine-Tuning Pointers
- Building for Low Latency
- Practical Tips for Daily Use
- Stability and Style Control
- Security and Deployment Considerations
- Measured Outcomes
- When to Consider KaniTTS
- Conclusion
Today I’m installing and stress-testing KaniTTS, a 370M-parameter real-time text-to-speech model designed for low latency and a small memory footprint. It pairs an LLM-style tokenizer with a neural audio codec to generate clean 22 kHz speech quickly.
The published performance targets are ambitious: about 1 second to generate 15 seconds of audio on an RTX 5080, using roughly 2 GB of VRAM. Reported quality lands around MOS 4.3 with a word error rate under 5%, which is strong for real-time synthesis. My goal here is to set it up, run the web demo, measure VRAM use during synthesis, and assess voice quality across multiple languages and presets.
What is KaniTTS?
KaniTTS is a multilingual TTS model aimed at responsive voice experiences. It is built around two core ideas: a tokenizer inspired by modern language models for text processing and a neural audio codec for efficient, high-quality waveform generation at 22 kHz.
The model focuses on fast synthesis, expressive presets, and broad language coverage. It is available under the Apache 2.0 license, and a base model is provided for fine-tuning. This makes it well-suited for conversational agents, voice interfaces, and any application that benefits from low-latency speech generation.
Overview of KaniTTS
| Aspect | Details |
|---|---|
| Model size | ~370M parameters |
| Output | 22 kHz speech |
| Core components | LLM-style tokenizer + neural audio codec |
| Target latency | ~1 s to generate ~15 s of audio on an RTX 5080 (reported) |
| Reported memory use | ~2 GB VRAM for the above throughput (reported) |
| License | Apache 2.0 |
| Supported languages | English, German, Chinese (Mandarin), Cantonese, Korean, Arabic, Spanish |
| Presets | Multiple voice and style presets (e.g., British, English female, “legend mode”) |
| Interface | FastAPI web demo, served locally on port 7860 |
| Fine-tuning | Base model available for custom training |
Key Features of KaniTTS
- Real-time synthesis with consistent latency targets
- Efficient memory use for GPU deployment
- 22 kHz output that sounds clear and expressive
- Multilingual support across key languages
- Voice and style presets for quick customization
- Apache 2.0 license and base model for fine-tuning
Test Setup and Goals
I’m using a workstation with an NVIDIA RTX A6000 (48 GB VRAM). The aim is to:
- Create a fresh environment and install prerequisites
- Clone the KaniTTS repository
- Launch the FastAPI web demo and let the model download
- Measure VRAM consumption during synthesis
- Evaluate response time and output quality across several voices and languages
Environment Preparation
- Create and activate a new Python virtual environment
- Install system and Python dependencies required by the project
- Verify GPU visibility and driver/toolkit readiness (a quick check is sketched below)
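Before installing the project's requirements, it is worth confirming that the GPU is actually visible from Python. A minimal check, assuming PyTorch is already installed in the fresh environment:

```python
# Minimal GPU sanity check; assumes PyTorch is installed in the new environment.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    free, total = torch.cuda.mem_get_info()
    print(f"Free VRAM: {free / 1e9:.1f} GB of {total / 1e9:.1f} GB")
```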
Repository and Web Demo
- Clone the KaniTTS repository
- Install the project’s Python requirements
- Start the FastAPI demo server
- Wait for the model to download on first run
- Access the web UI at localhost on port 7860
Installation and Launch
Step-by-Step Guide
- Create a virtual environment: set up a clean Python environment to avoid dependency conflicts.
- Install prerequisites: install the required Python packages listed by the repository.
- Clone the repository: fetch the KaniTTS codebase from its Git source.
- Launch the FastAPI demo: start the server and confirm the app is serving on localhost:7860 (a quick reachability check is sketched after this list). Allow time for the model weights to download on first run.
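Once the server reports it is running, a quick way to confirm the demo is reachable before opening a browser (this only checks that the root page responds; it does not exercise synthesis):

```python
# Reachability check for the demo UI; only verifies the root page responds.
import requests

resp = requests.get("http://localhost:7860", timeout=10)
print(resp.status_code)  # expect 200 once the UI is up
```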
First Launch Notes
- The web UI initializes quickly once dependencies are in place.
- Model weights are fetched automatically by the demo script.
- After download, the interface becomes responsive and ready for text prompts.
First Run and Performance Observations
The demo ran smoothly on the A6000. During synthesis, VRAM usage settled around 3.6 GB in my session. The model responded quickly to short prompts, and the generated audio felt snappy.
Response time was short enough to encourage interactive use. While per-prompt latency varies by preset and text length, turn-around stayed within a responsive range for live testing. The output was clean at 22 kHz, with strong articulation and styling.
Latency and Throughput
- Reported reference: ~1 second to generate ~15 seconds of audio on an RTX 5080
- Practical feel in the demo: quick initiation and fast completion for short prompts
- Long-form synthesis also completed promptly, though longer inputs naturally take more time to process
Memory Footprint
- Published target: ~2 GB VRAM for the reported throughput on an RTX 5080
- Observed during testing: about 3.6 GB VRAM use on an A6000 with the web demo running
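If you prefer logging VRAM programmatically instead of watching a monitoring tool, a one-off snapshot through the NVML Python bindings works; this sketch assumes the pynvml package is installed:

```python
# One-off VRAM snapshot for GPU 0; assumes the pynvml package is installed.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"Used VRAM: {mem.used / 1e9:.2f} GB of {mem.total / 1e9:.2f} GB")
pynvml.nvmlShutdown()
```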
Multilingual Voices and Presets
KaniTTS includes multiple voices and presets across English, German, Chinese (Mandarin), Cantonese, Korean, Arabic, and Spanish. Switching voices is immediate in the web UI, and synthesis remains fast.
I evaluated several English presets (including British and female voices) and then cycled through available options in German, Mandarin, Cantonese, Korean, Arabic, and Spanish. Across languages, pronunciation quality was strong for short passages, and the model kept its responsive feel.
Language Coverage
- English
- German
- Chinese (Mandarin)
- Cantonese
- Korean
- Arabic
- Spanish
Voice Presets and Style Controls
- The catalog includes multiple presets with distinct timbres and styles.
- “Legend mode” and other expressive presets are included.
- Emotional coloration is present; outputs can convey emphasis and energy.
Expressivity and Artifacts
- Expressivity is strong; the model conveys emotions and emphasis well.
- Certain presets may introduce stutter or prosody drift in rare cases.
- For conversational agents, that expressivity can add realism, but monitoring and prompt design help keep delivery consistent.
Quality Notes from Testing
The overall impression is positive: articulation is clear, the noise floor is low, and words are easy to follow. Short prompts consistently sounded natural and expressive. The speed and VRAM consumption make it practical for interactive applications.
I did notice occasional issues in extended passages, such as minor repetition or misplaced emphasis. These were not frequent but appeared more in long-form prompts than in short ones. Tuning presets and adjusting text input (for example, punctuation for phrasing) can help.
Observed Strengths
- Quick synthesis for both short and long prompts
- Clear pronunciation and expressive delivery
- Good multilingual performance across the listed languages
Observed Limitations
- Minor errors may appear in longer passages
- Some presets can introduce noticeable style artifacts if pushed hard
Long-Form Synthesis
KaniTTS can handle paragraph-length text gracefully. Response times remain reasonable, and the output remains coherent across sentences. Quality is consistent with short-form results, though longer content increases the chance of small errors.
If you plan to produce multi-paragraph narration, consider these practical steps (a simple chunking helper is sketched after the list):
- Split very long passages into smaller segments
- Add punctuation to guide phrasing
- Select presets that prioritize stability for lengthier content
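One simple way to implement the splitting step is to group sentences under a character budget. The helper below is a sketch; the sentence splitting on end punctuation is intentionally naive, and the 400-character budget is just a starting point:

```python
import re

def chunk_text(text: str, max_chars: int = 400) -> list[str]:
    """Group sentences into chunks no longer than max_chars (naive sentence splitting)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Feed each chunk to the model in order and concatenate the audio afterwards; shorter segments also make it easy to re-synthesize only the part that came out wrong.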
Running the FastAPI Demo
The provided demo app makes it simple to test the model locally. Once it’s running:
- Enter text in the prompt field
- Choose a language or voice preset
- Trigger synthesis and listen to the generated audio
- Download clips for offline checking or comparison
For measurement during stress testing:
- Keep a GPU monitoring tool open to track VRAM use
- Note time to first audio and completion time for long passages
- Compare presets and languages for differences in latency and memory
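If you want wall-clock numbers rather than impressions, a small client can time each request. The endpoint path and payload fields below are hypothetical placeholders; check the running demo's API (FastAPI apps typically expose interactive docs at /docs) and adjust them to the real interface:

```python
# Rough latency measurement against the local demo.
# NOTE: the endpoint path and payload fields are hypothetical placeholders;
# adjust them to match the demo's actual API.
import time
import requests

payload = {"text": "A short test sentence for latency measurement.", "voice": "default"}
start = time.perf_counter()
resp = requests.post("http://localhost:7860/synthesize", json=payload, timeout=120)
elapsed = time.perf_counter() - start
print(f"HTTP {resp.status_code}, {len(resp.content)} bytes in {elapsed:.2f} s")
```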
VRAM and Throughput Summary
- Startup with the web demo: comfortable headroom on a 48 GB GPU
- Synthesis on short prompts: ~3.6 GB VRAM in my session
- Reported throughput reference (RTX 5080): ~15 seconds of audio per second at ~2 GB VRAM
These figures point to an efficient model that scales well for real-time needs. If you are targeting very low-latency pipelines, you can prioritize shorter input chunks, preselected presets, and a single-GPU deployment for serving.
Fine-Tuning and Extensibility
The team has released a base model suitable for fine-tuning. This is valuable if you want:
- A custom voice tuned to specific data
- Domain-specific pronunciation and jargon handling
- Style controls aligned with brand or product voice
With Apache 2.0 licensing, integration into internal tooling and products is straightforward. Keep training datasets clean and well-labeled, and document your fine-tuning procedure to maintain reproducibility.
Practical Fine-Tuning Pointers
- Prepare transcribed audio with consistent recording quality
- Use balanced datasets to avoid undesirable style bias
- Validate intermediate checkpoints on both short and long-form text
- Monitor WER and subjective MOS during iteration
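For the WER part of that loop, a lightweight check can be scripted once you have reference transcripts and ASR transcripts of the synthesized audio; this sketch assumes the jiwer package is installed:

```python
# Word error rate between reference transcripts and ASR transcripts of the
# synthesized audio; assumes the jiwer package is installed.
from jiwer import wer

references = ["the quick brown fox jumps over the lazy dog"]
hypotheses = ["the quick brown fox jumped over the lazy dog"]

print(f"WER: {wer(references, hypotheses):.3f}")
```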
Building for Low Latency
KaniTTS is built with responsiveness in mind. For production systems:
- Keep the model warm and pinned to GPU memory
- Reuse a single process with a small worker pool
- Minimize text preprocessing latency and cache common prompts
- Consider batching only if it does not add noticeable delay
These strategies help maintain the real-time feel that KaniTTS targets.
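A common way to implement the "keep the model warm" point is to load it once at application startup and reuse it for every request. The sketch below shows only that pattern; load_model() and model.synthesize() are placeholders rather than KaniTTS's actual API:

```python
# Keep-warm serving pattern: load once at startup, reuse per request.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = None  # loaded once at startup and kept on the GPU


class SynthesisRequest(BaseModel):
    text: str
    voice: str = "default"


def load_model():
    # Placeholder: replace with the repository's actual model-loading code.
    raise NotImplementedError


@app.on_event("startup")
def warm_up() -> None:
    global model
    model = load_model()


@app.post("/tts")
def tts(req: SynthesisRequest) -> dict:
    # model.synthesize(...) is a stand-in for the real inference call.
    audio = model.synthesize(req.text, voice=req.voice)
    return {"num_samples": len(audio)}
```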
Practical Tips for Daily Use
- Choose presets that match your application’s tone
- For multilingual apps, preselect voices per language to avoid overhead
- Save output clips to compare versions when tuning prompts
- Keep an eye on VRAM through your preferred monitoring tool
- For long passages, add punctuation and paragraph breaks to guide rhythm
Stability and Style Control
Expressive models can occasionally produce stutter or irregular prosody, especially with highly emotive presets. If you encounter instability:
- Switch to a more neutral preset for long passages
- Reduce exclamation-heavy phrasing and excessive emphasis markers
- Break long inputs into sensible chunks and review per segment
This balance preserves expressivity while reducing the chance of artifacts.
Security and Deployment Considerations
- Validate text inputs server-side to avoid injection into prompt fields
- Limit maximum input length per request to protect performance
- Log synthesis times and error rates for ongoing monitoring
- Version your voices, presets, and configuration for consistent outputs
These guardrails help keep services reliable as usage scales.
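For the input-length guardrail in particular, request validation can enforce a hard cap before any text reaches the model. A minimal sketch using Pydantic field constraints (the 2,000-character cap is an arbitrary example):

```python
# Enforce a maximum input length at the request-validation layer.
from fastapi import FastAPI
from pydantic import BaseModel, Field

app = FastAPI()

MAX_INPUT_CHARS = 2000  # tune to your latency and memory budget


class SynthesisRequest(BaseModel):
    text: str = Field(min_length=1, max_length=MAX_INPUT_CHARS)
    voice: str = "default"


@app.post("/tts")
def tts(req: SynthesisRequest) -> dict:
    # Requests over the cap are rejected with a 422 before touching the model.
    return {"accepted_chars": len(req.text)}
```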
Measured Outcomes
From my run:
- The web UI initialized cleanly after installation
- VRAM usage was around 3.6 GB during active synthesis
- Synthesis speed felt fast and responsive across presets
- Multilingual voices produced clear results for short inputs
- Long-form text worked well with occasional minor errors
These observations align with the model’s stated goals: fast generation, efficient memory use, and expressive delivery across several languages.
When to Consider KaniTTS
- Interactive voice interfaces that need quick turnaround
- Multilingual apps covering English, German, Chinese, Cantonese, Korean, Arabic, and Spanish
- Projects that benefit from expressive presets with controllable style
- Deployments that require permissive licensing and fine-tuning options
Conclusion
KaniTTS sets a strong bar for real-time, multilingual TTS. The pairing of an LLM-style tokenizer and a neural audio codec produces clean 22 kHz audio with fast responses and moderate VRAM needs. In testing, it ran smoothly, switched voices quickly, and delivered expressive speech with minimal friction.
The reported metrics (MOS around 4.3 and a word error rate under 5%) match the listening experience in short-form use. Long-form synthesis is also solid, with minor issues that can be mitigated through preset selection and prompt design. The availability of a fine-tunable base model under Apache 2.0 makes it a practical choice for both experiments and production.