
KaniTTS Real-Time Text-to-Speech

Today I’m installing and stress-testing KaniTTS, a 370M-parameter real-time text-to-speech model designed for low latency and a small memory footprint. It pairs an LLM-style tokenizer with a neural audio codec to generate clean 22 kHz speech quickly.

The published performance targets are ambitious: about 1 second to generate 15 seconds of audio on an RTX 5080, using roughly 2 GB of VRAM. Reported quality lands around MOS 4.3 with a word error rate under 5%, which is considered strong for real-time synthesis. My goal here is to set it up, run the web demo, measure VRAM use during synthesis, and assess voice quality across multiple languages and presets.

What is KaniTTS?

KaniTTS is a multilingual TTS model aimed at responsive voice experiences. It is built around two core ideas: a tokenizer inspired by modern language models for text processing and a neural audio codec for efficient, high-quality waveform generation at 22 kHz.

The model focuses on fast synthesis, expressive presets, and broad language coverage. It is available under the Apache 2.0 license, and a base model is provided for fine-tuning. This makes it well-suited for conversational agents, voice interfaces, and any application that benefits from low-latency speech generation.

Overview of KaniTTS

Aspect | Details
Model size | ~370M parameters
Output | 22 kHz speech
Core components | LLM-style tokenizer + neural audio codec
Target latency | ~1 s to generate 15 s of audio on an RTX 5080 (reported)
Reported memory use | ~2 GB VRAM at the above throughput (reported)
License | Apache 2.0
Supported languages | English, German, Chinese (Mandarin), Cantonese, Korean, Arabic, Spanish
Presets | Multiple voice and style presets (e.g., British, English female, “legend mode”)
Interface | FastAPI web demo served locally on port 7860
Fine-tuning | Base model available for custom training

Key Features of KaniTTS

  • Real-time synthesis with consistent latency targets
  • Efficient memory use for GPU deployment
  • 22 kHz output that sounds clear and expressive
  • Multilingual support across key languages
  • Voice and style presets for quick customization
  • Apache 2.0 license and base model for fine-tuning

Test Setup and Goals

I’m using a workstation with an NVIDIA RTX A6000 (48 GB VRAM). The aim is to:

  • Create a fresh environment and install prerequisites
  • Clone the KaniTTS repository
  • Launch the FastAPI web demo and let the model download
  • Measure VRAM consumption during synthesis
  • Evaluate response time and output quality across several voices and languages

Environment Preparation

  • Create and activate a new Python virtual environment
  • Install system and Python dependencies required by the project
  • Verify GPU visibility and driver/toolkit readiness
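
To check that last point, a quick probe with PyTorch is enough once it is installed (it comes in with the project's requirements, so assuming it is present in the environment):

```python
# gpu_check.py - confirm the GPU is visible before launching the demo.
# Assumes PyTorch is already installed in the active environment.
import torch

if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"CUDA OK: {name} with {total_gb:.1f} GB VRAM")
else:
    print("No CUDA device visible - check the driver and CUDA toolkit.")
```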

Repository and Web Demo

  • Clone the KaniTTS repository
  • Install the project’s Python requirements
  • Start the FastAPI demo server
  • Wait for the model to download on first run
  • Access the web UI at localhost on port 7860

Installation and Launch

Step-by-Step Guide

  1. Create a virtual environment
  • Set up a clean Python environment to avoid dependency conflicts.
  2. Install prerequisites
  • Install the required Python packages listed by the repository.
  3. Clone the repository
  • Fetch the KaniTTS codebase from its Git source.
  4. Launch the FastAPI demo
  • Start the server and confirm the app is serving on localhost:7860.
  • Allow time for the model weights to download on first run.
  • A scripted version of these steps is sketched after this list.
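
If you prefer to script the setup, a rough sketch with Python's subprocess module is below. The repository URL, requirements file name, and demo entry point are placeholders on my side; take the real values from the KaniTTS README.

```python
# setup_kanitts.py - rough automation of the manual steps above (a sketch,
# not the project's own installer). URL, requirements file, and entry point
# are placeholders; substitute the ones documented in the repository.
import subprocess
import sys
from pathlib import Path

REPO_URL = "https://github.com/<org>/kani-tts"  # placeholder: check the README
APP_ENTRY = "app.py"                            # placeholder: FastAPI demo script

def run(cmd, **kwargs):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True, **kwargs)

# 1. Create a virtual environment (on Windows the interpreter lives under
#    .venv\Scripts\python.exe instead).
run([sys.executable, "-m", "venv", ".venv"])
venv_python = str(Path(".venv", "bin", "python").resolve())

# 2. Clone the repository.
run(["git", "clone", REPO_URL, "kani-tts"])

# 3. Install the project's Python requirements into the new environment.
run([venv_python, "-m", "pip", "install", "-r", "kani-tts/requirements.txt"])

# 4. Launch the FastAPI demo; the first run downloads the model weights,
#    after which the UI is served on localhost:7860.
run([venv_python, APP_ENTRY], cwd="kani-tts")
```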

First Launch Notes

  • The web UI initializes quickly once dependencies are in place.
  • Model weights are fetched automatically by the demo script.
  • After download, the interface becomes responsive and ready for text prompts.

First Run and Performance Observations

The demo ran smoothly on the A6000. During synthesis, VRAM usage settled around 3.6 GB in my session. The model responded quickly to short prompts, and the generated audio felt snappy.

Response time was short enough to encourage interactive use. While per-prompt latency varies by preset and text length, turnaround stayed within a responsive range for live testing. The output was clean at 22 kHz, with strong articulation and styling.

Latency and Throughput

  • Reported reference: ~1 second to generate ~15 seconds of audio on an RTX 5080
  • Practical feel in the demo: quick initiation and fast completion for short prompts
  • Long-form synthesis also completed promptly, though longer text adds processing time
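
To put a number on "quick", I like computing a real-time factor (generation time divided by audio duration) for clips downloaded from the demo. A minimal sketch, assuming you timed the request yourself and saved the output as a WAV file:

```python
# rtf.py - real-time factor for a downloaded clip: generation time divided by
# audio duration. gen_seconds is whatever you measured with your own timer.
import wave

def audio_seconds(path: str) -> float:
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

gen_seconds = 1.2                        # example measurement, not a benchmark
clip_seconds = audio_seconds("clip.wav")
rtf = gen_seconds / clip_seconds
print(f"{clip_seconds:.1f} s of audio in {gen_seconds:.1f} s -> RTF {rtf:.2f}")
# The reported reference (~15 s of audio in ~1 s) corresponds to an RTF of roughly 0.07.
```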

Memory Footprint

  • Published target: ~2 GB VRAM for the reported throughput on an RTX 5080
  • Observed during testing: about 3.6 GB VRAM use on an A6000 with the web demo running
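
Whatever tool you use to watch VRAM, a small polling script is easier to log over a long session. A sketch using the NVIDIA management library bindings (install nvidia-ml-py separately; it is not part of the KaniTTS requirements as far as I know):

```python
# vram_watch.py - print GPU memory use once a second while the demo runs.
# Requires the NVML bindings: pip install nvidia-ml-py
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

try:
    while True:
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"VRAM used: {info.used / 1024**3:.2f} GB")
        time.sleep(1.0)
except KeyboardInterrupt:
    pynvml.nvmlShutdown()
```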

Multilingual Voices and Presets

KaniTTS includes multiple voices and presets across English, German, Chinese (Mandarin), Cantonese, Korean, Arabic, and Spanish. Switching voices is immediate in the web UI, and synthesis remains fast.

I evaluated several English presets (including British and female voices) and then cycled through available options in German, Mandarin, Cantonese, Korean, Arabic, and Spanish. Across languages, pronunciation quality was strong for short passages, and the model kept its responsive feel.

Language Coverage

  • English
  • German
  • Chinese (Mandarin)
  • Cantonese
  • Korean
  • Arabic
  • Spanish

Voice Presets and Style Controls

  • The catalog includes multiple presets with distinct timbres and styles.
  • “Legend mode” and other expressive presets are included.
  • Emotional coloration is present; outputs can convey emphasis and energy.

Expressivity and Artifacts

  • Expressivity is strong; the model conveys emotions and emphasis well.
  • Certain presets may introduce stutter or prosody drift in rare cases.
  • For conversational agents, that expressivity can add realism, but monitoring and prompt design help keep delivery consistent.

Quality Notes from Testing

The overall impression is positive: articulation is clear, the noise floor is low, and words are easy to follow. Short prompts consistently sounded natural and expressive. The speed and VRAM consumption make it practical for interactive applications.

I did notice occasional issues in extended passages, such as minor repetition or misplaced emphasis. These were not frequent but appeared more in long-form prompts than in short ones. Tuning presets and adjusting text input (for example, punctuation for phrasing) can help.

Observed Strengths

  • Quick synthesis for both short and long prompts
  • Clear pronunciation and expressive delivery
  • Good multilingual performance across the listed languages

Observed Limitations

  • Minor errors may appear in longer passages
  • Some presets can introduce noticeable style artifacts if pushed hard

Long-Form Synthesis

KaniTTS can handle paragraph-length text gracefully. Response times remain reasonable, and the output remains coherent across sentences. Quality is consistent with short-form results, though longer content increases the chance of small errors.

If you plan to produce multi-paragraph narration, consider practical steps:

  • Split very long passages into smaller segments
  • Add punctuation to guide phrasing
  • Select presets that prioritize stability for lengthier content
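
The first point, splitting long passages, is easy to automate with a sentence-aware chunker. This is purely a pre-processing helper on my side, not something the model requires:

```python
# chunk_text.py - split long narration into sentence-aligned chunks so each
# synthesis request stays short and phrasing stays predictable.
import re

def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    # Split on sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

print(chunk_text("First sentence. Second sentence! A third, much longer one?"))
```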

Running the FastAPI Demo

The provided demo app makes it simple to test the model locally. Once it’s running:

  • Enter text in the prompt field
  • Choose a language or voice preset
  • Trigger synthesis and listen to the generated audio
  • Download clips for offline checking or comparison

For measurement during stress testing:

  • Keep a GPU monitoring tool open to track VRAM use
  • Note time to first audio and completion time for long passages
  • Compare presets and languages for differences in latency and memory
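
For the timing notes, wrapping whatever triggers synthesis in a timer is all it takes. In the sketch below, the synthesize argument is a stand-in for however you invoke the model (a stopwatch and the web UI work just as well); only the bookkeeping pattern matters:

```python
# time_presets.py - record completion time per preset for the same sentence.
# `synthesize` is a stand-in for your actual way of triggering generation.
import time

def timed(synthesize, text: str, preset: str) -> float:
    start = time.perf_counter()
    synthesize(text, preset)
    return time.perf_counter() - start

def fake_synthesize(text, preset):
    time.sleep(0.5)  # placeholder so the example runs without the model

results = {p: timed(fake_synthesize, "Hello there.", p)
           for p in ["british", "female", "legend"]}  # preset names illustrative
print(results)
```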

VRAM and Throughput Summary

  • Startup with the web demo: comfortable headroom on a 48 GB GPU
  • Synthesis on short prompts: ~3.6 GB VRAM in my session
  • Reported throughput reference (RTX 5080): ~15 seconds of audio per ~1 second of generation at ~2 GB VRAM

These figures point to an efficient model that scales well for real-time needs. If you are targeting very low-latency pipelines, you can prioritize shorter input chunks, preselected presets, and a single-GPU deployment for serving.

Fine-Tuning and Extensibility

The team has released a base model suitable for fine-tuning. This is valuable if you want:

  • A custom voice tuned to specific data
  • Domain-specific pronunciation and jargon handling
  • Style controls aligned with brand or product voice

With Apache 2.0 licensing, integration into internal tooling and products is straightforward. Keep training datasets clean and well-labeled, and document your fine-tuning procedure to maintain reproducibility.

Practical Fine-Tuning Pointers

  • Prepare transcribed audio with consistent recording quality
  • Use balanced datasets to avoid undesirable style bias
  • Validate intermediate checkpoints on both short and long-form text
  • Monitor WER and subjective MOS during iteration
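
For the WER half of that last point, a plain word-level edit distance between a reference transcript and an ASR transcript of the synthesized audio is enough (the jiwer package computes the same metric if you prefer a dependency). A minimal sketch:

```python
# wer.py - word error rate via word-level Levenshtein distance.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / max(len(ref), 1)

print(wer("the quick brown fox", "the quick brown box"))  # 0.25
```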

Building for Low Latency

KaniTTS is built with responsiveness in mind. For production systems:

  • Keep the model warm and pinned to GPU memory
  • Reuse a single process with a small worker pool
  • Minimize text preprocessing latency and cache common prompts
  • Consider batching only if it does not add noticeable delay

These strategies help maintain the real-time feel that KaniTTS targets.
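
As a sketch of the first and third points, the pattern below loads the model once at process start and memoizes repeated prompts. load_model and the synthesis body are hypothetical stand-ins, since I have not wired this to the repository's actual Python API:

```python
# serve_warm.py - keep the model resident and cache repeated prompts.
# `load_model` and the body of `synthesize_cached` are hypothetical; swap in
# the real constructor and synthesis call from the KaniTTS codebase.
from functools import lru_cache

_model = None  # loaded once, stays pinned to GPU memory for the process lifetime

def load_model():
    global _model
    if _model is None:
        _model = object()  # stand-in for the real model constructor
    return _model

@lru_cache(maxsize=256)
def synthesize_cached(text: str, preset: str) -> bytes:
    load_model()
    # stand-in for the real synthesis call returning audio bytes
    return f"{preset}:{text}".encode()

load_model()  # warm at startup rather than on the first user request
print(len(synthesize_cached("Hello!", "british")))
```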

Practical Tips for Daily Use

  • Choose presets that match your application’s tone
  • For multilingual apps, preselect voices per language to avoid overhead
  • Save output clips to compare versions when tuning prompts
  • Keep an eye on VRAM through your preferred monitoring tool
  • For long passages, add punctuation and paragraph breaks to guide rhythm

Stability and Style Control

Expressive models can occasionally produce stutter or irregular prosody, especially with highly emotive presets. If you encounter instability:

  • Switch to a more neutral preset for long passages
  • Reduce exclamation-heavy phrasing and excessive emphasis markers
  • Break long inputs into sensible chunks and review per segment

This balance preserves expressivity while reducing the chance of artifacts.

Security and Deployment Considerations

  • Validate text inputs server-side to avoid injection into prompt fields
  • Limit maximum input length per request to protect performance
  • Log synthesis times and error rates for ongoing monitoring
  • Version your voices, presets, and configuration for consistent outputs

These guardrails help keep services reliable as usage scales.
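
The first two guardrails are straightforward if you put your own thin FastAPI layer in front of the model rather than exposing the demo directly. The /synthesize route and request shape below are my own assumptions, not part of the bundled demo:

```python
# guarded_api.py - a thin FastAPI wrapper that validates text and caps input
# length before anything reaches the synthesizer. Route and fields are
# assumptions; adapt them to your own serving code.
# Run with: uvicorn guarded_api:app
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field

app = FastAPI()

class SynthesisRequest(BaseModel):
    text: str = Field(min_length=1, max_length=2000)    # cap input size per request
    preset: str = Field(default="default", max_length=64)

@app.post("/synthesize")
def synthesize(req: SynthesisRequest):
    if not req.text.strip():
        raise HTTPException(status_code=422, detail="Empty text")
    # ...call into the model here, return audio, and log timing/errors...
    return {"status": "ok", "chars": len(req.text), "preset": req.preset}
```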

Measured Outcomes

From my run:

  • The web UI initialized cleanly after installation
  • VRAM usage was around 3.6 GB during active synthesis
  • Synthesis speed felt fast and responsive across presets
  • Multilingual voices produced clear results for short inputs
  • Long-form text worked well with occasional minor errors

These observations align with the model’s stated goals: fast generation, efficient memory use, and expressive delivery across several languages.

When to Consider KaniTTS

  • Interactive voice interfaces that need quick turnaround
  • Multilingual apps covering English, German, Chinese, Cantonese, Korean, Arabic, and Spanish
  • Projects that benefit from expressive presets with controllable style
  • Deployments that require permissive licensing and fine-tuning options

Conclusion

KaniTTS sets a strong bar for real-time, multilingual TTS. The pairing of an LLM-style tokenizer and a neural audio codec produces clean 22 kHz audio with fast responses and moderate VRAM needs. In testing, it ran smoothly, switched voices quickly, and delivered expressive speech with minimal friction.

The reported metrics (MOS around 4.3 and a word error rate under 5%) match the listening experience in short-form use. Long-form synthesis is also solid, with minor issues that can be mitigated through preset selection and prompt design. The availability of a fine-tunable base model under Apache 2.0 makes it a practical choice for both experiments and production.
