
KaniTTS Real-Time Text-to-Speech

Today I’m installing and stress-testing KaniTTS, a 370M-parameter real-time text-to-speech model designed for low latency and a small memory footprint. It pairs an LLM-style tokenizer with a neural audio codec to generate clean 22 kHz speech quickly.

The published performance targets are ambitious: about 1 second to generate 15 seconds of audio on an RTX 5080, using roughly 2 GB of VRAM. Reported quality lands around MOS 4.3 with a word error rate under 5%, which is considered strong for real-time synthesis. My goal here is to set it up, run the web demo, measure VRAM use during synthesis, and assess voice quality across multiple languages and presets.

What is KaniTTS?

KaniTTS is a multilingual TTS model aimed at responsive voice experiences. It is built around two core ideas: a tokenizer inspired by modern language models for text processing and a neural audio codec for efficient, high-quality waveform generation at 22 kHz.

The model focuses on fast synthesis, expressive presets, and broad language coverage. It is available under the Apache 2.0 license, and a base model is provided for fine-tuning. This makes it well-suited for conversational agents, voice interfaces, and any application that benefits from low-latency speech generation.

Overview of KaniTTS

Aspect | Details
Model size | ~370M parameters
Output | 22 kHz speech
Core components | LLM-style tokenizer + neural audio codec
Target latency | ~1 s to generate 15 s of audio on an RTX 5080 (reported)
Reported memory use | ~2 GB VRAM at the above throughput (reported)
License | Apache 2.0
Supported languages | English, German, Chinese (Mandarin), Cantonese, Korean, Arabic, Spanish
Presets | Multiple voice and style presets (e.g., British, English female, “legend mode”)
Interface | FastAPI web demo served locally on port 7860
Fine-tuning | Base model available for custom training

Key Features of KaniTTS

  • Real-time synthesis with consistent latency targets
  • Efficient memory use for GPU deployment
  • 22 kHz output that sounds clear and expressive
  • Multilingual support across key languages
  • Voice and style presets for quick customization
  • Apache 2.0 license and base model for fine-tuning

Test Setup and Goals

I’m using a workstation with an NVIDIA RTX A6000 (48 GB VRAM). The aim is to:

  • Create a fresh environment and install prerequisites
  • Clone the KaniTTS repository
  • Launch the FastAPI web demo and let the model download
  • Measure VRAM consumption during synthesis
  • Evaluate response time and output quality across several voices and languages

Environment Preparation

  • Create and activate a new Python virtual environment
  • Install system and Python dependencies required by the project
  • Verify GPU visibility and driver/toolkit readiness
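
To check that last point, a quick probe with PyTorch is enough once it is installed (it comes in with the project's requirements, so assuming it is present in the environment):

```python
# gpu_check.py - confirm the GPU is visible before launching the demo.
# Assumes PyTorch is already installed in the active environment.
import torch

if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"CUDA OK: {name} with {total_gb:.1f} GB VRAM")
else:
    print("No CUDA device visible - check the driver and CUDA toolkit.")
```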

Repository and Web Demo

  • Clone the KaniTTS repository
  • Install the project’s Python requirements
  • Start the FastAPI demo server
  • Wait for the model to download on first run
  • Access the web UI at localhost on port 7860

Installation and Launch

Step-by-Step Guide

  1. Create a virtual environment
  • Set up a clean Python environment to avoid dependency conflicts.
  2. Install prerequisites
  • Install the required Python packages listed by the repository.
  3. Clone the repository
  • Fetch the KaniTTS codebase from its Git source.
  4. Launch the FastAPI demo
  • Start the server and confirm the app is serving on localhost:7860.
  • Allow time for the model weights to download on first run.
  • A scripted version of these steps is sketched after this list.
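
If you prefer to script the setup, a rough sketch with Python's subprocess module is below. The repository URL, requirements file name, and demo entry point are placeholders on my side; take the real values from the KaniTTS README.

```python
# setup_kanitts.py - rough automation of the manual steps above (a sketch,
# not the project's own installer). URL, requirements file, and entry point
# are placeholders; substitute the ones documented in the repository.
import subprocess
import sys
from pathlib import Path

REPO_URL = "https://github.com/<org>/kani-tts"  # placeholder: check the README
APP_ENTRY = "app.py"                            # placeholder: FastAPI demo script

def run(cmd, **kwargs):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True, **kwargs)

# 1. Create a virtual environment (on Windows the interpreter lives under
#    .venv\Scripts\python.exe instead).
run([sys.executable, "-m", "venv", ".venv"])
venv_python = str(Path(".venv", "bin", "python").resolve())

# 2. Clone the repository.
run(["git", "clone", REPO_URL, "kani-tts"])

# 3. Install the project's Python requirements into the new environment.
run([venv_python, "-m", "pip", "install", "-r", "kani-tts/requirements.txt"])

# 4. Launch the FastAPI demo; the first run downloads the model weights,
#    after which the UI is served on localhost:7860.
run([venv_python, APP_ENTRY], cwd="kani-tts")
```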

First Launch Notes

  • The web UI initializes quickly once dependencies are in place.
  • Model weights are fetched automatically by the demo script.
  • After download, the interface becomes responsive and ready for text prompts.

First Run and Performance Observations

The demo ran smoothly on the A6000. During synthesis, VRAM usage settled around 3.6 GB in my session. The model responded quickly to short prompts, and the generated audio felt snappy.

Response time was short enough to encourage interactive use. While per-prompt latency varies by preset and text length, turnaround stayed within a responsive range for live testing. The output was clean at 22 kHz, with strong articulation and styling.

Latency and Throughput

  • Reported reference: ~1 second to generate ~15 seconds of audio on an RTX 5080
  • Practical feel in the demo: quick initiation and fast completion for short prompts
  • Long-form synthesis also completed promptly, though longer text adds processing time
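
To put a number on "quick", I like computing a real-time factor (generation time divided by audio duration) for clips downloaded from the demo. A minimal sketch, assuming you timed the request yourself and saved the output as a WAV file:

```python
# rtf.py - real-time factor for a downloaded clip: generation time divided by
# audio duration. gen_seconds is whatever you measured with your own timer.
import wave

def audio_seconds(path: str) -> float:
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

gen_seconds = 1.2                        # example measurement, not a benchmark
clip_seconds = audio_seconds("clip.wav")
rtf = gen_seconds / clip_seconds
print(f"{clip_seconds:.1f} s of audio in {gen_seconds:.1f} s -> RTF {rtf:.2f}")
# The reported reference (~15 s of audio in ~1 s) corresponds to an RTF of roughly 0.07.
```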

Memory Footprint

  • Published target: ~2 GB VRAM for the reported throughput on an RTX 5080
  • Observed during testing: about 3.6 GB VRAM use on an A6000 with the web demo running
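
Whatever tool you use to watch VRAM, a small polling script is easier to log over a long session. A sketch using the NVIDIA management library bindings (install nvidia-ml-py separately; it is not part of the KaniTTS requirements as far as I know):

```python
# vram_watch.py - print GPU memory use once a second while the demo runs.
# Requires the NVML bindings: pip install nvidia-ml-py
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

try:
    while True:
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"VRAM used: {info.used / 1024**3:.2f} GB")
        time.sleep(1.0)
except KeyboardInterrupt:
    pynvml.nvmlShutdown()
```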

Multilingual Voices and Presets

KaniTTS includes multiple voices and presets across English, German, Chinese (Mandarin), Cantonese, Korean, Arabic, and Spanish. Switching voices is immediate in the web UI, and synthesis remains fast.

I evaluated several English presets (including British and female voices) and then cycled through available options in German, Mandarin, Cantonese, Korean, Arabic, and Spanish. Across languages, pronunciation quality was strong for short passages, and the model kept its responsive feel.

Language Coverage

  • English
  • German
  • Chinese (Mandarin)
  • Cantonese
  • Korean
  • Arabic
  • Spanish

Voice Presets and Style Controls

  • The catalog includes multiple presets with distinct timbres and styles.
  • “Legend mode” and other expressive presets are included.
  • Emotional coloration is present; outputs can convey emphasis and energy.

Expressivity and Artifacts

  • Expressivity is strong; the model conveys emotions and emphasis well.
  • Certain presets may introduce stutter or prosody drift in rare cases.
  • For conversational agents, that expressivity can add realism, but monitoring and prompt design help keep delivery consistent.

Quality Notes from Testing

The overall impression is positive: articulation is clear, the noise floor is low, and words are easy to follow. Short prompts consistently sounded natural and expressive. The speed and VRAM consumption make it practical for interactive applications.

I did notice occasional issues in extended passages, such as minor repetition or misplaced emphasis. These were not frequent but appeared more in long-form prompts than in short ones. Tuning presets and adjusting text input (for example, punctuation for phrasing) can help.

Observed Strengths

  • Quick synthesis for both short and long prompts
  • Clear pronunciation and expressive delivery
  • Good multilingual performance across the listed languages

Observed Limitations

  • Minor errors may appear in longer passages
  • Some presets can introduce noticeable style artifacts if pushed hard

Long-Form Synthesis

KaniTTS can handle paragraph-length text gracefully. Response times remain reasonable, and the output remains coherent across sentences. Quality is consistent with short-form results, though longer content increases the chance of small errors.

If you plan to produce multi-paragraph narration, consider practical steps:

  • Split very long passages into smaller segments
  • Add punctuation to guide phrasing
  • Select presets that prioritize stability for lengthier content
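
The first point, splitting long passages, is easy to automate with a sentence-aware chunker. This is purely a pre-processing helper on my side, not something the model requires:

```python
# chunk_text.py - split long narration into sentence-aligned chunks so each
# synthesis request stays short and phrasing stays predictable.
import re

def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    # Split on sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

print(chunk_text("First sentence. Second sentence! A third, much longer one?"))
```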

Running the FastAPI Demo

The provided demo app makes it simple to test the model locally. Once it’s running:

  • Enter text in the prompt field
  • Choose a language or voice preset
  • Trigger synthesis and listen to the generated audio
  • Download clips for offline checking or comparison

For measurement during stress testing:

  • Keep a GPU monitoring tool open to track VRAM use
  • Note time to first audio and completion time for long passages
  • Compare presets and languages for differences in latency and memory
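
For the timing notes, wrapping whatever triggers synthesis in a timer is all it takes. In the sketch below, the synthesize argument is a stand-in for however you invoke the model (a stopwatch and the web UI work just as well); only the bookkeeping pattern matters:

```python
# time_presets.py - record completion time per preset for the same sentence.
# `synthesize` is a stand-in for your actual way of triggering generation.
import time

def timed(synthesize, text: str, preset: str) -> float:
    start = time.perf_counter()
    synthesize(text, preset)
    return time.perf_counter() - start

def fake_synthesize(text, preset):
    time.sleep(0.5)  # placeholder so the example runs without the model

results = {p: timed(fake_synthesize, "Hello there.", p)
           for p in ["british", "female", "legend"]}  # preset names illustrative
print(results)
```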

VRAM and Throughput Summary

  • Startup with the web demo: comfortable headroom on a 48 GB GPU
  • Synthesis on short prompts: ~3.6 GB VRAM in my session
  • Reported throughput reference (RTX 5080): ~15 seconds of audio per ~1 second of generation at ~2 GB VRAM

These figures point to an efficient model that scales well for real-time needs. If you are targeting very low-latency pipelines, you can prioritize shorter input chunks, preselected presets, and a single-GPU deployment for serving.

Fine-Tuning and Extensibility

The team has released a base model suitable for fine-tuning. This is valuable if you want:

  • A custom voice tuned to specific data
  • Domain-specific pronunciation and jargon handling
  • Style controls aligned with brand or product voice

With Apache 2.0 licensing, integration into internal tooling and products is straightforward. Keep training datasets clean and well-labeled, and document your fine-tuning procedure to maintain reproducibility.

Practical Fine-Tuning Pointers

  • Prepare transcribed audio with consistent recording quality
  • Use balanced datasets to avoid undesirable style bias
  • Validate intermediate checkpoints on both short and long-form text
  • Monitor WER and subjective MOS during iteration
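
For the WER half of that last point, a plain word-level edit distance between a reference transcript and an ASR transcript of the synthesized audio is enough (the jiwer package computes the same metric if you prefer a dependency). A minimal sketch:

```python
# wer.py - word error rate via word-level Levenshtein distance.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / max(len(ref), 1)

print(wer("the quick brown fox", "the quick brown box"))  # 0.25
```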

Building for Low Latency

KaniTTS is built with responsiveness in mind. For production systems:

  • Keep the model warm and pinned to GPU memory
  • Reuse a single process with a small worker pool
  • Minimize text preprocessing latency and cache common prompts
  • Consider batching only if it does not add noticeable delay

These strategies help maintain the real-time feel that KaniTTS targets.
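
As a sketch of the first and third points, the pattern below loads the model once at process start and memoizes repeated prompts. load_model and the synthesis body are hypothetical stand-ins, since I have not wired this to the repository's actual Python API:

```python
# serve_warm.py - keep the model resident and cache repeated prompts.
# `load_model` and the body of `synthesize_cached` are hypothetical; swap in
# the real constructor and synthesis call from the KaniTTS codebase.
from functools import lru_cache

_model = None  # loaded once, stays pinned to GPU memory for the process lifetime

def load_model():
    global _model
    if _model is None:
        _model = object()  # stand-in for the real model constructor
    return _model

@lru_cache(maxsize=256)
def synthesize_cached(text: str, preset: str) -> bytes:
    load_model()
    # stand-in for the real synthesis call returning audio bytes
    return f"{preset}:{text}".encode()

load_model()  # warm at startup rather than on the first user request
print(len(synthesize_cached("Hello!", "british")))
```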

Practical Tips for Daily Use

  • Choose presets that match your application’s tone
  • For multilingual apps, preselect voices per language to avoid overhead
  • Save output clips to compare versions when tuning prompts
  • Keep an eye on VRAM through your preferred monitoring tool
  • For long passages, add punctuation and paragraph breaks to guide rhythm

Stability and Style Control

Expressive models can occasionally produce stutter or irregular prosody, especially with highly emotive presets. If you encounter instability:

  • Switch to a more neutral preset for long passages
  • Reduce exclamation-heavy phrasing and excessive emphasis markers
  • Break long inputs into sensible chunks and review per segment

This balance preserves expressivity while reducing the chance of artifacts.

Security and Deployment Considerations

  • Validate text inputs server-side to avoid injection into prompt fields
  • Limit maximum input length per request to protect performance
  • Log synthesis times and error rates for ongoing monitoring
  • Version your voices, presets, and configuration for consistent outputs

These guardrails help keep services reliable as usage scales.
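
The first two guardrails are straightforward if you put your own thin FastAPI layer in front of the model rather than exposing the demo directly. The /synthesize route and request shape below are my own assumptions, not part of the bundled demo:

```python
# guarded_api.py - a thin FastAPI wrapper that validates text and caps input
# length before anything reaches the synthesizer. Route and fields are
# assumptions; adapt them to your own serving code.
# Run with: uvicorn guarded_api:app
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field

app = FastAPI()

class SynthesisRequest(BaseModel):
    text: str = Field(min_length=1, max_length=2000)    # cap input size per request
    preset: str = Field(default="default", max_length=64)

@app.post("/synthesize")
def synthesize(req: SynthesisRequest):
    if not req.text.strip():
        raise HTTPException(status_code=422, detail="Empty text")
    # ...call into the model here, return audio, and log timing/errors...
    return {"status": "ok", "chars": len(req.text), "preset": req.preset}
```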

Measured Outcomes

From my run:

  • The web UI initialized cleanly after installation
  • VRAM usage was around 3.6 GB during active synthesis
  • Synthesis speed felt fast and responsive across presets
  • Multilingual voices produced clear results for short inputs
  • Long-form text worked well with occasional minor errors

These observations align with the model’s stated goals: fast generation, efficient memory use, and expressive delivery across several languages.

When to Consider KaniTTS

  • Interactive voice interfaces that need quick turnaround
  • Multilingual apps covering English, German, Chinese, Cantonese, Korean, Arabic, and Spanish
  • Projects that benefit from expressive presets with controllable style
  • Deployments that require permissive licensing and fine-tuning options

Conclusion

KaniTTS sets a strong bar for real-time, multilingual TTS. The pairing of an LLM-style tokenizer and a neural audio codec produces clean 22 kHz audio with fast responses and moderate VRAM needs. In testing, it ran smoothly, switched voices quickly, and delivered expressive speech with minimal friction.

The reported metrics (MOS around 4.3 and a word error rate under 5%) match the listening experience in short-form use. Long-form synthesis is also solid, with minor issues that can be mitigated through preset selection and prompt design. The availability of a fine-tunable base model under Apache 2.0 makes it a practical choice for both experiments and production.
