Step Audio 2 Mini: New Speech-to-Speech Model Launched by StepFun AI

Table of Contents
- Step Audio 2 Mini: Complete Guide and Hands-On Test
- What is Step Audio 2 Mini?
- Table Overview
- Key Features of Step Audio 2 Mini
- Step-by-Step Guide: Installing Step Audio 2 Mini
- 1. Create a Virtual Environment
- 2. Clone the Repository
- 3. Install Dependencies
- 4. Launch Gradio Demo
- Testing Step Audio 2 Mini
- Speech-to-Speech Test 1: Uploading Audio
- Speech-to-Speech Test 2: Emotional Audio
- Speech-to-Speech Test 3: Real-Time Question
- Retrieval-Augmented Generation (RAG) Test
- Performance Observations
- FAQs About Step Audio 2 Mini
- Final Thoughts
Step Audio 2 Mini: Complete Guide and Hands-On Test
Less than 24 hours after I covered GPT Realtime, a Chinese company called StepFun released the Step Audio 2 Mini model. It is an 8-billion-parameter speech-to-speech model that manages to outperform GPT-4o Audio on certain benchmarks. What makes it even more attractive is that it is fully released under the Apache 2.0 license.
In this article, I will walk you through everything from installation to testing, the architecture, features, real examples, and performance observations. I'll also share my hands-on results while using it on an Ubuntu system with a powerful GPU.
What is Step Audio 2 Mini?
Step Audio 2 Mini is a speech-to-speech model developed by StepFun. It integrates audio directly into the language modeling pipeline, which means it can process and generate both text and audio tokens in a unified sequence.
At its core, it starts with a frozen pre-trained audio encoder, then passes outputs through an audio adapter, and finally into a large language model (LLM) decoder. The result is a system capable of true end-to-end voice interaction.
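To make that data flow concrete, here is a minimal, purely illustrative PyTorch sketch of such a pipeline. All module names, dimensions, and the dummy stand-ins are my own assumptions for illustration; they are not the actual Step Audio 2 Mini code, which ships with the released repository.

```python
# Conceptual sketch only: shapes and components are illustrative assumptions,
# not the real Step Audio 2 Mini implementation.
import torch
import torch.nn as nn

class AudioAdapter(nn.Module):
    """Downsamples encoder features (e.g. toward ~12.5 Hz) and projects them into the LLM embedding space."""
    def __init__(self, enc_dim=1280, llm_dim=4096, stride=4):
        super().__init__()
        self.downsample = nn.Conv1d(enc_dim, llm_dim, kernel_size=stride, stride=stride)

    def forward(self, feats):            # feats: (batch, time, enc_dim)
        x = feats.transpose(1, 2)        # -> (batch, enc_dim, time)
        x = self.downsample(x)           # -> (batch, llm_dim, time/stride)
        return x.transpose(1, 2)         # -> (batch, time/stride, llm_dim)

class SpeechToSpeechSketch(nn.Module):
    """Frozen audio encoder -> adapter -> LLM decoder -> vocoder stage."""
    def __init__(self, encoder, adapter, llm_decoder, vocoder):
        super().__init__()
        self.encoder = encoder.eval()    # frozen pre-trained encoder
        for p in self.encoder.parameters():
            p.requires_grad = False
        self.adapter = adapter
        self.llm_decoder = llm_decoder   # decodes the mixed text/audio sequence
        self.vocoder = vocoder           # stands in for flow-matching + HiFi-GAN synthesis

    @torch.no_grad()
    def forward(self, features):
        feats = self.encoder(features)          # acoustic features
        embeds = self.adapter(feats)            # LLM-space embeddings
        tokens = self.llm_decoder(embeds)       # interleaved text + audio representations
        return self.vocoder(tokens)             # synthesized waveform

if __name__ == "__main__":
    # Dummy stand-ins so the sketch runs end to end; the real components come from the released checkpoints.
    encoder = nn.Linear(80, 1280)               # pretend log-mel frames -> encoder features
    adapter = AudioAdapter(enc_dim=1280, llm_dim=4096, stride=4)
    llm_decoder = nn.Linear(4096, 4096)         # placeholder for the 8B decoder
    vocoder = nn.Linear(4096, 1)                # placeholder for the waveform stage
    model = SpeechToSpeechSketch(encoder, adapter, llm_decoder, vocoder)
    fake_mel = torch.randn(1, 200, 80)          # (batch, frames, mel bins)
    print(model(fake_mel).shape)
```

The point is only the ordering: frozen encoder features are downsampled by the adapter, decoded by the LLM into a mixed token stream, and turned back into speech by the vocoder stage.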
Table Overview
Here's a quick table summarizing the essential information about Step Audio 2 Mini:
| Feature | Details |
|---|---|
| Model Name | Step Audio 2 Mini |
| Developer | StepFun (China) |
| Parameters | 8 Billion |
| License | Apache 2.0 |
| Type | Speech-to-Speech (S2S) |
| Encoder | Frozen pre-trained audio encoder |
| Adapter | Downsamples to ~12.5 Hz for efficiency |
| Decoder | LLM decoder with mixed text-audio handling |
| Tokenizer | CosyVoice2-based tokenizer + HiFi-GAN vocoder |
| Key Abilities | Voice interaction, context-aware responses, tool calling, multimodal RAG |
| GPU Tested On | NVIDIA RTX A6000 (48 GB VRAM) |
| Demo Interface | Gradio (port 7860) |
Key Features of Step Audio 2 Mini
- Speech-to-Speech Real-Time Capability: Accepts voice input and generates spoken voice output.
- Unified Sequence Processing: Handles both text and audio tokens in one pipeline.
- Architecture:
  - Frozen pre-trained audio encoder.
  - Audio adapter that downsamples for efficiency.
  - LLM decoder trained to handle mixed audio and text.
  - Flow-matching module and HiFi-GAN vocoder for waveform synthesis.
- Multimodal RAG (Retrieval-Augmented Generation): Can fetch real-time information such as weather, time, or even web content (see the tool-calling sketch after this list).
- Dynamic Voice Features: Access to a large voice library to mimic different speaking styles or switch vocal tones.
- Open License: Released under Apache 2.0, which makes it free to use for research and commercial purposes.
- Context-Aware Conversations: Maintains context across sessions with intelligent dialogue.
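For the multimodal RAG and tool-calling feature mentioned above, the sketch below shows how a generic tool-calling loop around a speech model can work: the model emits a tool request, the host fetches the data, and the result is fed back for the final answer. The JSON call format, the get_weather tool, and the scripted "model" are assumptions made for this sketch, not Step Audio 2 Mini's actual protocol.

```python
# Illustrative tool-calling loop; tag names and tools are assumptions for the sketch.
import json

def fake_weather_tool(city: str) -> dict:
    # Stand-in for a real weather API call.
    return {"city": city, "condition": "mostly sunny", "temp_c": 30, "wind_kmh": 10}

TOOLS = {"get_weather": fake_weather_tool}

def run_turn(model_generate, user_text: str, max_tool_rounds: int = 3) -> str:
    """model_generate(messages) -> str is a stand-in for the model's text channel."""
    messages = [{"role": "user", "content": user_text}]
    for _ in range(max_tool_rounds):
        reply = model_generate(messages)
        # Assume tool calls arrive as a JSON object: {"tool": ..., "args": {...}}
        try:
            call = json.loads(reply)
        except json.JSONDecodeError:
            return reply                              # plain answer, no tool call
        result = TOOLS[call["tool"]](**call["args"])  # run the requested tool
        messages.append({"role": "tool", "content": json.dumps(result)})
    return reply

# Quick demo with a scripted "model": first it requests the weather tool, then it answers.
script = iter([
    '{"tool": "get_weather", "args": {"city": "Sydney"}}',
    "Today in Sydney it's mostly sunny, about 30°C with light wind.",
])
print(run_turn(lambda msgs: next(script), "What is the weather in Sydney today?"))
```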
Step-by-Step Guide: Installing Step Audio 2 Mini
I installed Step Audio 2 Mini on Ubuntu with an NVIDIA RTX A6000 (48 GB VRAM). Here's the process, step by step:
1. Create a Virtual Environment
First, I created a virtual environment to keep everything clean and isolated.
```bash
python3 -m venv step-audio-env
source step-audio-env/bin/activate
```
2. Clone the Repository
Next, I cloned the base repository and then the mini version inside it.
```bash
git clone <base_repo>
cd <base_repo>
git clone <step_audio_mini_repo>
```
3. Install Dependencies
Now, I installed the required packages:
- Transformers
- Torch Audio
- Gradio
- Other supporting libraries
```bash
pip install -r requirements.txt
```
4. Launch Gradio Demo
Once installation completed, I launched the Gradio demo from the repo root:
```bash
python app.py
```
This started the demo on localhost:7860.
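With the demo running, you can also drive it from a script instead of the browser. Below is a sketch using the gradio_client package; the endpoint name (/predict) and the single audio argument are assumptions, since the real signature depends on how app.py defines its interface. Calling client.view_api() shows the endpoints that are actually exposed.

```python
# Scripted access to the running demo via the gradio_client package.
# The api_name and argument layout below are assumptions; check view_api() first.
from gradio_client import Client, handle_file

client = Client("http://localhost:7860")
client.view_api()                        # prints the endpoints app.py actually exposes

result = client.predict(
    handle_file("question.wav"),         # hypothetical audio input slot
    api_name="/predict",                 # replace with the real endpoint name
)
print(result)
```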
Testing Step Audio 2 Mini
When I first launched the demo, it displayed a Chinese system prompt. After I translated it into English, I could see that it introduced the voice assistant: the default chatbot identity is called Shavee, trained by StepAR Company, designed with delicate emotional understanding and observational skills, and instructed to communicate in a default female voice.
From here, I tested different features.
Speech-to-Speech Test 1: Uploading Audio
I uploaded an audio file asking: "What is happiness?"
The response generated was:
"Happiness to me is that warm feeling when you're with people you love and really care about. It's that sweet moment when you see someone smile or hear a voice and suddenly your heart just feels lighter…"
This response was natural, emotional, and meaningful. VRAM consumption was about 24 GB, and the response time was roughly 30 seconds.
Speech-to-Speech Test 2: Emotional Audio
I then uploaded a long, emotion-filled audio clip about struggling for years, finally succeeding, and feeling grateful. The model's response was empathetic:
"Wow, this is absolutely amazing. 3 years of hard work and perseverance have finally paid off. Congratulations… This is truly a life-changing moment."
The response quality was excellent, but the latency was close to 15 minutes. This is a serious limitation for real-time conversations.
Speech-to-Speech Test 3: Real-Time Question
I asked: "What is the weather in Sydney, Australia today?"
The model responded:
"Today in Sydney, Australia, it's mostly sunny with a temperature of about 30°C. The wind is blowing at about 10 km/h."
This was surprisingly accurate and shows that the model can fetch real-time data.
Retrieval-Augmented Generation (RAG) Test
Finally, I tested whether it could use custom context. I provided additional information about myself (Fad Midsa) and asked it to recall that information. The model successfully remembered it and gave contextual answers:
"Sure. I'd love to check out his channel. It sounds really interesting. What topics do they usually cover?"
This confirms that multimodal RAG works as intended.
Performance Observations
- Quality: Voice responses were natural, emotional, and context-aware.
- Latency: Real-time capability is still limited. Some queries took up to 15 minutes.
- VRAM Usage: Around 24 GB of GPU memory was required to run the model (a small measurement sketch follows this list).
- Strengths:
  - High-quality speech-to-speech responses.
  - Ability to handle emotions.
  - Retrieval and tool-calling capability.
- Weaknesses:
  - Response time is slow for long or complex queries.
  - Microphone input was sometimes not detected properly (likely environment-specific).
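If you want to reproduce the latency and VRAM numbers above, a small wrapper like the following works. The run_inference callable is a placeholder for however you invoke the model locally (for example, a lambda around a hypothetical generate call); nothing here is specific to Step Audio 2 Mini.

```python
# Minimal timing / peak-VRAM measurement around one inference call (PyTorch, CUDA).
import time
import torch

def measure(run_inference):
    torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    output = run_inference()                 # placeholder for the actual model call
    torch.cuda.synchronize()                 # make sure GPU work is finished before timing
    elapsed = time.perf_counter() - start
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"latency: {elapsed:.1f} s, peak VRAM: {peak_gb:.1f} GB")
    return output
```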
FAQs About Step Audio 2 Mini
Q1: What is Step Audio 2 Mini? It is an 8B speech-to-speech model by StepFun, capable of processing text and audio in one sequence.
Q2: Is it free to use? Yes, it is licensed under Apache 2.0, so it's free for research and commercial projects.
Q3: What GPU is needed to run it? A high-memory GPU (24 GB VRAM or more) is recommended. It worked on my NVIDIA RTX A6000 with 48 GB VRAM.
Q4: Can it fetch real-time information? Yes, using multimodal RAG, it can provide data like weather, time, or web content.
Q5: How long does it take to respond? Simple queries take about 30 seconds, but long, emotion-filled audio clips may take up to 15 minutes.
Q6: Can I change the voice output style? Yes, it supports switching between different speaking styles and vocal tones dynamically.
Final Thoughts
Step Audio 2 Mini is an impressive open-source speech-to-speech model. While the response quality is very high, the latency is a big drawback for now. It shows huge potential for future conversational AI systems, especially since it's Apache licensed.
If the developers manage to optimize the response speed, Step Audio 2 Mini could become one of the most practical and powerful speech-to-speech AI models available today.