Step Audio 2 Mini: New TTS launched by StepFun AI


Step Audio 2 Mini: Complete Guide and Hands-On Test

Less than 24 hours after I covered GPT Realtime, a Chinese company called StepFun released the Step Audio 2 Mini model. This is an 8-billion-parameter speech-to-speech model that manages to outperform GPT-4o Audio on certain benchmarks. What makes it even more attractive is that it is released under the Apache 2.0 license.

In this article, I will walk you through everything from installation to testing: the architecture, key features, real examples, and performance observations. I'll also share my hands-on results from running it on an Ubuntu system with a powerful GPU.


What is Step Audio 2 Mini?

Step Audio 2 Mini is a speech-to-speech model developed by StepFun. It integrates audio directly into the language modeling pipeline, which means it can process and generate both text and audio tokens in a unified sequence.

At its core, it starts with a frozen pre-trained audio encoder, then passes outputs through an audio adapter, and finally into a large language model (LLM) decoder. The result is a system capable of true end-to-end voice interaction.
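To make that data flow concrete, here is a minimal PyTorch sketch of the adapter stage. The class name, feature dimension, and downsampling factor are illustrative assumptions on my part, not StepFun's actual implementation; the point is only to show how frozen-encoder features get reduced to a lower token rate before reaching the LLM decoder.

# Illustrative sketch of the audio-adapter stage (assumed shapes, not the real model).
import torch
import torch.nn as nn

class AudioAdapter(nn.Module):
    """Downsamples frozen-encoder features to a lower token rate (~12.5 Hz)."""
    def __init__(self, dim: int = 1024, stride: int = 4):
        super().__init__()
        self.proj = nn.Conv1d(dim, dim, kernel_size=stride, stride=stride)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, dim) -> (batch, time // stride, dim)
        return self.proj(feats.transpose(1, 2)).transpose(1, 2)

encoder_feats = torch.randn(1, 200, 1024)   # stand-in for frozen encoder output (~4 s of audio)
audio_tokens = AudioAdapter()(encoder_feats)
print(audio_tokens.shape)                   # torch.Size([1, 50, 1024]) -> fed to the LLM decoder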


Table Overview

Here's a quick table summarizing the essential information about Step Audio 2 Mini:

Feature | Details
Model Name | Step Audio 2 Mini
Developer | StepFun (China)
Parameters | 8 billion
License | Apache 2.0
Type | Speech-to-Speech (S2S)
Encoder | Frozen pre-trained audio encoder
Adapter | Downsamples audio features to ~12.5 Hz for efficiency
Decoder | LLM decoder with mixed text-audio handling
Tokenizer | CosyVoice 2-based tokenizer + HiFi-GAN vocoder
Key Abilities | Voice interaction, context-aware responses, tool calling, multimodal RAG
GPU Tested On | Nvidia RTX A6000 (48 GB VRAM)
Demo Interface | Gradio (port 7860)

Key Features of Step Audio 2 Mini

  • Speech-to-Speech Real-Time Capability: Accepts voice input and generates spoken voice output.

  • Unified Sequence Processing: Handles both text and audio tokens in one pipeline.

  • Architecture:

    • Frozen pre-trained audio encoder.
    • Audio adapter that downsamples for efficiency.
    • LLM decoder trained to handle mixed audio and text.
    • Flow-matching module and HiFi-GAN vocoder for waveform synthesis.
  • Multimodal RAG (Retrieval-Augmented Generation): Can fetch real-time information such as weather, time, or even web content.

  • Dynamic Voice Features: Access to a large voice library to mimic different speaking styles or switch vocal tones.

  • Open License: Released under Apache 2.0, which makes it free to use for research and commercial purposes.

  • Context-Aware Conversations: Maintains context across sessions with intelligent dialogue.


Step-by-Step Guide: Installing Step Audio 2 Mini

I installed Step Audio 2 Mini on Ubuntu with an Nvidia RTX A6000 (48 GB VRAM). Here's the process step by step:

1. Create a Virtual Environment

First, I created a virtual environment to keep everything clean and isolated.

python3 -m venv step-audio-env
source step-audio-env/bin/activate

2. Clone the Repository

Next, I cloned the base repository and then the mini version inside it.

git clone <base_repo>
cd <base_repo>
git clone <step_audio_mini_repo>

3. Install Dependencies

Now, I installed the required packages:

  • Transformers
  • TorchAudio
  • Gradio
  • Other supporting libraries

pip install -r requirements.txt
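After the install finished, I find it useful to run a quick sanity check to confirm that the key libraries import and the GPU is visible before launching anything. This snippet only assumes the four packages above ended up in the environment:

# Quick post-install sanity check.
import torch
import torchaudio
import transformers
import gradio

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("torchaudio:", torchaudio.__version__)
print("transformers:", transformers.__version__)
print("gradio:", gradio.__version__)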

4. Launch Gradio Demo

Once the installation was complete, I launched the Gradio demo from the repo root:

python app.py

This started the demo on localhost:7860.
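For reference, this is roughly how a Gradio app ends up listening on port 7860. It is only a minimal stand-in, not the repo's actual app.py, and the respond function is a placeholder for the real model call:

# Minimal Gradio demo bound to localhost:7860 (placeholder logic, not the real app).
import gradio as gr

def respond(audio_path: str) -> str:
    # Placeholder: the real demo would run Step Audio 2 Mini on this file.
    return f"Received audio at: {audio_path}"

demo = gr.Interface(fn=respond, inputs=gr.Audio(type="filepath"), outputs="text")
demo.launch(server_name="0.0.0.0", server_port=7860)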


Testing Step Audio 2 Mini

When I first launched the demo, it displayed a Chinese system prompt. Translated into English, it introduces the voice assistant: the default chatbot identity is called Shavee, trained by the StepAR company, and designed with delicate emotional understanding and observational skills. It is instructed to communicate in a default female voice.

From here, I tested different features.


Speech-to-Speech Test 1: Uploading Audio

I uploaded an audio file asking: "What is happiness?"

The response generated was:

"Happiness to me is that warm feeling when you're with people you love and really care about. It's that sweet moment when you see someone smile or hear a voice and suddenly your heart just feels lighter…"

This response was natural, emotional, and meaningful. The VRAM consumption was about 24 GB and response time was ~30 seconds.
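For anyone wanting to reproduce these numbers, the latency and peak-VRAM figures I quote in this article came from a small wrapper like the one below. generate_reply is a hypothetical stand-in for the actual inference call, so treat this as a measurement sketch rather than the model's API:

# Measure wall-clock latency and peak GPU memory around one generation call.
import time
import torch

def generate_reply(audio_path: str) -> str:
    # Hypothetical stand-in: replace with the real Step Audio 2 Mini call.
    return "..."

torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()
reply = generate_reply("what_is_happiness.wav")
latency_s = time.perf_counter() - start
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"latency: {latency_s:.1f} s | peak VRAM: {peak_gb:.1f} GB")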


Speech-to-Speech Test 2: Emotional Audio

I then uploaded a long, emotion-filled audio clip about struggling for years, finally succeeding, and feeling grateful. The model's response was empathetic:

"Wow, this is absolutely amazing. 3 years of hard work and perseverance have finally paid off. Congratulations… This is truly a life-changing moment."

The response quality was excellent, but the latency was close to 15 minutes. This is a serious limitation for real-time conversations.


Speech-to-Speech Test 3: Real-Time Question

I asked: "What is the weather in Sydney, Australia today?"

The model responded:

"Today in Sydney, Australia, it's mostly sunny with a temperature of about 30°C. The wind is blowing at about 10 km/h."

This was surprisingly accurate and shows that the model can fetch real-time data.
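The likely mechanism behind this is the tool-calling plus multimodal RAG ability listed earlier: the model requests a tool, the host application runs it, and the result is fed back for the model to verbalize. Below is a hedged sketch of that pattern; get_weather and the message format are my own assumptions, not the model's documented API:

# Hypothetical tool-calling round trip for the weather question.
import json

def get_weather(city: str) -> dict:
    # Stand-in for a real weather API call.
    return {"city": city, "condition": "mostly sunny", "temp_c": 30, "wind_kmh": 10}

messages = [
    {"role": "user", "content": "What is the weather in Sydney, Australia today?"},
    {"role": "tool", "content": json.dumps(get_weather("Sydney, Australia"))},
]
# The model then turns the tool result into the spoken reply quoted above.
print(messages)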


Retrieval-Augmented Generation (RAG) Test

Finally, I tested whether it could use custom context. I provided additional information about myself (Fad Midsa) and asked it to recall that information. The model successfully remembered it and gave contextual answers:

"Sure. I'd love to check out his channel. It sounds really interesting. What topics do they usually cover?"

This confirms that multimodal RAG works as intended.
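Conceptually, the test boils down to retrieving the most relevant snippet from the custom context and prepending it to the question. The sketch below uses naive keyword overlap as the retriever purely for illustration, and the facts are placeholders standing in for the context I actually provided; the demo's real retrieval pipeline is not documented here:

# Toy retrieval-augmented prompt assembly (illustrative only).
facts = [
    "Fad Midsa runs a tech channel covering AI tools and tutorials.",
    "Step Audio 2 Mini was tested on an RTX A6000 with 48 GB of VRAM.",
]
question = "What topics does his channel usually cover?"

def retrieve(query: str, docs: list) -> str:
    # Pick the document sharing the most words with the query.
    words = set(query.lower().split())
    return max(docs, key=lambda d: len(words & set(d.lower().split())))

prompt = f"Context: {retrieve(question, facts)}\n\nUser: {question}"
print(prompt)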


Performance Observations

  • Quality: Voice responses were natural, emotional, and context-aware.

  • Latency: Real-time capability is still limited. Some queries took up to 15 minutes.

  • VRAM Usage: Around 24 GB of GPU memory was required for running.

  • Strengths:

    • High quality speech-to-speech responses.
    • Ability to handle emotions.
    • Retrieval and tool-calling capability.
  • Weaknesses:

    • Response time is slow for long or complex queries.
    • Microphone input was sometimes not detected properly (likely environment-specific).

FAQs About Step Audio 2 Mini

Q1: What is Step Audio 2 Mini? It is an 8B speech-to-speech model by StepFun, capable of processing text and audio in one unified sequence.

Q2: Is it free to use? Yes, it is licensed under Apache 2.0, so it's free for research and commercial projects.

Q3: What GPU is needed to run it? A high-memory GPU (24 GB VRAM or more) is recommended. It worked on my Nvidia RTX A6000 with 48 GB VRAM.

Q4: Can it fetch real-time information? Yes, using multimodal RAG, it can provide data like weather, time, or web content.

Q5: How long does it take to respond? Simple queries take about 30 seconds, but complex, emotion-filled audios may take up to 15 minutes.

Q6: Can I change the voice output style? Yes, it supports switching between different speaking styles and vocal tones dynamically.


Final Thoughts

Step Audio 2 Mini is an impressive open-source speech-to-speech model. While the response quality is very high, the latency is a big drawback for now. It shows huge potential for future conversational AI systems, especially since it is Apache 2.0 licensed.

If the developers manage to optimize the response speed, Step Audio 2 Mini could become one of the most practical and powerful speech-to-speech AI models available today.
