Klear-46B MoE by Kwai: Local Install Guide

Table of Contents
- What Is Klear-46B MoE?
- Table Overview: Klear-46B MoE
- Key Features of Klear-46B MoE
- How Klear-46B MoE Works
- How to Use Klear-46B MoE
- Setup and Install
- API Pattern
- Real-World Checks
- Code Generation: p5.js Interactive HTML
- Multilingual Check: Wide Translation Sweep
- Strategic Writing Under Pressure: Executive Memo as a Business Fable
- Step-by-Step: From Install to First Responses
- Prepare the Environment
- Serve the Model
- Call the API
- Practical Notes and Observations
- Performance Feel
- Memory and Activation
- Output Quality Across Tasks
- FAQs
- Is Klear-46B MoE multilingual?
- What are the main variants?
- How many experts are active per token?
- What attention and architecture details stand out?
- What hardware did you run?
- Does it work with OpenAI-style APIs?
- How does it handle long context?
- How was the model trained?
- What are the real-world strengths observed?
- Conclusion
I set out to install and test Klear-46B MoE locally, then push it through practical tasks: code generation, multilingual translation, and high-pressure strategic writing. The goal was to see how far a sparse Mixture-of-Experts model can go on a single high-memory GPU while staying efficient and responsive.
Klear-46B MoE wakes only a fraction of its total parameters for each token. It promises dense-model quality without dense-model cost. I ran it with an OpenAI-compatible API via vLLM, kept the setup simple, and evaluated results directly.
What Is Klear-46B MoE?
Klear-46B MoE is a sparse Mixture-of-Experts large language model from Kwai’s Klear team. It combines a large total parameter count with selective activation to save memory and compute per token.
- Total parameters: 46B, with about 2.5B active per token during inference
- Variants: Base and instruction-tuned (DPO); a reasoning-enhanced variant is planned
- Training: 22T tokens via a progressive curriculum
- Broadly filtered web data to start
- Gradual ramp-up of math, coding, and STEM to sharpen reasoning
- Final stretch with synthetic, reasoning-heavy, long-context material
The design targets real workloads with long context, improved reasoning, and better deployment economics.
Table Overview: Klear-46B MoE
Attribute | Detail |
---|---|
Model Type | Sparse Mixture-of-Experts (MoE) |
Total Parameters | 46B |
Active Parameters per Token | ~2.5B |
Experts | 256 in feed-forward blocks |
Activated per Layer | 8 experts + 1 shared/router expert |
Layers | 32 |
Hidden Size | ~2k |
Attention Heads | 32 |
KV Heads | 4 |
Attention | Grouped query attention (GQA) |
Vocabulary | ~151k |
MoE Intermediate Size | ≈8.96k per expert |
Context Handling | Trained with long-context material |
Variants | Base, Instruct (DPO), reasoning-enhanced variant planned |
Inference Engine (tested) | vLLM |
Local Hardware (tested) | 1× NVIDIA H100 80GB |
VRAM Consumption (observed) | ~79GB (~97% of 80GB) |
Shards (download) | 19 |
API Pattern | OpenAI-compatible |
Key Features of Klear-46B MoE
- Sparse activation with high capacity
- 46B total parameters; about 2.5B active per token through expert routing
- Memory- and compute-friendly at inference time
- Practical local deployment
- Runs on a single H100 80GB in my test
- Works with vLLM and standard OpenAI-style APIs
- Instruction-tuned option
- Direct Preference Optimization (DPO) for helpful, aligned responses
- Reasoning-focused variant is on the roadmap
- Long-form context readiness
- Trained with long-context data
- Large vocabulary (~151k) for broader token coverage
- Balanced design
- Dense-level quality targets with lower per-token activation
- Architecture built for real workloads and measurable outputs
How Klear-46B MoE Works
Klear-46B MoE is a Transformer with MoE feed-forward blocks. The attention stack is a standard Transformer attention pipeline, while each feed-forward layer contains many experts plus routing logic that picks a small subset of them for every token.
- Experts and routing
- 256 total experts live inside the feed-forward stack
- Per layer, 8 experts activate plus one shared/router expert
- This selective activation keeps the active parameter count at roughly 2.5B per token, so per-token compute stays close to that of a much smaller dense model
- Attention and token flow
- 32 layers, ~2k hidden size, 32 attention heads, and 4 KV heads
- Grouped query attention for efficient memory and compute on KV caches
- Vocabulary and capacity
- ~151k tokens support a wide range of inputs, including long context
- MoE intermediate size around 8.96k per expert aids capacity without full dense cost
This design cuts memory needs while preserving model expressiveness, making single-GPU deployment viable for heavyweight tasks.
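To make the routing idea concrete, here is a minimal, illustrative sketch of top-k expert selection in Python. It is not Klear's actual router code; the dimensions, expert count, and weighting scheme are simplified assumptions.

```python
# Illustrative top-k MoE routing sketch (not Klear's implementation).
# A router scores every expert per token, keeps the top-k, and mixes only
# those experts' outputs, weighted by the normalized router scores.
import torch

def moe_forward(x, router, experts, k=8):
    scores = router(x)                              # (tokens, num_experts)
    topk_scores, topk_idx = scores.topk(k, dim=-1)  # keep k experts per token
    weights = torch.softmax(topk_scores, dim=-1)    # normalize over the chosen k
    out = torch.zeros_like(x)
    for slot in range(k):
        for e in topk_idx[:, slot].unique():
            mask = topk_idx[:, slot] == e           # tokens whose slot-th pick is expert e
            out[mask] += weights[mask, slot].unsqueeze(-1) * experts[e](x[mask])
    return out

# Toy usage with illustrative sizes only (hidden=16, 32 experts, top-8 routing).
hidden, num_experts = 16, 32
router = torch.nn.Linear(hidden, num_experts)
experts = torch.nn.ModuleList(torch.nn.Linear(hidden, hidden) for _ in range(num_experts))
y = moe_forward(torch.randn(4, hidden), router, experts, k=8)
```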
How to Use Klear-46B MoE
Setup and Install
I ran the instruction-tuned model on Ubuntu with one NVIDIA H100 80GB using vLLM.
- Prerequisites
- Ubuntu machine with a single H100 80GB GPU
- vLLM installed
- Network access for downloading 19 shards on first run
- Serve the model (a launch sketch follows this list)
- Start the model server with vLLM
- Expose an endpoint on port 8000
- Use one GPU for this test
- Wait for 19 shards to download on first run
- Monitor GPU
- VRAM usage reached ~79GB (~97% of the 80GB card) once the model finished loading
- Keep a terminal open to confirm the server is ready for inference
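As a reference for the serve step above, here is a minimal launch sketch. The Hugging Face repo id is a placeholder assumption (use the exact name from the official Klear model card); the flags shown are standard vLLM options.

```python
# Minimal sketch: start vLLM's OpenAI-compatible server for Klear-46B MoE.
# NOTE: the repo id below is a placeholder; replace it with the exact name
# from the official Klear model card.
import subprocess

MODEL_ID = "Kwai-Klear/Klear-46B-MoE-Instruct"  # placeholder repo id

subprocess.run([
    "vllm", "serve", MODEL_ID,
    "--port", "8000",               # OpenAI-compatible endpoint at http://localhost:8000/v1
    "--tensor-parallel-size", "1",  # single H100 80GB in this test
])  # blocks while the server runs; stop it with Ctrl+C
```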
API Pattern
Once the server was live, I used an OpenAI-compatible API pattern:
- Define a system prompt (e.g., “You are a helpful coding assistant”)
- Send user prompts to the endpoint
- Receive standard chat completions
This keeps integration simple if you already work with OpenAI-style clients and libraries.
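As a sketch, here is the same pattern with the OpenAI Python SDK, assuming the server from the previous section is listening on port 8000. vLLM ignores the API key but the client requires one, and the model name must match what the server reports.

```python
# Minimal chat call against the local vLLM endpoint using the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # key is unused locally

response = client.chat.completions.create(
    model="Kwai-Klear/Klear-46B-MoE-Instruct",  # placeholder repo id; match the served name
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a haiku about sparse experts."},
    ],
)
print(response.choices[0].message.content)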
Real-World Checks
Code Generation: p5.js Interactive HTML
I asked the model to write a single self-contained HTML file using p5.js that shows a colorful animated paper airplane moving dynamically across the screen in random directions. I also wanted responsive behavior on click events.
- Setup
- System prompt: helpful coding assistant
- User prompt: generate a single HTML file with p5.js for a dynamic paper airplane animation
- Observations
- Generation was fast under vLLM
- The produced HTML ran cleanly in the browser
- The animation behaved as requested, with convincing motion, trails, and click-triggered effects
This test checked code synthesis, adherence to single-file constraints, and correct animation behavior without extra assets.
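To reproduce this check, here is a sketch of the call described above that saves the reply for manual review in a browser (same placeholder model id as earlier).

```python
# Ask for the single-file p5.js animation and save the reply for manual review.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

reply = client.chat.completions.create(
    model="Kwai-Klear/Klear-46B-MoE-Instruct",  # placeholder repo id
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": (
            "Write a single self-contained HTML file using p5.js that shows a "
            "colorful animated paper airplane flying in random directions, "
            "with an effect triggered on mouse click."
        )},
    ],
).choices[0].message.content

# The reply may wrap the HTML in a markdown fence; strip it before opening the file.
with open("airplane.html", "w", encoding="utf-8") as f:
    f.write(reply)
```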
Multilingual Check: Wide Translation Sweep
Next, I tested translation breadth. The prompt asked for translations of the sentence “This clinging to the past is like holding on to smoke” into many widely spoken languages, along with several Scandinavian languages and a gibberish placeholder.
- Setup
- Single prompt requesting multi-language output
- No prior system-specific guidance about multilingual ability
- Observations
- The model produced translations across many languages
- Some entries showed script or wording errors
- Certain outputs looked off due to font or rendering limitations
- No explicit claim of multilingual training was provided, but coverage was reasonably broad
Expect mixed accuracy across languages that weren’t clearly mentioned in training. For production, spot-check translations with native speakers or validators.
Strategic Writing Under Pressure: Executive Memo as a Business Fable
For the final test, I challenged the model with a structured, high-pressure writing task: act as a Chief Strategy Officer at a mid-market B2B SaaS company and write an internal memo that doubles as a business fable. The memo needed to synthesize messy signals into a defensible plan, blend narrative with numbers, and show ethical judgment on AI transparency versus speed. The prompt requested a one- to two-page story set in a bustling port city, with concrete milestones, metrics, and trade-offs.
- Setup
- Narrative brief with explicit constraints and business context
- Requirements for a TL;DR, measurable targets, risk mitigation, and clear alternatives
- Observations
- The memo opened with a titled fable and clear TL;DR
- It stated the strategic crossroads, including slowing growth, margin pressure, and a new competitor
- It provided explicit trade-offs across options with head-to-head comparisons
- It included a 12-month roadmap, measurable targets, and risk controls
- The narrative tone stayed professional and confident while retaining concrete data
This task stress-tested reasoning, structure, number sense, and narrative clarity. The output had the right artifacts: TL;DR, alternatives, trade-offs, timeline, metrics, risks, and a clear voice.
Step-by-Step: From Install to First Responses
Prepare the Environment
- Use an Ubuntu machine with an NVIDIA H100 80GB GPU.
- Install vLLM following official guidance.
- Ensure Python and a recent CUDA toolkit are available.
- Open necessary ports (e.g., 8000) for local API access.
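A quick sanity check for the points above, assuming PyTorch is installed alongside vLLM:

```python
# Confirm the GPU and CUDA build are visible before serving the model.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print("GPU:", props.name)
    print("Total VRAM (GB):", round(props.total_memory / 1024**3, 1))
```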
Serve the Model
- Launch vLLM to serve Klear-46B MoE on port 8000.
- On first run, allow time for 19 shards to download.
- Verify model load completion in the terminal.
- Check VRAM usage; expect ~79GB consumption on an 80GB card.
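To confirm the endpoint is ready before sending requests, you can poll the model list route that vLLM's OpenAI-compatible server exposes (a sketch assuming the default port used above):

```python
# Poll the local endpoint until it lists the loaded model.
import time
import requests

URL = "http://localhost:8000/v1/models"

for _ in range(60):                      # wait up to ~5 minutes
    try:
        models = requests.get(URL, timeout=5).json().get("data", [])
        if models:
            print("Ready:", models[0]["id"])
            break
    except requests.RequestException:
        pass                             # server not up yet
    time.sleep(5)
```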
Call the API
- Use an OpenAI-compatible client or a simple HTTP request.
- Set a helpful system prompt tailored to your use case.
- Send a clear user prompt for the task at hand (code, translation, or strategy).
- Save and review outputs; for code, run locally to confirm behavior.
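For the "simple HTTP request" route, here is a minimal sketch with the requests library (same placeholder model id):

```python
# Raw HTTP call to the OpenAI-compatible chat completions route served by vLLM.
import requests

payload = {
    "model": "Kwai-Klear/Klear-46B-MoE-Instruct",  # placeholder repo id
    "messages": [
        {"role": "system", "content": "You are a concise strategy assistant."},
        {"role": "user", "content": "Summarize the trade-offs of MoE inference in three bullets."},
    ],
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=600)
print(resp.json()["choices"][0]["message"]["content"])
```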
Practical Notes and Observations
Performance Feel
- Startup and download take a while on first run due to 19 shards
- Once loaded, inference felt snappy via vLLM
- The model handled code generation and structured prose without hesitation
Memory and Activation
- Sparse expert activation kept actual per-token compute contained
- VRAM still peaked near the card’s limit, reflecting the model’s overall scale
- Single-GPU deployment was manageable with the tested setup
Output Quality Across Tasks
- Coding: Clean file structure and animation logic in a single HTML file
- Translation: Broad coverage with some inaccuracies; not advertised as multilingual but performed reasonably
- Strategy Memo: Clear logic, structured trade-offs, metrics, and timelines
FAQs
Is Klear-46B MoE multilingual?
The model card does not make explicit multilingual guarantees. In testing, the model produced translations across many languages with mixed accuracy. For production use, validate outputs with native speakers or external evaluators.
What are the main variants?
Two are directly referenced: a base model and an instruction-tuned model trained with DPO. A reasoning-enhanced variant is planned.
How many experts are active per token?
There are 256 total experts in the feed-forward blocks. During inference, 8 experts plus one shared/router expert activate per layer. This keeps active parameters around 2.5B per token.
What attention and architecture details stand out?
- 32 Transformer layers
- ~2k hidden size
- 32 attention heads, 4 KV heads
- Grouped query attention for efficiency
- Vocabulary around 151k tokens
- MoE intermediate size ≈8.96k per expert
What hardware did you run?
A single NVIDIA H100 80GB GPU. VRAM usage was about 79GB after the model fully loaded.
Does it work with OpenAI-style APIs?
Yes. I ran it with an OpenAI-compatible API through vLLM, which made integration straightforward with standard clients.
How does it handle long context?
The training curriculum explicitly included long-context material. While I did not find an exact context-length figure quoted, the model handled multi-part prompts and structured outputs competently during testing.
How was the model trained?
On 22 trillion tokens with a staged curriculum: broadly filtered web content first, then heavier math/coding/STEM for reasoning, and finally synthetic, reasoning-heavy, long-context data to finish.
What are the real-world strengths observed?
- High-quality code generation, following constraints like single-file output
- Structured, quantified strategic writing with a clear TL;DR, trade-offs, and a timeline
- Broad multilingual attempts with noted imperfections
Conclusion
Klear-46B MoE managed to run locally on a single H100 80GB GPU and delivered strong results across code synthesis, wide translation attempts, and structured executive writing. Its MoE design keeps per-token activation modest, with routing that wakes a limited set of experts while holding a large total capacity in reserve.
The instruction-tuned variant responded well to OpenAI-style prompts and stayed on brief for each test. Code generation was brisk and correct, the translation sweep showed breadth with some misses, and the strategic memo combined a clear TL;DR, quantifiable targets, and explicit trade-offs in a confident voice.
If you need a large model that can be hosted on one high-memory GPU, Klear-46B MoE is a practical option to evaluate. The architecture supports real workloads, the API pattern is familiar, and the outputs in coding and structured writing are strong. With a reasoning-focused variant planned, it’s positioned to handle demanding tasks where clarity, structure, and measurable outcomes matter.