
Enter your GPU VRAM and system RAM to see which models run in Ollama — at full or quantized precision — with estimated generation speed.
| Model | Ollama tag | Q4_K_M (default) | Q8_0 | FP16 full |
|---|---|---|---|---|
| nomic-embed-text 137M | nomic-embed-text | ⚡ Runs on GPU · ~2520 tok/s · 0.3 GB | ⚡ Runs on GPU · ~1890 tok/s · 0.4 GB | ⚡ Runs on GPU · ~1890 tok/s · 0.4 GB |
| mxbai-embed-large 335M | mxbai-embed-large | ⚡ Runs on GPU · ~1080 tok/s · 0.7 GB | ⚡ Runs on GPU · ~1080 tok/s · 0.7 GB | ⚡ Runs on GPU · ~945 tok/s · 0.8 GB |
| Qwen2.5 0.5B | qwen2.5:0.5b | ⚡ Runs on GPU · ~1080 tok/s · 0.7 GB | ⚡ Runs on GPU · ~756 tok/s · 1 GB | ⚡ Runs on GPU · ~504 tok/s · 1.5 GB |
| Qwen3 0.6B | qwen3:0.6b | ⚡ Runs on GPU · ~1080 tok/s · 0.7 GB | ⚡ Runs on GPU · ~687 tok/s · 1.1 GB | ⚡ Runs on GPU · ~504 tok/s · 1.5 GB |
| Llama 3.2 1B | llama3.2:1b | ⚡ Runs on GPU · ~756 tok/s · 1 GB | ⚡ Runs on GPU · ~504 tok/s · 1.5 GB | ⚡ Runs on GPU · ~302 tok/s · 2.5 GB |
| Gemma 3 1B | gemma3:1b | ⚡ Runs on GPU · ~756 tok/s · 1 GB | ⚡ Runs on GPU · ~504 tok/s · 1.5 GB | ⚡ Runs on GPU · ~302 tok/s · 2.5 GB |
| DeepSeek-R1 1.5B | deepseek-r1:1.5b | ⚡ Runs on GPU · ~630 tok/s · 1.2 GB | ⚡ Runs on GPU · ~378 tok/s · 2 GB | ⚡ Runs on GPU · ~216 tok/s · 3.5 GB |
| Qwen3 1.7B | qwen3:1.7b | ⚡ Runs on GPU · ~540 tok/s · 1.4 GB | ⚡ Runs on GPU · ~360 tok/s · 2.1 GB | ⚡ Runs on GPU · ~199 tok/s · 3.8 GB |
| Llama 3.2 3B | llama3.2:3b | ⚡ Runs on GPU · ~344 tok/s · 2.2 GB | ⚡ Runs on GPU · ~216 tok/s · 3.5 GB | ⚡ Runs on GPU · ~116 tok/s · 6.5 GB |
| Phi-4 Mini 3.8B | phi4-mini:3.8b | ⚡ Runs on GPU · ~270 tok/s · 2.8 GB | ⚡ Runs on GPU · ~176 tok/s · 4.3 GB | ⚡ Runs on GPU · ~89 tok/s · 8.5 GB |
| Qwen3 4B | qwen3:4b | ⚡ Runs on GPU · ~252 tok/s · 3 GB | ⚡ Runs on GPU · ~168 tok/s · 4.5 GB | ⚡ Runs on GPU · ~84 tok/s · 9 GB |
| Gemma 3 4B | gemma3:4b | ⚡ Runs on GPU · ~252 tok/s · 3 GB | ⚡ Runs on GPU · ~168 tok/s · 4.5 GB | ⚡ Runs on GPU · ~84 tok/s · 9 GB |
| Qwen2.5 7B | qwen2.5:7b | ⚡ Runs on GPU · ~151 tok/s · 5 GB | ⚡ Runs on GPU · ~95 tok/s · 8 GB | ⚡ Runs on GPU · ~50 tok/s · 15 GB |
| Mistral 7B | mistral:7b | ⚡ Runs on GPU · ~151 tok/s · 5 GB | ⚡ Runs on GPU · ~95 tok/s · 8 GB | ⚡ Runs on GPU · ~49 tok/s · 15.5 GB |
| DeepSeek-R1 7B | deepseek-r1:7b | ⚡ Runs on GPU · ~151 tok/s · 5 GB | ⚡ Runs on GPU · ~95 tok/s · 8 GB | ⚡ Runs on GPU · ~49 tok/s · 15.5 GB |
| Qwen2.5-Coder 7B | qwen2.5-coder:7b | ⚡ Runs on GPU · ~151 tok/s · 5 GB | ⚡ Runs on GPU · ~95 tok/s · 8 GB | ⚡ Runs on GPU · ~50 tok/s · 15 GB |
| Llama 3.1 8B | llama3.1:8b | ⚡ Runs on GPU · ~137 tok/s · 5.5 GB | ⚡ Runs on GPU · ~84 tok/s · 9 GB | ⚡ Runs on GPU · ~46 tok/s · 16.5 GB |
| Qwen3 8B | qwen3:8b | ⚡ Runs on GPU · ~137 tok/s · 5.5 GB | ⚡ Runs on GPU · ~84 tok/s · 9 GB | ⚡ Runs on GPU · ~44 tok/s · 17 GB |
| DeepSeek-R1 8B | deepseek-r1:8b | ⚡ Runs on GPU · ~137 tok/s · 5.5 GB | ⚡ Runs on GPU · ~80 tok/s · 9.5 GB | ⚡ Runs on GPU · ~43 tok/s · 17.5 GB |
| Gemma 3 12B | gemma3:12b | ⚡ Runs on GPU · ~95 tok/s · 8 GB | ⚡ Runs on GPU · ~56 tok/s · 13.5 GB | 🔀 GPU + CPU offload · ~25 tok/s · 26 GB |
| Mistral NeMo 12B | mistral-nemo:12b | ⚡ Runs on GPU · ~95 tok/s · 8 GB | ⚡ Runs on GPU · ~56 tok/s · 13.5 GB | 🔀 GPU + CPU offload · ~25 tok/s · 26 GB |
| Qwen3 14B | qwen3:14b | ⚡ Runs on GPU · ~80 tok/s · 9.5 GB | ⚡ Runs on GPU · ~49 tok/s · 15.5 GB | 🔀 GPU + CPU offload · ~20 tok/s · 29.5 GB |
| Qwen2.5 14B | qwen2.5:14b | ⚡ Runs on GPU · ~80 tok/s · 9.5 GB | ⚡ Runs on GPU · ~50 tok/s · 15 GB | 🔀 GPU + CPU offload · ~20 tok/s · 29 GB |
| DeepSeek-R1 14B | deepseek-r1:14b | ⚡ Runs on GPU · ~80 tok/s · 9.5 GB | ⚡ Runs on GPU · ~49 tok/s · 15.5 GB | 🔀 GPU + CPU offload · ~19 tok/s · 30 GB |
| Phi-4 14B | phi4:14b | ⚡ Runs on GPU · ~80 tok/s · 9.5 GB | ⚡ Runs on GPU · ~49 tok/s · 15.5 GB | 🔀 GPU + CPU offload · ~20 tok/s · 29 GB |
| Codestral 22B | codestral:22b | ⚡ Runs on GPU · ~54 tok/s · 14 GB | 🔀 GPU + CPU offload · ~31 tok/s · 23.5 GB | 🔀 GPU + CPU offload · ~8 tok/s · 46 GB |
| Gemma 3 27B | gemma3:27b | ⚡ Runs on GPU · ~44 tok/s · 17 GB | 🔀 GPU + CPU offload · ~20 tok/s · 29 GB | ✗ Out of memory · 57 GB |
| Qwen3 30B-A3B (MoE) | qwen3:30b-a3b | ⚡ Runs on GPU · ~41 tok/s · 18.5 GB | 🔀 GPU + CPU offload · ~17 tok/s · 32 GB | ✗ Out of memory · 62 GB |
| Qwen3 32B | qwen3:32b | ⚡ Runs on GPU · ~38 tok/s · 20 GB | 🔀 GPU + CPU offload · ~15 tok/s · 34 GB | ✗ Out of memory · 65 GB |
| QwQ 32B | qwq:32b | ⚡ Runs on GPU · ~38 tok/s · 20 GB | 🔀 GPU + CPU offload · ~15 tok/s · 34 GB | ✗ Out of memory · 65 GB |
| Qwen2.5 32B | qwen2.5:32b | ⚡ Runs on GPU · ~38 tok/s · 20 GB | 🔀 GPU + CPU offload · ~15 tok/s · 34 GB | ✗ Out of memory · 65 GB |
| DeepSeek-R1 32B | deepseek-r1:32b | ⚡ Runs on GPU · ~38 tok/s · 20 GB | 🔀 GPU + CPU offload · ~15 tok/s · 34 GB | ✗ Out of memory · 66 GB |
| Qwen2.5-Coder 32B | qwen2.5-coder:32b | ⚡ Runs on GPU · ~38 tok/s · 20 GB | 🔀 GPU + CPU offload · ~15 tok/s · 34 GB | ✗ Out of memory · 65 GB |
| Mixtral 8×7B (MoE) | mixtral:8x7b | 🔀 GPU + CPU offload · ~20 tok/s · 29 GB | 🔀 GPU + CPU offload · ~7 tok/s · 50 GB | ✗ Out of memory · 96 GB |
| Llama 3.1 70B | llama3.1:70b | 🔀 GPU + CPU offload · ~10 tok/s · 43 GB | ✗ Out of memory · 74 GB | ✗ Out of memory · 140 GB |
| DeepSeek-R1 70B | deepseek-r1:70b | 🔀 GPU + CPU offload · ~10 tok/s · 43 GB | ✗ Out of memory · 74 GB | ✗ Out of memory · 144 GB |
| Qwen2.5 72B | qwen2.5:72b | 🔀 GPU + CPU offload · ~9 tok/s · 44 GB | ✗ Out of memory · 76 GB | ✗ Out of memory · 148 GB |
Speed estimates use the memory-bandwidth formula: tok/s ≈ (memory bandwidth in GB/s × 0.75) / model size in GB. Actual speeds vary by driver, OS, context length, and batch size. Listed memory values include a ~1.5 GB system overhead reserve. CPU-offload speeds assume ~40 GB/s DDR5 bandwidth.
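The footnote's arithmetic is simple enough to sketch in a few lines of Python. Everything below is a back-of-envelope illustration, not Ollama internals: the function names are made up, the hardware defaults (a 24 GB card at ~1000 GB/s plus 32 GB of DDR5 at ~40 GB/s) are the configuration the table above appears to assume, and the weighted-bandwidth blend for the offload case is one plausible reading of the footnote.

```python
# Minimal sketch of the estimator described above. Hardware config and all
# names are illustrative assumptions, not Ollama internals.

OVERHEAD_GB = 1.5   # reserve for runtime overhead (CUDA context, KV cache)
EFFICIENCY = 0.75   # fraction of peak bandwidth realistically achieved

def model_size_gb(params_b: float, quant: str) -> float:
    """Approximate memory footprint, per the sizing rule at the end of the page."""
    gb_per_billion = {"q4_k_m": 0.55, "q8_0": 1.0, "fp16": 2.0}[quant]
    return params_b * gb_per_billion + OVERHEAD_GB

def estimate_tps(size_gb, vram_gb=24.0, ram_gb=32.0, gpu_bw=1000.0, cpu_bw=40.0):
    """tok/s ~= effective bandwidth * EFFICIENCY / size; None means out of memory."""
    usable_vram = vram_gb - OVERHEAD_GB
    if size_gb <= usable_vram:                 # fits entirely on the GPU
        bw = gpu_bw
    elif size_gb <= usable_vram + ram_gb:      # partial offload: weight each
        on_cpu = size_gb - usable_vram         # portion by the bandwidth of
        bw = (usable_vram * gpu_bw + on_cpu * cpu_bw) / size_gb   # its memory
    else:
        return None                            # exceeds VRAM + RAM combined
    return bw * EFFICIENCY / size_gb

print(estimate_tps(model_size_gb(8, "q4_k_m")))   # ~127 tok/s (table row: ~137 at 5.5 GB)
print(estimate_tps(model_size_gb(32, "q8_0")))    # ~15 tok/s with CPU offload
```

With those defaults the sketch lands within rounding of most rows above.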
Two numbers determine whether a model runs on your machine: how much memory it needs, and how fast your hardware can move that memory. This tool uses both to give you a practical yes/no and a realistic speed estimate.
At Q4_K_M, 24 GB comfortably fits Qwen2.5-32B, QwQ-32B, Mistral NeMo 12B, and Gemma 3 27B. The 70B class (Llama 3.1 70B, Qwen2.5 72B) needs 43-44 GB even at Q4_K_M, so it requires CPU offloading unless you have a second GPU.
When a model fits entirely in VRAM, every operation happens on the GPU, which is the fast, efficient path. Offload means some layers are stored in system RAM and must be fetched over PCIe (roughly 16-32 GB/s, depending on generation and lane count) or through unified memory during each forward pass. This works, but generation speed for the offloaded layers drops to roughly CPU memory-bandwidth levels.
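To make the penalty concrete, here is the Gemma 3 27B Q8_0 row (29 GB) worked through by hand, under the same assumed hardware and weighted-bandwidth blend as the sketch above:

```python
# Offload penalty worked by hand for a 29 GB model on an assumed 24 GB card
# (~1000 GB/s VRAM, ~40 GB/s DDR5, 1.5 GB overhead).
size_gb = 29.0
on_gpu = 24.0 - 1.5                      # 22.5 GB served at GPU speed
on_cpu = size_gb - on_gpu                # 6.5 GB served at RAM speed
blended = (on_gpu * 1000 + on_cpu * 40) / size_gb   # ~785 GB/s effective
print(blended * 0.75 / size_gb)          # ~20 tok/s, matching the table row
```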
Install Ollama from ollama.com, then open a terminal and run `ollama run <model_tag>`. For example: `ollama run qwen3:8b`. Ollama automatically downloads the Q4_K_M version unless you specify a different quantization tag like `qwen3:8b-q8_0` or `qwen3:8b-fp16`.
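Once a model is pulled, you can measure your actual generation speed and compare it against the table. Ollama exposes a local REST API on port 11434 by default; the model tag and prompt below are just examples:

```python
import json, urllib.request

# Ask the local Ollama server for a completion and read its timing stats.
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({
        "model": "qwen3:8b",    # any tag you have pulled
        "prompt": "Explain memory bandwidth in one paragraph.",
        "stream": False,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    stats = json.load(resp)

# eval_count = tokens generated; eval_duration = generation time in nanoseconds
print(stats["eval_count"] / (stats["eval_duration"] / 1e9), "tok/s")
```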
3–5 tok/s is slow for real-time chat but fine for batch tasks, summarization, or overnight jobs. Anything above 10 tok/s feels responsive for interactive use. 20+ tok/s is comfortable for coding assistants.
Apple Silicon uses shared (unified) memory: the same physical memory serves both the CPU and the GPU. The bandwidth numbers listed (150–546 GB/s depending on chip) reflect total memory bandwidth, not dedicated GPU VRAM bandwidth. Despite the lower raw bandwidth compared to an RTX 4090, the unified architecture means zero PCIe overhead and very efficient offloading.
Use the rule: Q4_K_M needs ~0.55 × params_B GB, Q8_0 needs ~1.0 × params_B GB, FP16 needs ~2.0 × params_B GB. Add 1–2 GB overhead. Example: a 13B model at Q4 needs roughly 0.55 × 13 + 1.5 ≈ 8.7 GB VRAM.
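If you want the rule for several sizes at once, a throwaway loop covers the common parameter counts; the outputs are rough estimates, not exact Ollama download sizes:

```python
# The sizing rule above, applied to common parameter counts (rough estimates).
for params_b in (7, 13, 32, 70):
    q4, q8, fp16 = (f * params_b + 1.5 for f in (0.55, 1.0, 2.0))
    print(f"{params_b:>3}B:  Q4_K_M ~{q4:.1f} GB  Q8_0 ~{q8:.1f} GB  FP16 ~{fp16:.1f} GB")
```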