
Local LLM Hardware Calculator

Enter your GPU VRAM and system RAM to see which models run in Ollama — at full or quantized precision — with estimated generation speed.

Your Hardware
Example configuration: GPU VRAM 24 GB · Memory bandwidth 1008 GB/s · Total memory (VRAM + RAM) 56 GB
Result categories: runs fully on GPU · GPU + CPU offload · RAM only (slow) · out of memory
| Model | Ollama tag | Q4_K_M (default) | Q8_0 | FP16 full |
|---|---|---|---|---|
| nomic-embed-text 137M | nomic-embed-text | Runs on GPU · ~2520 tok/s · 0.3 GB | Runs on GPU · ~1890 tok/s · 0.4 GB | Runs on GPU · ~1890 tok/s · 0.4 GB |
| mxbai-embed-large 335M | mxbai-embed-large | Runs on GPU · ~1080 tok/s · 0.7 GB | Runs on GPU · ~1080 tok/s · 0.7 GB | Runs on GPU · ~945 tok/s · 0.8 GB |
| Qwen2.5 0.5B | qwen2.5:0.5b | Runs on GPU · ~1080 tok/s · 0.7 GB | Runs on GPU · ~756 tok/s · 1 GB | Runs on GPU · ~504 tok/s · 1.5 GB |
| Qwen3 0.6B | qwen3:0.6b | Runs on GPU · ~1080 tok/s · 0.7 GB | Runs on GPU · ~687 tok/s · 1.1 GB | Runs on GPU · ~504 tok/s · 1.5 GB |
| Llama 3.2 1B | llama3.2:1b | Runs on GPU · ~756 tok/s · 1 GB | Runs on GPU · ~504 tok/s · 1.5 GB | Runs on GPU · ~302 tok/s · 2.5 GB |
| Gemma 3 1B | gemma3:1b | Runs on GPU · ~756 tok/s · 1 GB | Runs on GPU · ~504 tok/s · 1.5 GB | Runs on GPU · ~302 tok/s · 2.5 GB |
| DeepSeek-R1 1.5B | deepseek-r1:1.5b | Runs on GPU · ~630 tok/s · 1.2 GB | Runs on GPU · ~378 tok/s · 2 GB | Runs on GPU · ~216 tok/s · 3.5 GB |
| Qwen3 1.7B | qwen3:1.7b | Runs on GPU · ~540 tok/s · 1.4 GB | Runs on GPU · ~360 tok/s · 2.1 GB | Runs on GPU · ~199 tok/s · 3.8 GB |
| Llama 3.2 3B | llama3.2:3b | Runs on GPU · ~344 tok/s · 2.2 GB | Runs on GPU · ~216 tok/s · 3.5 GB | Runs on GPU · ~116 tok/s · 6.5 GB |
| Phi-4 Mini 3.8B | phi4-mini:3.8b | Runs on GPU · ~270 tok/s · 2.8 GB | Runs on GPU · ~176 tok/s · 4.3 GB | Runs on GPU · ~89 tok/s · 8.5 GB |
| Qwen3 4B | qwen3:4b | Runs on GPU · ~252 tok/s · 3 GB | Runs on GPU · ~168 tok/s · 4.5 GB | Runs on GPU · ~84 tok/s · 9 GB |
| Gemma 3 4B | gemma3:4b | Runs on GPU · ~252 tok/s · 3 GB | Runs on GPU · ~168 tok/s · 4.5 GB | Runs on GPU · ~84 tok/s · 9 GB |
| Qwen2.5 7B | qwen2.5:7b | Runs on GPU · ~151 tok/s · 5 GB | Runs on GPU · ~95 tok/s · 8 GB | Runs on GPU · ~50 tok/s · 15 GB |
| Mistral 7B | mistral:7b | Runs on GPU · ~151 tok/s · 5 GB | Runs on GPU · ~95 tok/s · 8 GB | Runs on GPU · ~49 tok/s · 15.5 GB |
| DeepSeek-R1 7B | deepseek-r1:7b | Runs on GPU · ~151 tok/s · 5 GB | Runs on GPU · ~95 tok/s · 8 GB | Runs on GPU · ~49 tok/s · 15.5 GB |
| Qwen2.5-Coder 7B | qwen2.5-coder:7b | Runs on GPU · ~151 tok/s · 5 GB | Runs on GPU · ~95 tok/s · 8 GB | Runs on GPU · ~50 tok/s · 15 GB |
| Llama 3.1 8B | llama3.1:8b | Runs on GPU · ~137 tok/s · 5.5 GB | Runs on GPU · ~84 tok/s · 9 GB | Runs on GPU · ~46 tok/s · 16.5 GB |
| Qwen3 8B | qwen3:8b | Runs on GPU · ~137 tok/s · 5.5 GB | Runs on GPU · ~84 tok/s · 9 GB | Runs on GPU · ~44 tok/s · 17 GB |
| DeepSeek-R1 8B | deepseek-r1:8b | Runs on GPU · ~137 tok/s · 5.5 GB | Runs on GPU · ~80 tok/s · 9.5 GB | Runs on GPU · ~43 tok/s · 17.5 GB |
| Gemma 3 12B | gemma3:12b | Runs on GPU · ~95 tok/s · 8 GB | Runs on GPU · ~56 tok/s · 13.5 GB | 🔀 GPU + CPU offload · ~25 tok/s · 26 GB |
| Mistral NeMo 12B | mistral-nemo:12b | Runs on GPU · ~95 tok/s · 8 GB | Runs on GPU · ~56 tok/s · 13.5 GB | 🔀 GPU + CPU offload · ~25 tok/s · 26 GB |
| Qwen3 14B | qwen3:14b | Runs on GPU · ~80 tok/s · 9.5 GB | Runs on GPU · ~49 tok/s · 15.5 GB | 🔀 GPU + CPU offload · ~20 tok/s · 29.5 GB |
| Qwen2.5 14B | qwen2.5:14b | Runs on GPU · ~80 tok/s · 9.5 GB | Runs on GPU · ~50 tok/s · 15 GB | 🔀 GPU + CPU offload · ~20 tok/s · 29 GB |
| DeepSeek-R1 14B | deepseek-r1:14b | Runs on GPU · ~80 tok/s · 9.5 GB | Runs on GPU · ~49 tok/s · 15.5 GB | 🔀 GPU + CPU offload · ~19 tok/s · 30 GB |
| Phi-4 14B | phi4:14b | Runs on GPU · ~80 tok/s · 9.5 GB | Runs on GPU · ~49 tok/s · 15.5 GB | 🔀 GPU + CPU offload · ~20 tok/s · 29 GB |
| Codestral 22B | codestral:22b | Runs on GPU · ~54 tok/s · 14 GB | 🔀 GPU + CPU offload · ~31 tok/s · 23.5 GB | 🔀 GPU + CPU offload · ~8 tok/s · 46 GB |
| Gemma 3 27B | gemma3:27b | Runs on GPU · ~44 tok/s · 17 GB | 🔀 GPU + CPU offload · ~20 tok/s · 29 GB | Out of memory · 57 GB |
| Qwen3 30B-A3B (MoE) | qwen3:30b-a3b | Runs on GPU · ~41 tok/s · 18.5 GB | 🔀 GPU + CPU offload · ~17 tok/s · 32 GB | Out of memory · 62 GB |
| Qwen3 32B | qwen3:32b | Runs on GPU · ~38 tok/s · 20 GB | 🔀 GPU + CPU offload · ~15 tok/s · 34 GB | Out of memory · 65 GB |
| QwQ 32B | qwq:32b | Runs on GPU · ~38 tok/s · 20 GB | 🔀 GPU + CPU offload · ~15 tok/s · 34 GB | Out of memory · 65 GB |
| Qwen2.5 32B | qwen2.5:32b | Runs on GPU · ~38 tok/s · 20 GB | 🔀 GPU + CPU offload · ~15 tok/s · 34 GB | Out of memory · 65 GB |
| DeepSeek-R1 32B | deepseek-r1:32b | Runs on GPU · ~38 tok/s · 20 GB | 🔀 GPU + CPU offload · ~15 tok/s · 34 GB | Out of memory · 66 GB |
| Qwen2.5-Coder 32B | qwen2.5-coder:32b | Runs on GPU · ~38 tok/s · 20 GB | 🔀 GPU + CPU offload · ~15 tok/s · 34 GB | Out of memory · 65 GB |
| Mixtral 8×7B (MoE) | mixtral:8x7b | 🔀 GPU + CPU offload · ~20 tok/s · 29 GB | 🔀 GPU + CPU offload · ~7 tok/s · 50 GB | Out of memory · 96 GB |
| Llama 3.1 70B | llama3.1:70b | 🔀 GPU + CPU offload · ~10 tok/s · 43 GB | Out of memory · 74 GB | Out of memory · 140 GB |
| DeepSeek-R1 70B | deepseek-r1:70b | 🔀 GPU + CPU offload · ~10 tok/s · 43 GB | Out of memory · 74 GB | Out of memory · 144 GB |
| Qwen2.5 72B | qwen2.5:72b | 🔀 GPU + CPU offload · ~9 tok/s · 44 GB | Out of memory · 76 GB | Out of memory · 148 GB |

Speed estimates use the memory-bandwidth formula: TPS ≈ (BW × 0.75) / model_size. Actual speeds vary by driver, OS, context length, and batch size. VRAM values include a ~1.5 GB system overhead reserve. CPU offload speeds assume ~40 GB/s DDR5 bandwidth.

How the Calculator Works

Two numbers determine whether a model runs on your machine: how much memory it needs, and how fast your hardware can move that memory. This tool uses both to give you a practical yes/no and a realistic speed estimate.

VRAM requirement by quantization
At FP16 (full precision), a model needs ~2 bytes per parameter. Q8_0 halves that to ~1 byte. Q4_K_M, the Ollama default, brings it to ~0.55 bytes. So a 7B model needs about 14 GB at FP16, ~7 GB at Q8, and roughly 4 GB of weights at Q4 (about 5 GB once the overhead reserve is added).
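As a rough sketch, that rule can be written as a small helper. The bytes-per-parameter values below are the approximations from this section, not exact sizes for any particular GGUF file:

```python
# Approximate bytes per parameter for each quantization level (weights only).
BYTES_PER_PARAM = {"fp16": 2.0, "q8_0": 1.0, "q4_k_m": 0.55}

def weight_size_gb(params_billion: float, quant: str = "q4_k_m") -> float:
    """Rough size of the model weights in GB for a given quantization."""
    return params_billion * BYTES_PER_PARAM[quant]

# Example: a 7B model at each precision
for q in ("fp16", "q8_0", "q4_k_m"):
    print(q, round(weight_size_gb(7, q), 1), "GB")   # ~14.0, ~7.0, ~3.9 GB of weights
```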
GPU + CPU offload
When a model does not fully fit in VRAM, Ollama offloads the remaining layers to system RAM. This works but cuts speed significantly — often to 3–10 tok/s. More GPU layers = faster generation.
Speed formula
Token generation speed is memory-bandwidth bound: TPS ≈ (bandwidth × 0.75) / model_size_GB. A faster GPU does not help much if its memory bandwidth is the same — which is why the RTX 4090 and 3090 are close in speed for small models.
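In code, the same bandwidth rule is a one-liner. The 0.75 efficiency factor and the example numbers (1008 GB/s, a ~5.5 GB Q4 8B model) come from this page; real throughput also depends on the factors listed in the footnote:

```python
def estimate_tps(bandwidth_gb_s: float, model_size_gb: float, efficiency: float = 0.75) -> float:
    """Memory-bandwidth-bound decode speed: each generated token reads the full weights once."""
    return bandwidth_gb_s * efficiency / model_size_gb

# RTX 4090-class bandwidth (1008 GB/s) with an 8B model at Q4_K_M (the table's listed 5.5 GB)
print(round(estimate_tps(1008, 5.5)))  # ~137 tok/s, matching the table above
```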
System RAM overhead
The calculator reserves 4 GB of system RAM for OS use and 1.5 GB of VRAM for GPU overhead and KV cache. Increase system RAM to allow more CPU offloading for large models.
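A minimal sketch of how such a fit check can work, using the reserves described above (1.5 GB of VRAM, 4 GB of system RAM); the live calculator's exact thresholds may differ slightly:

```python
def fit_status(weights_gb: float, vram_gb: float, ram_gb: float,
               vram_reserve: float = 1.5, ram_reserve: float = 4.0) -> str:
    """Classify where a model (weight size only, excluding reserves) can run."""
    usable_vram = max(vram_gb - vram_reserve, 0)
    usable_ram = max(ram_gb - ram_reserve, 0)
    if weights_gb <= usable_vram:
        return "Runs on GPU"
    if weights_gb <= usable_vram + usable_ram:
        return "GPU + CPU offload" if usable_vram > 0 else "RAM only (slow)"
    return "Out of memory"

# Example: Gemma 3 27B at Q4 (~0.55 × 27 ≈ 14.9 GB of weights) on a 24 GB GPU with 32 GB RAM
print(fit_status(14.9, vram_gb=24, ram_gb=32))  # Runs on GPU
```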
Why Q4_K_M is the sweet spot
Benchmarks consistently show Q4_K_M loses less than 2–3% on standard evals compared to FP16, while cutting VRAM use by 72%. For most tasks, the quality drop is not noticeable in practice.
MoE models
Mixture-of-Experts models (like Qwen3 235B-A22B or Mixtral 8×7B) have a large total parameter count but only activate a fraction per token. The listed VRAM includes all weights — you still need to load the full model, but compute cost is much lower.

Common Questions

My GPU has 24 GB — what is the largest model I can run at Q4?

At Q4_K_M, 24 GB comfortably fits Gemma 3 27B and the 32B class (Qwen2.5 32B, QwQ 32B) at roughly 17–20 GB, along with everything smaller such as Mistral NeMo 12B. The 70B class (Llama 3.1 70B, Qwen2.5 72B) needs over 40 GB even at Q4, so it will require CPU offloading or a second GPU.

What is the difference between 'runs on GPU' and 'GPU + CPU offload'?

When a model fits entirely in VRAM, every operation happens on the GPU, which is fast and efficient. Offload means some layers are kept in system RAM and are processed at system-memory speed (roughly 40 GB/s DDR5) or fetched over PCIe during each forward pass. It works, but generation speed drops to roughly CPU-bandwidth levels for the offloaded layers.
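One way to approximate the slowdown is to blend the two bandwidths in proportion to where the weight bytes live. This is an inferred sketch rather than the page's published formula, but it closely matches the offload numbers in the table above (the 40 GB/s DDR5 figure is from the footnote):

```python
def offload_tps(model_gb: float, vram_gb: float, gpu_bw: float = 1008.0,
                ram_bw: float = 40.0, efficiency: float = 0.75,
                vram_reserve: float = 1.5) -> float:
    """Estimate tok/s when part of the model spills from VRAM into system RAM."""
    in_vram = min(model_gb, max(vram_gb - vram_reserve, 0))
    in_ram = model_gb - in_vram
    # Effective bandwidth: weighted average over where the bytes actually sit.
    effective_bw = (in_vram * gpu_bw + in_ram * ram_bw) / model_gb
    return effective_bw * efficiency / model_gb

# Gemma 3 12B at FP16 (~26 GB listed) on a 24 GB / 1008 GB/s GPU
print(round(offload_tps(26, 24)))  # ~25 tok/s, as in the table
```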

How do I run a model in Ollama?

Install Ollama from ollama.com, then open a terminal and run: ollama run <model_tag>. For example: ollama run qwen3:8b. Ollama automatically downloads the Q4_K_M version unless you specify a different quantization tag like qwen3:8b-q8_0 or qwen3:8b-fp16.
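Beyond the CLI, the same local model can be called programmatically: Ollama serves an HTTP API on localhost:11434 by default. The snippet below is a minimal non-streaming sketch; the model tag and prompt are placeholders for whatever you have pulled.

```python
import json
import urllib.request

# Minimal call to the local Ollama server's /api/generate endpoint (default port 11434).
payload = json.dumps({
    "model": "qwen3:8b",          # any tag you have pulled, e.g. qwen3:8b-q8_0
    "prompt": "Explain KV cache in one sentence.",
    "stream": False,              # return one JSON object instead of a token stream
}).encode()

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```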

The calculator shows 3 tok/s for CPU offload — is that actually usable?

3–5 tok/s is slow for real-time chat but fine for batch tasks, summarization, or overnight jobs. Anything above 10 tok/s feels responsive for interactive use. 20+ tok/s is comfortable for coding assistants.

Why are Apple M-series chips listed with lower bandwidth than RTX cards?

Apple Silicon uses shared (unified) memory: the same physical memory serves both the CPU and the GPU. The bandwidth numbers listed (150–546 GB/s depending on chip) are for that unified memory pool rather than a dedicated VRAM bus. Despite the lower raw bandwidth compared to an RTX 4090, the whole model sits in one memory pool, so there is no PCIe transfer and no separate GPU-to-CPU offload penalty.

My model is not listed here. How do I estimate my GPU requirements?

Use the rule: Q4_K_M needs ~0.55 × params_B GB, Q8_0 needs ~1.0 × params_B GB, FP16 needs ~2.0 × params_B GB. Add 1–2 GB overhead. Example: a 13B model at Q4 needs roughly 0.55 × 13 + 1.5 ≈ 8.7 GB VRAM.