
Enter your GPU VRAM and system RAM to see which models run in Ollama — at full or quantized precision — with estimated generation speed.
| Model | Ollama tag | Q4_K_M (default) | Q8_0 | FP16 full |
|---|---|---|---|---|
| nomic-embed-text 137M | nomic-embed-text | ⚡ Runs on GPU · ~2520 tok/s · 0.3 GB | ⚡ Runs on GPU · ~1890 tok/s · 0.4 GB | ⚡ Runs on GPU · ~1890 tok/s · 0.4 GB |
| mxbai-embed-large 335M | mxbai-embed-large | ⚡ Runs on GPU · ~1080 tok/s · 0.7 GB | ⚡ Runs on GPU · ~1080 tok/s · 0.7 GB | ⚡ Runs on GPU · ~945 tok/s · 0.8 GB |
| Qwen2.5 0.5B | qwen2.5:0.5b | ⚡ Runs on GPU · ~1080 tok/s · 0.7 GB | ⚡ Runs on GPU · ~756 tok/s · 1 GB | ⚡ Runs on GPU · ~504 tok/s · 1.5 GB |
| Qwen3 0.6B | qwen3:0.6b | ⚡ Runs on GPU · ~1080 tok/s · 0.7 GB | ⚡ Runs on GPU · ~687 tok/s · 1.1 GB | ⚡ Runs on GPU · ~504 tok/s · 1.5 GB |
| Llama 3.2 1B | llama3.2:1b | ⚡ Runs on GPU · ~756 tok/s · 1 GB | ⚡ Runs on GPU · ~504 tok/s · 1.5 GB | ⚡ Runs on GPU · ~302 tok/s · 2.5 GB |
| Gemma 3 1B | gemma3:1b | ⚡ Runs on GPU · ~756 tok/s · 1 GB | ⚡ Runs on GPU · ~504 tok/s · 1.5 GB | ⚡ Runs on GPU · ~302 tok/s · 2.5 GB |
| DeepSeek-R1 1.5B | deepseek-r1:1.5b | ⚡ Runs on GPU · ~630 tok/s · 1.2 GB | ⚡ Runs on GPU · ~378 tok/s · 2 GB | ⚡ Runs on GPU · ~216 tok/s · 3.5 GB |
| Qwen3 1.7B | qwen3:1.7b | ⚡ Runs on GPU · ~540 tok/s · 1.4 GB | ⚡ Runs on GPU · ~360 tok/s · 2.1 GB | ⚡ Runs on GPU · ~199 tok/s · 3.8 GB |
| Llama 3.2 3B | llama3.2:3b | ⚡ Runs on GPU · ~344 tok/s · 2.2 GB | ⚡ Runs on GPU · ~216 tok/s · 3.5 GB | ⚡ Runs on GPU · ~116 tok/s · 6.5 GB |
| Phi-4 Mini 3.8B | phi4-mini:3.8b | ⚡ Runs on GPU · ~270 tok/s · 2.8 GB | ⚡ Runs on GPU · ~176 tok/s · 4.3 GB | ⚡ Runs on GPU · ~89 tok/s · 8.5 GB |
| Qwen3 4B | qwen3:4b | ⚡ Runs on GPU · ~252 tok/s · 3 GB | ⚡ Runs on GPU · ~168 tok/s · 4.5 GB | ⚡ Runs on GPU · ~84 tok/s · 9 GB |
| Gemma 3 4B | gemma3:4b | ⚡ Runs on GPU · ~252 tok/s · 3 GB | ⚡ Runs on GPU · ~168 tok/s · 4.5 GB | ⚡ Runs on GPU · ~84 tok/s · 9 GB |
| Qwen2.5 7B | qwen2.5:7b | ⚡ Runs on GPU · ~151 tok/s · 5 GB | ⚡ Runs on GPU · ~95 tok/s · 8 GB | ⚡ Runs on GPU · ~50 tok/s · 15 GB |
| Mistral 7B | mistral:7b | ⚡ Runs on GPU · ~151 tok/s · 5 GB | ⚡ Runs on GPU · ~95 tok/s · 8 GB | ⚡ Runs on GPU · ~49 tok/s · 15.5 GB |
| DeepSeek-R1 7B | deepseek-r1:7b | ⚡ Runs on GPU · ~151 tok/s · 5 GB | ⚡ Runs on GPU · ~95 tok/s · 8 GB | ⚡ Runs on GPU · ~49 tok/s · 15.5 GB |
| Qwen2.5-Coder 7B | qwen2.5-coder:7b | ⚡ Runs on GPU · ~151 tok/s · 5 GB | ⚡ Runs on GPU · ~95 tok/s · 8 GB | ⚡ Runs on GPU · ~50 tok/s · 15 GB |
| Llama 3.1 8B | llama3.1:8b | ⚡ Runs on GPU · ~137 tok/s · 5.5 GB | ⚡ Runs on GPU · ~84 tok/s · 9 GB | ⚡ Runs on GPU · ~46 tok/s · 16.5 GB |
| Qwen3 8B | qwen3:8b | ⚡ Runs on GPU · ~137 tok/s · 5.5 GB | ⚡ Runs on GPU · ~84 tok/s · 9 GB | ⚡ Runs on GPU · ~44 tok/s · 17 GB |
| DeepSeek-R1 8B | deepseek-r1:8b | ⚡ Runs on GPU · ~137 tok/s · 5.5 GB | ⚡ Runs on GPU · ~80 tok/s · 9.5 GB | ⚡ Runs on GPU · ~43 tok/s · 17.5 GB |
| Gemma 3 12B | gemma3:12b | ⚡ Runs on GPU · ~95 tok/s · 8 GB | ⚡ Runs on GPU · ~56 tok/s · 13.5 GB | 🔀 GPU + CPU offload · ~25 tok/s · 26 GB |
| Mistral NeMo 12B | mistral-nemo:12b | ⚡ Runs on GPU · ~95 tok/s · 8 GB | ⚡ Runs on GPU · ~56 tok/s · 13.5 GB | 🔀 GPU + CPU offload · ~25 tok/s · 26 GB |
| Qwen3 14B | qwen3:14b | ⚡ Runs on GPU · ~80 tok/s · 9.5 GB | ⚡ Runs on GPU · ~49 tok/s · 15.5 GB | 🔀 GPU + CPU offload · ~20 tok/s · 29.5 GB |
| Qwen2.5 14B | qwen2.5:14b | ⚡ Runs on GPU · ~80 tok/s · 9.5 GB | ⚡ Runs on GPU · ~50 tok/s · 15 GB | 🔀 GPU + CPU offload · ~20 tok/s · 29 GB |
| DeepSeek-R1 14B | deepseek-r1:14b | ⚡ Runs on GPU · ~80 tok/s · 9.5 GB | ⚡ Runs on GPU · ~49 tok/s · 15.5 GB | 🔀 GPU + CPU offload · ~19 tok/s · 30 GB |
| Phi-4 14B | phi4:14b | ⚡ Runs on GPU · ~80 tok/s · 9.5 GB | ⚡ Runs on GPU · ~49 tok/s · 15.5 GB | 🔀 GPU + CPU offload · ~20 tok/s · 29 GB |
| Codestral 22B | codestral:22b | ⚡ Runs on GPU · ~54 tok/s · 14 GB | 🔀 GPU + CPU offload · ~31 tok/s · 23.5 GB | 🔀 GPU + CPU offload · ~8 tok/s · 46 GB |
| Gemma 3 27B | gemma3:27b | ⚡ Runs on GPU · ~44 tok/s · 17 GB | 🔀 GPU + CPU offload · ~20 tok/s · 29 GB | ✗ Out of memory · 57 GB |
| Qwen3 30B-A3B (MoE) | qwen3:30b-a3b | ⚡ Runs on GPU · ~41 tok/s · 18.5 GB | 🔀 GPU + CPU offload · ~17 tok/s · 32 GB | ✗ Out of memory · 62 GB |
| Qwen3 32B | qwen3:32b | ⚡ Runs on GPU · ~38 tok/s · 20 GB | 🔀 GPU + CPU offload · ~15 tok/s · 34 GB | ✗ Out of memory · 65 GB |
| QwQ 32B | qwq:32b | ⚡ Runs on GPU · ~38 tok/s · 20 GB | 🔀 GPU + CPU offload · ~15 tok/s · 34 GB | ✗ Out of memory · 65 GB |
| Qwen2.5 32B | qwen2.5:32b | ⚡ Runs on GPU · ~38 tok/s · 20 GB | 🔀 GPU + CPU offload · ~15 tok/s · 34 GB | ✗ Out of memory · 65 GB |
| DeepSeek-R1 32B | deepseek-r1:32b | ⚡ Runs on GPU · ~38 tok/s · 20 GB | 🔀 GPU + CPU offload · ~15 tok/s · 34 GB | ✗ Out of memory · 66 GB |
| Qwen2.5-Coder 32B | qwen2.5-coder:32b | ⚡ Runs on GPU · ~38 tok/s · 20 GB | 🔀 GPU + CPU offload · ~15 tok/s · 34 GB | ✗ Out of memory · 65 GB |
| Mixtral 8×7B (MoE) | mixtral:8x7b | 🔀 GPU + CPU offload · ~20 tok/s · 29 GB | 🔀 GPU + CPU offload · ~7 tok/s · 50 GB | ✗ Out of memory · 96 GB |
| Llama 3.1 70B | llama3.1:70b | 🔀 GPU + CPU offload · ~10 tok/s · 43 GB | ✗ Out of memory · 74 GB | ✗ Out of memory · 140 GB |
| DeepSeek-R1 70B | deepseek-r1:70b | 🔀 GPU + CPU offload · ~10 tok/s · 43 GB | ✗ Out of memory · 74 GB | ✗ Out of memory · 144 GB |
| Qwen2.5 72B | qwen2.5:72b | 🔀 GPU + CPU offload · ~9 tok/s · 44 GB | ✗ Out of memory · 76 GB | ✗ Out of memory · 148 GB |
Speed estimates use the memory-bandwidth formula: tok/s ≈ (memory bandwidth in GB/s × 0.75) / model size in GB. Actual speeds vary by driver, OS, context length, and batch size. Listed memory values include a ~1.5 GB system overhead reserve. CPU-offload speeds assume ~40 GB/s DDR5 bandwidth.
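The footnote's arithmetic is simple enough to sketch in a few lines of Python. Everything below is a back-of-envelope illustration, not Ollama internals: the function names are made up, the hardware defaults (a 24 GB card at ~1000 GB/s plus 32 GB of DDR5 at ~40 GB/s) are the configuration the table above appears to assume, and the weighted-bandwidth blend for the offload case is one plausible reading of the footnote.

```python
# Minimal sketch of the estimator described above. Hardware config and all
# names are illustrative assumptions, not Ollama internals.

OVERHEAD_GB = 1.5   # reserve for runtime overhead (CUDA context, KV cache)
EFFICIENCY = 0.75   # fraction of peak bandwidth realistically achieved

def model_size_gb(params_b: float, quant: str) -> float:
    """Approximate memory footprint, per the sizing rule at the end of the page."""
    gb_per_billion = {"q4_k_m": 0.55, "q8_0": 1.0, "fp16": 2.0}[quant]
    return params_b * gb_per_billion + OVERHEAD_GB

def estimate_tps(size_gb, vram_gb=24.0, ram_gb=32.0, gpu_bw=1000.0, cpu_bw=40.0):
    """tok/s ~= effective bandwidth * EFFICIENCY / size; None means out of memory."""
    usable_vram = vram_gb - OVERHEAD_GB
    if size_gb <= usable_vram:                 # fits entirely on the GPU
        bw = gpu_bw
    elif size_gb <= usable_vram + ram_gb:      # partial offload: weight each
        on_cpu = size_gb - usable_vram         # portion by the bandwidth of
        bw = (usable_vram * gpu_bw + on_cpu * cpu_bw) / size_gb   # its memory
    else:
        return None                            # exceeds VRAM + RAM combined
    return bw * EFFICIENCY / size_gb

print(estimate_tps(model_size_gb(8, "q4_k_m")))   # ~127 tok/s (table row: ~137 at 5.5 GB)
print(estimate_tps(model_size_gb(32, "q8_0")))    # ~15 tok/s with CPU offload
```

With those defaults the sketch lands within rounding of most rows above.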
Two numbers determine whether a model runs on your machine: how much memory it needs, and how fast your hardware can move that memory. This tool uses both to give you a practical yes/no and a realistic speed estimate.
At Q4_K_M, 24 GB comfortably fits Qwen2.5-32B, QwQ-32B, Mistral NeMo 12B, and Gemma 3 27B. The 70B class (Llama 3.1 70B, Qwen2.5 72B) needs 43-44 GB even at Q4_K_M, so it requires CPU offloading unless you have a second GPU.
When a model fits entirely in VRAM, every operation happens on the GPU, which is the fast, efficient path. Offload means some layers are stored in system RAM and must be fetched over PCIe (roughly 16-32 GB/s, depending on generation and lane count) or through unified memory during each forward pass. This works, but generation speed for the offloaded layers drops to roughly CPU memory-bandwidth levels.
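To make the penalty concrete, here is the Gemma 3 27B Q8_0 row (29 GB) worked through by hand, under the same assumed hardware and weighted-bandwidth blend as the sketch above:

```python
# Offload penalty worked by hand for a 29 GB model on an assumed 24 GB card
# (~1000 GB/s VRAM, ~40 GB/s DDR5, 1.5 GB overhead).
size_gb = 29.0
on_gpu = 24.0 - 1.5                      # 22.5 GB served at GPU speed
on_cpu = size_gb - on_gpu                # 6.5 GB served at RAM speed
blended = (on_gpu * 1000 + on_cpu * 40) / size_gb   # ~785 GB/s effective
print(blended * 0.75 / size_gb)          # ~20 tok/s, matching the table row
```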
Install Ollama from ollama.com, then open a terminal and run `ollama run <model_tag>`. For example: `ollama run qwen3:8b`. Ollama automatically downloads the Q4_K_M version unless you specify a different quantization tag like `qwen3:8b-q8_0` or `qwen3:8b-fp16`.
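Once a model is pulled, you can measure your actual generation speed and compare it against the table. Ollama exposes a local REST API on port 11434 by default; the model tag and prompt below are just examples:

```python
import json, urllib.request

# Ask the local Ollama server for a completion and read its timing stats.
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({
        "model": "qwen3:8b",    # any tag you have pulled
        "prompt": "Explain memory bandwidth in one paragraph.",
        "stream": False,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    stats = json.load(resp)

# eval_count = tokens generated; eval_duration = generation time in nanoseconds
print(stats["eval_count"] / (stats["eval_duration"] / 1e9), "tok/s")
```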
3–5 tok/s is slow for real-time chat but fine for batch tasks, summarization, or overnight jobs. Anything above 10 tok/s feels responsive for interactive use. 20+ tok/s is comfortable for coding assistants.
Apple Silicon uses shared (unified) memory: the same physical memory serves both the CPU and the GPU. The bandwidth numbers listed (150–546 GB/s depending on chip) reflect total memory bandwidth, not dedicated GPU VRAM bandwidth. Despite the lower raw bandwidth compared to an RTX 4090, the unified architecture means zero PCIe overhead and very efficient offloading.
Use the rule: Q4_K_M needs ~0.55 × params_B GB, Q8_0 needs ~1.0 × params_B GB, FP16 needs ~2.0 × params_B GB. Add 1–2 GB overhead. Example: a 13B model at Q4 needs roughly 0.55 × 13 + 1.5 ≈ 8.7 GB VRAM.
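If you want the rule for several sizes at once, a throwaway loop covers the common parameter counts; the outputs are rough estimates, not exact Ollama download sizes:

```python
# The sizing rule above, applied to common parameter counts (rough estimates).
for params_b in (7, 13, 32, 70):
    q4, q8, fp16 = (f * params_b + 1.5 for f in (0.55, 1.0, 2.0))
    print(f"{params_b:>3}B:  Q4_K_M ~{q4:.1f} GB  Q8_0 ~{q8:.1f} GB  FP16 ~{fp16:.1f} GB")
```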