GGUF Quantization Guide 2026: Q4_K_M vs Q5_K_M vs Q8_0
The Q4_K_M vs Q5_K_M vs Q8_0 decision is the first one you make every time you pull a new model — and most guides reduce it to “Q4 is fine for most people.” That’s true. It’s also not the whole picture.
What follows is the breakdown with actual numbers: file sizes, VRAM budgets, perplexity deltas from FP16, and generation speeds — all from the llama.cpp official documentation and verified benchmarks. By the end, you’ll have a hardware-specific decision rule that takes 30 seconds to apply, not a table of caveats.
What quantization does
A full-precision language model stores each weight as a 16-bit floating-point number. An 8B parameter model at FP16 takes 14.96 GiB of storage and roughly the same in VRAM — out of reach for most consumer GPUs. Quantization compresses those weights by mapping them to a smaller set of representable values. A 4-bit quantized 8B model fits in 4.58 GiB.
The compression is lossy. Rounding weights to fewer bits introduces noise. The question is how much, and whether it matters for your specific workload. The short answer is that the quality gap between Q4 and Q8 is small enough that most users won’t perceive it in chat. The gap between Q3 and Q4 is where things break noticeably. Most quantization guides worry about the wrong cliff.
GGUF and the three quantization families
GGUF (GPT-Generated Unified Format) is the file format used by llama.cpp (MIT license, 112k GitHub stars, build b9297 as of May 23, 2026), Ollama, LM Studio, and nearly every other local inference tool. It replaced the older GGML format and is now the universal container for quantized open-source models.
Three quantization families exist within GGUF:
Legacy quants (Q4_0, Q4_1, Q5_0, Q5_1, Q8_0) — the original scheme. At 8 bits, the legacy Q8_0 format remains excellent and widely used. At lower bit depths, the legacy formats are largely outclassed by K-quants at the same bit count.
K-quants (Q2_K through Q6_K) — the current standard for 4–6 bit compression. These use a super-block structure: 256 weights are grouped together, and the per-block scales are themselves quantized, which recovers significant quality at the same average bit depth. The suffixes indicate mixed-precision aggression: _S (small/aggressive), _M (medium/balanced), _L (large/conservative). The _M variant is the recommended default for most use cases.
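To make the per-block idea concrete, here is a minimal sketch of block-wise absmax quantization in plain Python. It is not the actual K-quant layout (which uses 256-weight super-blocks with quantized sub-block scales). The block size of 4, the helper names, and the toy weights are all assumptions chosen for readability. It only illustrates why giving each block its own scale reduces rounding error compared with one scale for the whole tensor.

```python
def quantize_dequantize(weights, bits=4):
    """Round-trip a list of floats through symmetric absmax quantization."""
    qmax = 2 ** (bits - 1) - 1            # 7 for signed 4-bit
    peak = max(abs(w) for w in weights)
    if peak == 0:
        return list(weights)
    scale = peak / qmax                   # one scale shared by all weights passed in
    return [round(w / scale) * scale for w in weights]


def blockwise_round_trip(weights, block_size, bits=4):
    """Same round trip, but each block of `block_size` weights gets its own scale."""
    out = []
    for start in range(0, len(weights), block_size):
        out.extend(quantize_dequantize(weights[start:start + block_size], bits))
    return out


def mean_relative_error(original, restored):
    """Average of |w - w_restored| / |w| (assumes no weight is exactly zero)."""
    return sum(abs(x - y) / abs(x) for x, y in zip(original, restored)) / len(original)


# Toy tensor (hypothetical values): four small weights followed by four large ones.
weights = [0.1, -0.25, 0.3, -0.4, 10.0, -25.0, 30.0, -40.0]

global_err = mean_relative_error(weights, quantize_dequantize(weights))
block_err = mean_relative_error(weights, blockwise_round_trip(weights, block_size=4))

print(f"one global scale : mean relative error = {global_err:.3f}")
print(f"per-block scales : mean relative error = {block_err:.3f}")
```

With a single scale, the large weights dictate the step size and the small weights all round to zero. With per-block scales, the small block keeps its own resolution. Storing those extra scales costs bits, which is why K-quants quantize the scales themselves.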
IQ quants (IQ1_S through IQ4_XS) — importance-matrix-based quantization. These use lookup tables derived from calibration data that records which weights matter most for model behavior. They deliver better quality per bit than K-quants at the lowest bit levels, but require an imatrix calibration file to work correctly. Without an imatrix, IQ quant output is noticeably worse than the equivalent K-quant.
For the vast majority of decisions, the relevant question is: Q4_K_M, Q5_K_M, or Q8_0?
Q4_K_M: the default for a reason
Q4_K_M averages 4.89 bits per weight. For a Llama 3.1 8B model, that’s a 4.58 GiB file. In VRAM, add ~10–20% overhead for context buffers and metadata — budget 5.5–6 GB for an 8B model at a moderate context window.
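The file sizes quoted here follow from bits per weight. A rough sketch of the arithmetic, assuming about 8.03 billion parameters for Llama 3.1 8B (a figure not stated in this guide) and ignoring GGUF metadata; the function name is illustrative:

```python
GIB = 1024 ** 3

def gguf_size_gib(n_params, bits_per_weight):
    """Approximate GGUF file size: parameters x average bits per weight, in GiB."""
    return n_params * bits_per_weight / 8 / GIB

N_PARAMS = 8.03e9  # assumed parameter count for Llama 3.1 8B

for name, bpw in [("Q4_K_M", 4.89), ("F16", 16.00)]:
    print(f"{name:7s} {gguf_size_gib(N_PARAMS, bpw):5.2f} GiB")
```

Under that assumption, both results land within about 0.01 GiB of the 4.58 GiB and 14.96 GiB figures quoted above. The same function gives a quick estimate for any other model size or format once you know its average bits per weight.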
The internal structure is mixed-precision: Q4_K_M uses Q6_K for half of the attention value-projection and feed-forward output tensors (the layers most sensitive to quantization noise) and Q4_K for everything else. This selective higher precision is why K-quants outperform legacy Q4_0 at essentially the same file size.
Quality: Perplexity increase over FP16 is approximately +0.18 on Llama 3.1 8B. In absolute terms, this is small — around 3% degradation from the full-precision baseline. For general chat, summarization, and document Q&A tasks, Q4_K_M output is indistinguishable from FP16 to most users.
Speed: Q4_K_M generates at approximately 71.9 tokens/second on an 8B model, with prompt processing (prefill) at ~821 tokens/second. These figures come from the llama.cpp quantize README and reflect single-GPU consumer hardware performance.
Ollama defaults to Q4_K_M for most models in its registry. When you run ollama pull llama3.2, you get Q4_K_M unless you specify otherwise. That default choice is deliberate — it runs on 6 GB VRAM, produces output that satisfies nearly every general-use workload, and generates fast enough that interactive use feels responsive.
Q5_K_M: the step-up that earns its VRAM
Q5_K_M uses 5.70 bits per weight. The same 8B model becomes a 5.33 GiB file — 16% larger than Q4_K_M. VRAM requirement for an 8B model: approximately 6.5 GB.
The quality improvement is real but modest on perplexity benchmarks: +0.06 over FP16 versus Q4_K_M’s +0.18, a 67% reduction in quantization noise for 16% more memory. Whether that matters depends entirely on the task.
Where the Q4→Q5 upgrade earns its cost: coding and structured reasoning. At 4 bits, the accumulated weight noise is enough to occasionally corrupt variable names, skip edge conditions in logic chains, or produce subtly malformed JSON output. The degradation isn’t catastrophic — it’s the kind of error you might attribute to the model’s capability rather than its quantization. At 5 bits, this class of error largely disappears. If you’re running a coding assistant or using the model for structured data extraction in a pipeline, Q5_K_M is a meaningful upgrade over Q4_K_M.
Where it doesn’t matter: chat, summarization, translation, creative writing. The Q4→Q5 delta is not perceptible in output quality for these workloads.
Speed: Q5_K_M generates at approximately 67.2 tokens/second on an 8B model — about 7% slower than Q4_K_M for generation. Prompt processing drops to ~758 tokens/second.
Q8_0: near-lossless, with a speed tradeoff worth understanding
Q8_0 is the legacy 8-bit format. It averages 8.50 bits per weight, producing a 7.95 GiB file for an 8B model, roughly 1.7× the size of Q4_K_M. Budget 9–10 GB VRAM for an 8B model at normal context lengths.
The quality delta from FP16 is approximately +0.01 perplexity — effectively lossless. No task type shows a meaningful degradation compared to running the model at full precision. This is the format to use when you need a reference-quality baseline.
The counterintuitive speed profile: Q8_0 generates at approximately 50.9 tokens/second — 29% slower than Q4_K_M for token generation. That’s a significant penalty for interactive use. However, prompt processing runs at ~865 tokens/second, faster than Q4_K_M’s 821.
This is not a paradox. Prompt processing (prefill) is compute-bound: all input tokens are processed in large parallel matrix operations on the GPU, and modern tensor cores have highly optimized INT8 paths that run efficiently at 8-bit precision. Token generation (decode) is memory-bandwidth-bound: producing each new token requires reading every weight in the model once from VRAM. Moving 8.5 bits per weight through the memory bus takes longer than moving 4.89 bits. The more data the GPU has to shuttle per generated token, the slower generation gets, regardless of computational throughput.
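The bandwidth argument can be turned into a back-of-the-envelope ceiling: if every generated token requires reading all weights once, tokens per second cannot exceed memory bandwidth divided by model size. The sketch below assumes a hypothetical GPU with 500 GB/s of memory bandwidth (not the hardware behind the figures in this guide) and uses the quoted 8B file sizes as a stand-in for bytes read per token.

```python
GIB = 1024 ** 3

def decode_ceiling_tps(bandwidth_gb_s, model_size_gib):
    """Upper bound on tokens/s when each token reads every weight once from VRAM."""
    bytes_per_token = model_size_gib * GIB
    return bandwidth_gb_s * 1e9 / bytes_per_token

BANDWIDTH_GB_S = 500.0  # hypothetical card, chosen only for illustration

sizes_gib = {"Q4_K_M": 4.58, "Q8_0": 7.95, "F16": 14.96}
for name, size in sizes_gib.items():
    print(f"{name:7s} <= {decode_ceiling_tps(BANDWIDTH_GB_S, size):6.1f} tokens/s")
```

The ordering matches the measured generation speeds: Q4_K_M fastest, F16 slowest. Real throughput sits below this ceiling and does not scale exactly with file size, since dequantization cost, KV cache reads, and kernel efficiency also play a role. That is consistent with the measured Q8_0 penalty (29%) being smaller than the size ratio alone would suggest.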
Practical implication: if your workload is document ingestion, RAG retrieval, or long-context summarization (lots of prompt tokens, few generated tokens), Q8_0 costs you almost nothing in speed and gives you lossless quality. If you’re generating long responses in an interactive chat context, you’ll notice the 29% generation slowdown.
The other Q8_0 risk: VRAM crowding. A 7B Q8_0 fits on an 8 GB card, but leaves little room for context. Q4_K_M on the same card lets you run a 13B model at short contexts instead, and the 13B Q4_K_M almost always outperforms the 7B Q8_0 on output quality. Before reaching for Q8_0, ask whether that VRAM is better spent on a larger model at lower precision.
Comparison table
All figures are for Llama 3.1 8B, sourced from the llama.cpp official quantize README (build b9297). Speed benchmarked on single-GPU consumer hardware.
| Format | Bits/weight | File size | VRAM (8B) | Perplexity Δ | Gen (t/s) | Prompt (t/s) |
|---|---|---|---|---|---|---|
| Q4_K_S | 4.67 | 4.36 GiB | ~5.5 GB | ~+0.22 | 76.7 | 818.6 |
| Q4_K_M | 4.89 | 4.58 GiB | ~6 GB | ~+0.18 | 71.9 | 821.8 |
| Q5_K_M | 5.70 | 5.33 GiB | ~6.5 GB | ~+0.06 | 67.2 | 758.7 |
| Q6_K | 6.57 | 6.14 GiB | ~8 GB | ~+0.03 | ~62 | ~790 |
| Q8_0 | 8.50 | 7.95 GiB | ~10 GB | ~+0.01 | 50.9 | 865.1 |
| F16 | 16.00 | 14.96 GiB | ~18 GB | baseline | 29.2 | 923.5 |
Q4_K_M, Q5_K_M, and Q8_0 are the three rows you’ll choose between most often. Note that FP16 is actually the slowest for generation, which is not a mistake. At 16 bits per weight, memory bandwidth is the bottleneck for sequential token generation, and it saturates the GPU’s memory bus before compute becomes the limiting factor.
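The percentages quoted in the sections above can be recomputed from the table. A small sketch using only the table's figures for the three main formats; the dictionary layout and function name are illustrative.

```python
# (file size in GiB, perplexity delta vs FP16, generation tokens/s), from the table
TABLE = {
    "Q4_K_M": (4.58, 0.18, 71.9),
    "Q5_K_M": (5.33, 0.06, 67.2),
    "Q8_0":   (7.95, 0.01, 50.9),
}

def versus_q4(name):
    """Return (% larger file, % lower perplexity delta, % slower generation) vs Q4_K_M."""
    base_size, base_ppl, base_tps = TABLE["Q4_K_M"]
    size, ppl, tps = TABLE[name]
    return (
        100 * (size / base_size - 1),
        100 * (1 - ppl / base_ppl),
        100 * (1 - tps / base_tps),
    )

for name in ("Q5_K_M", "Q8_0"):
    larger, less_delta, slower = versus_q4(name)
    print(f"{name}: {larger:.0f}% larger, {less_delta:.0f}% lower perplexity delta, "
          f"{slower:.0f}% slower generation")
```

Reading the two output lines side by side makes the tradeoff explicit: Q5_K_M buys most of the available quality gain for a small size and speed cost, while Q8_0 pays a much larger cost for the remainder.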
VRAM requirements by model size
These estimates add ~15% overhead above file size to account for KV cache, context buffers, and metadata at a 4K–8K token context window. Longer contexts require additional VRAM.
| Model size | Q4_K_M | Q5_K_M | Q8_0 |
|---|---|---|---|
| 3B | ~2.5 GB | ~3 GB | ~4 GB |
| 7–8B | ~5–6 GB | ~6–7 GB | ~9–10 GB |
| 13B | ~8–9 GB | ~10–11 GB | ~15–16 GB |
| 30–34B | ~19–22 GB | ~23–26 GB | ~34–38 GB |
| 70B | ~38–42 GB | ~47–51 GB | ~72–76 GB |
For 70B models at Q8_0, you need a single A100 80GB or equivalent. At Q4_K_M, 70B runs on dual RTX 3090s (48 GB combined) or fits with some headroom on a 48 GB A6000. If you’re evaluating cloud GPU options for running 70B models at higher quantization, RunPod has A100-80GB and H100 instances well-suited for this.
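The ~15% rule can be applied to any GGUF file size. A minimal sketch with illustrative function names: the 15% figure is this guide's estimate for 4K–8K contexts, the result is converted from GiB to GB to match the units in the VRAM columns, and the card's nominal capacity is treated as GB, which is a simplification.

```python
def vram_budget_gb(file_size_gib, overhead=0.15):
    """Estimate VRAM in GB: file size plus a flat overhead for KV cache and buffers."""
    gib_to_gb = 1024 ** 3 / 1e9
    return file_size_gib * (1 + overhead) * gib_to_gb

def fits(file_size_gib, card_gb, overhead=0.15):
    """True if the estimated budget is within the card's nominal capacity."""
    return vram_budget_gb(file_size_gib, overhead) <= card_gb

# Llama 3.1 8B file sizes from the comparison table
for name, size in [("Q4_K_M", 4.58), ("Q5_K_M", 5.33), ("Q8_0", 7.95)]:
    print(f"{name:7s} ~{vram_budget_gb(size):.1f} GB  fits 8 GB card: {fits(size, 8)}")
```

For longer contexts, raise `overhead` rather than trusting the default; the KV cache grows with context length, so a flat percentage is only a starting point.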
When NOT to use each
Don’t use Q4_K_M when:
- You’re benchmarking or comparing model output quality. Quantization noise becomes a confound. Use Q8_0 as your baseline.
- Your workload is primarily code generation or structured output in a production pipeline. Q5_K_M or higher is more reliable.
- A larger model at Q4_K_M fits in your VRAM anyway. In that case, neither Q4_K_M nor Q8_0 on the 7B is the best use of the memory: run the 13B at Q4_K_M instead. Larger models at moderate quantization beat smaller models at lossless quantization almost every time.
Don’t use Q5_K_M when:
- Memory is tight and Q4_K_M is the only thing that fits. A running Q4 beats a swapping Q5 every time. Inference that spills out of VRAM into system RAM or swap is 10–100× slower than GPU inference regardless of quantization level.
- You’re doing simple chat or summarization on a 7B model. The quality delta from Q4 on non-reasoning tasks isn’t perceptible. Save the VRAM for context.
Don’t use Q8_0 when:
- You’re on a 6–8 GB VRAM card and need a long context window. Q8_0 on a 7B model uses ~9–10 GB — the context cache has nowhere to live.
- You’re generating long responses interactively. The 29% generation slowdown is noticeable in real-time use.
- The VRAM would be better spent on a larger Q4_K_M model. An 8B Q8_0 and a 13B Q4_K_M occupy roughly the same VRAM; the 13B wins on output quality for almost all tasks.
IQ quants for constrained hardware
IQ quants become relevant when even Q4_K_M is too large. For an 8B model:
- IQ2_XXS: 2.38 bits/weight, 2.23 GiB. Fits in 3 GB VRAM.
- IQ1_S: 2.00 bits/weight, 1.87 GiB. Nominally a sub-2-bit format, although the average on this model works out to 2.00 bits per weight.
Used without an importance matrix, IQ quants produce poor output, worse than Q3_K at the same bit depth. With a good imatrix file calibrated against representative text, IQ3_M and IQ4_XS deliver quality competitive with Q4_K_M at significantly lower VRAM cost. The key word is “calibrated”: the imatrix must be generated on text that matches your actual use case.
Pre-quantized IQ models with baked-in imatrix are available on Hugging Face from quantizers like bartowski and unsloth. If you need a 30–34B model on a 16 GB card and you’re willing to accept some quality degradation, imatrix-calibrated IQ quants are the way to get there. Check the file size first: going by the VRAM table above, a 30–34B model at Q4_K_M already needs ~19–22 GB, so IQ4_XS may still be too large and a lower IQ level or partial CPU offload may be required. Check for the imatrix tag on the Hugging Face model card before trusting the quantization quality.
The imatrix effect is most significant below Q5. For Q6_K and above, the bit depth is sufficient that calibration data has diminishing returns.
Hardware-specific recommendations
6 GB VRAM (GTX 1060 6GB, RTX 3060 Laptop 6GB)
- 7B/8B: Q4_K_M fits with headroom for modest context. Q5_K_M is tight.
- Recommendation: Q4_K_M on 7–8B models. Skip Q8_0 entirely.
8 GB VRAM (RTX 3070, RTX 4060 Ti)
- 7B/8B: Q5_K_M is comfortable. A 7B at Q8_0 fits but leaves ~1 GB for context; for an 8B, the 7.95 GiB Q8_0 file alone nearly fills the card.
- 13B: Q4_K_M on the edge. Viable at short contexts.
- Recommendation: Q5_K_M on 7–8B. Q4_K_M on 13B if you’re careful with context length.
12 GB VRAM (RTX 3080 12GB, RTX 4070)
- 7B/8B: Q8_0 with room for context.
- 13B: Q5_K_M comfortably.
- Recommendation: Q8_0 on 8B for quality-sensitive work; Q5_K_M on 13B as the default.
16 GB VRAM (RTX 4080, RTX 4080 Super)
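- 13B: Q8_0 fits, but at ~15–16 GB it leaves almost no room for context. Q5_K_M is the comfortable choice.
- 30B: Q4_K_M needs ~19–22 GB, so it only runs with partial CPU offload.
- Recommendation: Q5_K_M on 13B, or Q8_0 at short contexts. Treat 30B as a partial-offload option.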
24 GB VRAM (RTX 3090, RTX 4090)
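- 30B: Q4_K_M fits (~19–22 GB). Q5_K_M is tight at ~23–26 GB, and Q8_0 (~34–38 GB) does not fit.
- 70B: Requires CPU offloading. Q4_K_M with partial offload runs; pure GPU is not feasible.
- Recommendation: Q4_K_M on 30B models, or Q5_K_M if context is short. For 70B, use split CPU/GPU inference or cloud.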
If you’re evaluating hardware upgrades specifically for running 70B class models, the GPU buying guides at runaihome.com cover which consumer and prosumer cards are worth the investment for local inference workloads.
Quantizing your own models
Most users pull pre-quantized models from the Ollama registry or Hugging Face. When you need a quantization level that isn’t available — or you’re converting a fine-tuned model — llama.cpp’s llama-quantize tool handles it.
```bash
# Step 1: convert safetensors to F16 GGUF
python convert_hf_to_gguf.py path/to/model \
  --outtype f16 \
  --outfile model-f16.gguf

# Step 2: quantize to Q4_K_M (swap for Q5_K_M or Q8_0 as needed)
llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M

# Optional: generate importance matrix for IQ quants
llama-imatrix -m model-f16.gguf \
  -f calibration_data.txt \
  -o imatrix.dat

# Quantize with imatrix (use for IQ4_XS and below)
llama-quantize --imatrix imatrix.dat \
  model-f16.gguf model-IQ4_XS.gguf IQ4_XS
```
The convert_hf_to_gguf.py script lives at the root of the llama.cpp repository. The conversion expects a Hugging Face model directory, with safetensors as the standard weight format. The resulting F16 GGUF is the starting point for any quantization level.
For Ollama specifically, you can pull specific quantization variants directly:
```bash
# Pull Q8_0 variant
ollama pull llama3.2:3b-instruct-q8_0

# Pull Q5 variant where available
ollama pull qwen3:8b-q5_K_M
```
For most models, a plain ollama pull with no tag downloads Q4_K_M. Not every model in the Ollama registry has all quantization variants; check the available tags on the model page before expecting a specific format.
The actual decision
Three questions, ranked:
1. Does it fit in VRAM? If not, drop down one quantization level, or switch to IQ quants, or add CPU offloading. A model that swaps to system RAM runs 10–100× slower than one that fits entirely on the GPU.
2. What is the workload? Code generation, structured output, multi-step arithmetic: use Q5_K_M or higher. Chat, summarization, translation: Q4_K_M is fine. Benchmarking or debugging model behavior: use Q8_0.
3. Does generation speed matter? Q8_0 is 29% slower than Q4_K_M on token generation. For interactive use, that’s perceptible. For batch processing, pipelines, or mostly-read workloads, it’s irrelevant.
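The three questions can be sketched as a small function. This is one possible encoding of the rule, not a canonical one: the workload labels, the function name, and the choice to use the 8B VRAM figures from the comparison table are illustrative assumptions.

```python
# Approximate VRAM needs in GB for an 8B model, from the comparison table
VRAM_8B_GB = {"Q4_K_M": 6.0, "Q5_K_M": 6.5, "Q8_0": 10.0}

# Preferred formats per workload, best first (labels are hypothetical)
PREFERENCE = {
    "chat": ["Q4_K_M"],
    "code": ["Q5_K_M", "Q4_K_M"],
    "benchmark": ["Q8_0", "Q5_K_M", "Q4_K_M"],
}

def pick_quant(card_gb, workload, interactive=True):
    """Return the first preferred format that fits, or None if nothing fits."""
    candidates = list(PREFERENCE[workload])
    # Question 3: for non-interactive code work the Q8_0 generation slowdown
    # does not matter, so allow it as the first choice.
    if workload == "code" and not interactive:
        candidates.insert(0, "Q8_0")
    # Question 1: fitting in VRAM overrides every other preference.
    for fmt in candidates:
        if VRAM_8B_GB[fmt] <= card_gb:
            return fmt
    return None  # consider IQ quants, a smaller model, or CPU offloading

print(pick_quant(8, "chat"))
print(pick_quant(8, "code"))
print(pick_quant(12, "code", interactive=False))
print(pick_quant(4, "chat"))
```

For other model sizes, swap in the matching row from the VRAM table. The structure stays the same: workload sets the preference order, and VRAM decides how far down that order you have to go.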
The quality cliff everyone worries about — going from Q8 down to Q4 — is real but modest: roughly +0.18 perplexity versus +0.01, a 2–3% difference. The actual quality cliff is between Q3 and Q4, where perplexity shoots up and coding and reasoning tasks start to degrade meaningfully. Below Q4_K_M, you need imatrix-calibrated IQ quants to stay competitive.
Between Q4_K_M and Q8_0, you’re choosing between two options that both produce good output. Pick the one your hardware runs without starving your context window.
For how quantization choice fits into a broader local stack decision — pairing the right runner with a UI and RAG layer — see The Open-Source AI Stack in 2026 and Ollama vs vLLM 2026 for the production-scale perspective where quantization interacts with serving throughput and multi-user latency.
Sources
- llama.cpp quantize README — official quantization type table with benchmark data
- llama.cpp GitHub Releases — build b9297, May 23 2026
- Which Quantization Should I Use? — arxiv evaluation on Llama-3.1-8B-Instruct (2601.14277)
- Importance matrix (imatrix) — llama.cpp documentation
- Choosing a GGUF Model: K-Quants, IQ Variants, and Legacy Formats
- GGUF Optimization: A Technical Deep Dive for Practitioners
- AI Model Quantization Guide — Local AI Zone