Gemma 4 QAT Self-Hosting Guide 2026: 26B in 15GB
TL;DR: On June 5 2026 Google DeepMind shipped quantization-aware-training (QAT) checkpoints for all five Gemma 4 sizes, cutting VRAM roughly 72% versus the BF16 baselines. The 26B-A4B MoE now fits in ~15GB. The catch: do not hand-convert the checkpoints to Q4_0 — a scale mismatch tanks accuracy. Use Unsloth’s pre-built GGUFs for Ollama/llama.cpp and Google’s w4a16 checkpoints for vLLM.
What you’ll have running after this guide:
- Gemma 4 26B-A4B answering on a single 16GB consumer GPU via Ollama, with flash attention and a quantized KV cache.
- The same model served through llama.cpp with speculative decoding, or through vLLM for multi-request throughput.
- A clear sense of which Gemma 4 size fits your card — and which quant format to never touch.
Honest take: QAT is the single biggest local-AI quality-of-life upgrade of 2026, but 90% of the failures people report are self-inflicted: they convert the HuggingFace checkpoint themselves and lose 15 points of accuracy. Pull a pre-converted GGUF and the model is genuinely near-BF16.
What QAT actually changes
Normal post-training quantization takes a finished BF16 model and rounds the weights down to 4-bit afterward. You save memory, but the model never learned to tolerate that rounding, so quality drops — sometimes a little, sometimes a lot.
Quantization-aware training simulates the 4-bit rounding during training. The model adjusts its weights to stay accurate under low precision. The result: a 4-bit checkpoint that behaves almost like the full-precision original, instead of a lossy approximation of it.
Google released QAT checkpoints for the whole Gemma 4 lineup on June 5 2026. The advertised number is a ~72% VRAM reduction against BF16, and it holds up. The 26B-A4B model that needed a datacenter card in BF16 now loads on a 16GB laptop GPU.
The memory map
These are the QAT weight footprints before context. Add headroom for the KV cache, which grows with context length.
| Model | Params | QAT VRAM (~) | Fits on |
|---|---|---|---|
| Gemma 4 E2B | effective 2B | ~3 GB (1 GB mobile) | Any iGPU / phone |
| Gemma 4 E4B | effective 4B | ~5 GB | 6–8 GB cards |
| Gemma 4 12B | 12B dense | ~7 GB | RTX 4060 Ti 16GB, 8GB+ |
| Gemma 4 26B-A4B | 26B total / 4B active MoE | ~15 GB | 16GB card / 16GB laptop |
| Gemma 4 31B | 31B dense | ~18 GB | RTX 3090 / 24GB |
The 26B-A4B is the sweet spot for most self-hosters: it activates only ~4B parameters per token (MoE), so it runs at small-model speed while drawing on 26B of total knowledge, and it lands in 15GB. The 31B dense model is stronger but wants a 24GB card like the RTX 3090 or RTX 4090. All five share Gemma 4’s 256K-token context window, though you’ll rarely fit the full window alongside the weights on consumer hardware.
The one mistake that ruins QAT
Here’s the trap, and it’s the reason most “QAT is overrated” posts exist.
The instinct is to download Google’s QAT checkpoint from HuggingFace and run llama.cpp’s converter to get a Q4_0 GGUF. Don’t. QAT checkpoints encode their quantization scales differently from standard weights. A naive converter misreads those parameters, and you get a measurable accuracy drop — for the 26B-A4B, community testing put a hand-rolled Q4_0 at around 70.2% on a reference eval.
The fix is to use Unsloth’s Dynamic GGUFs, published as UD-Q4_K_XL. They read the QAT scales correctly and recover quality to roughly 85.6% on the same eval — while being about 200MB smaller than the broken Q4_0. Unsloth ships only the UD-Q4_K_XL variant for these models precisely because plain Q4_0 degrades them.
The rule:
- Ollama / llama.cpp → Unsloth
UD-Q4_K_XLGGUFs. - vLLM / SGLang → Google’s official
w4a16compressed-tensors checkpoints. - Never → a DIY
Q4_0conversion of a QAT checkpoint.
Path 1: Ollama (easiest)
You need Ollama 0.22 or newer — earlier versions don’t parse the Gemma 4 QAT GGUF metadata correctly. Check your version first:
$ ollama --version
ollama version is 0.22.0
Pull and run the Unsloth GGUF straight from HuggingFace:
ollama run hf.co/unsloth/gemma-4-26B-A4B-it-qat-GGUF:UD-Q4_K_XL
Two environment variables matter for fitting the model and its context on a 16GB card. Flash attention reduces memory use during attention, and a quantized KV cache shrinks the per-token context cost:
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
# restart the Ollama server after setting these
Gemma 4 has recommended sampling parameters — using the defaults will give you flatter, less coherent output. Set them in your request or a Modelfile:
temperature 1.0
top_p 0.95
top_k 64
That’s the entire setup. On a 16GB card you’ll get interactive speeds with the 26B-A4B because only ~4B parameters activate per token.
Path 2: llama.cpp (most control)
If you want speculative decoding or fine-grained offload control, go straight to llama.cpp. The -hf flag pulls the GGUF for you:
./build/bin/llama-server \
-hf unsloth/gemma-4-12B-it-qat-GGUF:UD-Q4_K_XL \
-ngl 999 \
-fa on \
--host 0.0.0.0 --port 8080
-ngl 999 offloads all layers to GPU; -fa on enables flash attention. The 12B QAT model fits comfortably in ~7GB, so even an 8GB card handles it with room for context.
The Gemma 4 GGUFs ship with a multi-token-prediction (MTP) draft head, so you can turn on speculative decoding without a separate draft model:
./build/bin/llama-server \
-hf unsloth/gemma-4-26B-A4B-it-qat-GGUF:UD-Q4_K_XL \
--spec-type draft-mtp --spec-draft-n-max 4 \
-ngl 999 -fa on
This drafts up to 4 tokens ahead per step and verifies them in one pass — a real throughput win on memory-bound consumer GPUs.
Path 3: vLLM (throughput / multi-user)
For a server handling many concurrent requests, vLLM with Google’s w4a16 compressed-tensors checkpoint is the move. vLLM reads the quantization config from the checkpoint, so you don’t pass a quant flag:
vllm serve google/gemma-4-31B-it-qat-w4a16-ct \
--max-model-len 32768 \
--gpu-memory-utilization 0.90
There’s one important asymmetry. The 26B-A4B MoE has no w4a16 checkpoint — its expert dimensions are small enough that 4-bit weights lose too much quality. For the MoE on vLLM, quantize online to int8 instead, which costs more memory but keeps quality intact:
vllm serve google/gemma-4-26B-A4B-it \
--quantization int8_per_channel_weight_only \
--max-model-len 32768
int8 here gives roughly 47% memory savings versus BF16 — less aggressive than 4-bit, but the right call for this particular model.
If your hardware can’t hold the 31B at usable context, a cloud GPU is the pragmatic option for bursty workloads. An A100 or H100 instance on RunPod runs the full 31B w4a16 with the entire 256K context for a few dollars an hour. For the consumer-card side of this decision — which GPU to actually buy for Gemma 4 QAT — runaihome.com has a dedicated Gemma 4 QAT hardware guide.
A problem I hit, and the fix
First run of the 26B-A4B on a 16GB card OOM’d at ~12K context even though the weights are only 15GB. The culprit was the default f16 KV cache eating the remaining headroom fast. Two changes fixed it:
OLLAMA_KV_CACHE_TYPE=q8_0(or-fa onplus a quantized cache in llama.cpp) roughly halves per-token context memory.- Cap context to what you actually use.
--max-model-len 16384instead of the full 256K freed enough VRAM to run without spilling to system RAM.
Self-hosting QAT models is far more often a KV-cache budgeting problem than a weights problem. The weights are tiny now; context is what fills the card.
Backend comparison
| Ollama | llama.cpp | vLLM | |
|---|---|---|---|
| Setup effort | Lowest | Medium | Medium-high |
| Best quant | Unsloth UD-Q4_K_XL | Unsloth UD-Q4_K_XL | Google w4a16 / int8 MoE |
| Concurrency | Single-user | Single-user | Many requests |
| Speculative decoding | Limited | Yes (MTP) | Yes |
| Best for | Daily local chat | Tinkering, low VRAM | Teams, API serving |
The license caveat
Gemma 4 is not released under an OSI-approved open-source license. It ships under Google’s Gemma Terms of Use, which permit free commercial use and redistribution but attach a use-restrictions policy you must pass through to downstream users. That’s more permissive than most “open weight” terms and fine for the vast majority of self-hosters — but if your project requires a true FOSS license (MIT/Apache), Gemma 4 doesn’t qualify. For Apache-2.0 alternatives at similar sizes, look at Qwen3.6 or Codestral 2.
When NOT to use Gemma 4 QAT
- You need maximum quality and have the VRAM. The BF16 or w4a16 checkpoints edge out QAT GGUFs on the hardest tasks. QAT is a memory play, not a quality upgrade over full precision.
- Your task is long-context retrieval at full 256K. On a consumer card the KV cache for 256K won’t fit alongside the weights; you’ll be capped well below the advertised window.
- You require an OSI license. See the caveat above — pick an Apache/MIT model instead.
- You want to fine-tune and re-quantize. QAT checkpoints are awkward to convert; start from the standard weights if training is your goal.
FAQ
Do I need Ollama 0.22 specifically? 0.22 is the first version that parses the Gemma 4 QAT GGUF metadata correctly. Older versions may load the file but misread quantization scales. Upgrade before pulling.
Why is the 26B-A4B so much faster than its size suggests? It’s a Mixture-of-Experts model. Only ~4B of its 26B parameters activate per token, so inference runs at roughly 4B-model speed while the full 26B is available for knowledge.
Can I just convert the HuggingFace checkpoint myself?
You can, but you shouldn’t. Naive Q4_0 conversion of a QAT checkpoint loses significant accuracy because of a scale mismatch. Use Unsloth’s UD-Q4_K_XL GGUFs or Google’s w4a16 checkpoints.
Why does vLLM treat the MoE differently?
The 26B-A4B’s expert layers are too small to survive 4-bit weight quantization, so Google didn’t publish a w4a16 version. Use int8_per_channel_weight_only online quantization for that model on vLLM.
What’s the real-world quality gap between UD-Q4_K_XL and a bad Q4_0?
On the 26B-A4B reference eval, the naive Q4_0 scored around 70.2% while Unsloth’s Dynamic GGUF recovered to about 85.6% — and was smaller on disk. That gap is the whole reason this guide exists.
Sources
- Google DeepMind — Gemma 4 with quantization-aware training
- Unsloth Documentation — Gemma 4 QAT
- unsloth/gemma-4-26B-A4B-it-qat-GGUF · Hugging Face
- vllm-project/recipes — Gemma 4
- Gemma 4 model overview — Google AI for Developers
- runaihome.com — Gemma 4 QAT Local AI Hardware Update 2026
Recommended Gear
- RTX 4060 Ti 16GB — cheapest card that fits the 12B and 26B-A4B QAT models with context headroom.
- RTX 3090 — 24GB used-market value pick for the 31B dense model.
- RTX 4090 — fastest single-card option for 31B at longer context.
Was this article helpful?
Thanks for the feedback — it helps improve future articles.
Need hands-on help?
I offer 1-on-1 technical consulting for local AI setup, GPU selection, and AI coding tool configuration — same topics covered on this site.
Book a session — $49 / hour →