Jun 30, 2026

Gemma 4 QAT Self-Hosting Guide 2026: 26B in 15GB

By AIFoss · 10 min read

gemmallmollamavllmllamacppselfhostedquantization

TL;DR: On June 5 2026 Google DeepMind shipped quantization-aware-training (QAT) checkpoints for all five Gemma 4 sizes, cutting VRAM roughly 72% versus the BF16 baselines. The 26B-A4B MoE now fits in ~15GB. The catch: do not hand-convert the checkpoints to Q4_0 — a scale mismatch tanks accuracy. Use Unsloth’s pre-built GGUFs for Ollama/llama.cpp and Google’s w4a16 checkpoints for vLLM.

What you’ll have running after this guide:

Gemma 4 26B-A4B answering on a single 16GB consumer GPU via Ollama, with flash attention and a quantized KV cache.
The same model served through llama.cpp with speculative decoding, or through vLLM for multi-request throughput.
A clear sense of which Gemma 4 size fits your card — and which quant format to never touch.

Honest take: QAT is the single biggest local-AI quality-of-life upgrade of 2026, but 90% of the failures people report are self-inflicted: they convert the HuggingFace checkpoint themselves and lose 15 points of accuracy. Pull a pre-converted GGUF and the model is genuinely near-BF16.

What QAT actually changes

Normal post-training quantization takes a finished BF16 model and rounds the weights down to 4-bit afterward. You save memory, but the model never learned to tolerate that rounding, so quality drops — sometimes a little, sometimes a lot.

Quantization-aware training simulates the 4-bit rounding during training. The model adjusts its weights to stay accurate under low precision. The result: a 4-bit checkpoint that behaves almost like the full-precision original, instead of a lossy approximation of it.

Google released QAT checkpoints for the whole Gemma 4 lineup on June 5 2026. The advertised number is a ~72% VRAM reduction against BF16, and it holds up. The 26B-A4B model that needed a datacenter card in BF16 now loads on a 16GB laptop GPU.

The memory map

These are the QAT weight footprints before context. Add headroom for the KV cache, which grows with context length.

Model	Params	QAT VRAM (~)	Fits on
Gemma 4 E2B	effective 2B	~3 GB (1 GB mobile)	Any iGPU / phone
Gemma 4 E4B	effective 4B	~5 GB	6–8 GB cards
Gemma 4 12B	12B dense	~7 GB	RTX 4060 Ti 16GB, 8GB+
Gemma 4 26B-A4B	26B total / 4B active MoE	~15 GB	16GB card / 16GB laptop
Gemma 4 31B	31B dense	~18 GB	RTX 3090 / 24GB

The 26B-A4B is the sweet spot for most self-hosters: it activates only ~4B parameters per token (MoE), so it runs at small-model speed while drawing on 26B of total knowledge, and it lands in 15GB. The 31B dense model is stronger but wants a 24GB card like the RTX 3090 or RTX 4090. All five share Gemma 4’s 256K-token context window, though you’ll rarely fit the full window alongside the weights on consumer hardware.

The one mistake that ruins QAT

Here’s the trap, and it’s the reason most “QAT is overrated” posts exist.

The instinct is to download Google’s QAT checkpoint from HuggingFace and run llama.cpp’s converter to get a Q4_0 GGUF. Don’t. QAT checkpoints encode their quantization scales differently from standard weights. A naive converter misreads those parameters, and you get a measurable accuracy drop — for the 26B-A4B, community testing put a hand-rolled Q4_0 at around 70.2% on a reference eval.

The fix is to use Unsloth’s Dynamic GGUFs, published as UD-Q4_K_XL. They read the QAT scales correctly and recover quality to roughly 85.6% on the same eval — while being about 200MB smaller than the broken Q4_0. Unsloth ships only the UD-Q4_K_XL variant for these models precisely because plain Q4_0 degrades them.

The rule:

Ollama / llama.cpp → Unsloth UD-Q4_K_XL GGUFs.
vLLM / SGLang → Google’s official w4a16 compressed-tensors checkpoints.
Never → a DIY Q4_0 conversion of a QAT checkpoint.

Path 1: Ollama (easiest)

You need Ollama 0.22 or newer — earlier versions don’t parse the Gemma 4 QAT GGUF metadata correctly. Check your version first:

$ ollama --version
ollama version is 0.22.0

Pull and run the Unsloth GGUF straight from HuggingFace:

ollama run hf.co/unsloth/gemma-4-26B-A4B-it-qat-GGUF:UD-Q4_K_XL

Two environment variables matter for fitting the model and its context on a 16GB card. Flash attention reduces memory use during attention, and a quantized KV cache shrinks the per-token context cost:

export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
# restart the Ollama server after setting these

Gemma 4 has recommended sampling parameters — using the defaults will give you flatter, less coherent output. Set them in your request or a Modelfile:

temperature 1.0
top_p 0.95
top_k 64

That’s the entire setup. On a 16GB card you’ll get interactive speeds with the 26B-A4B because only ~4B parameters activate per token.

Path 2: llama.cpp (most control)

If you want speculative decoding or fine-grained offload control, go straight to llama.cpp. The -hf flag pulls the GGUF for you:

./build/bin/llama-server \
  -hf unsloth/gemma-4-12B-it-qat-GGUF:UD-Q4_K_XL \
  -ngl 999 \
  -fa on \
  --host 0.0.0.0 --port 8080

-ngl 999 offloads all layers to GPU; -fa on enables flash attention. The 12B QAT model fits comfortably in ~7GB, so even an 8GB card handles it with room for context.

The Gemma 4 GGUFs ship with a multi-token-prediction (MTP) draft head, so you can turn on speculative decoding without a separate draft model:

./build/bin/llama-server \
  -hf unsloth/gemma-4-26B-A4B-it-qat-GGUF:UD-Q4_K_XL \
  --spec-type draft-mtp --spec-draft-n-max 4 \
  -ngl 999 -fa on

This drafts up to 4 tokens ahead per step and verifies them in one pass — a real throughput win on memory-bound consumer GPUs.

Path 3: vLLM (throughput / multi-user)

For a server handling many concurrent requests, vLLM with Google’s w4a16 compressed-tensors checkpoint is the move. vLLM reads the quantization config from the checkpoint, so you don’t pass a quant flag:

vllm serve google/gemma-4-31B-it-qat-w4a16-ct \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

There’s one important asymmetry. The 26B-A4B MoE has no w4a16 checkpoint — its expert dimensions are small enough that 4-bit weights lose too much quality. For the MoE on vLLM, quantize online to int8 instead, which costs more memory but keeps quality intact:

vllm serve google/gemma-4-26B-A4B-it \
  --quantization int8_per_channel_weight_only \
  --max-model-len 32768

int8 here gives roughly 47% memory savings versus BF16 — less aggressive than 4-bit, but the right call for this particular model.

If your hardware can’t hold the 31B at usable context, a cloud GPU is the pragmatic option for bursty workloads. An A100 or H100 instance on RunPod runs the full 31B w4a16 with the entire 256K context for a few dollars an hour. For the consumer-card side of this decision — which GPU to actually buy for Gemma 4 QAT — runaihome.com has a dedicated Gemma 4 QAT hardware guide.

A problem I hit, and the fix

First run of the 26B-A4B on a 16GB card OOM’d at ~12K context even though the weights are only 15GB. The culprit was the default f16 KV cache eating the remaining headroom fast. Two changes fixed it:

OLLAMA_KV_CACHE_TYPE=q8_0 (or -fa on plus a quantized cache in llama.cpp) roughly halves per-token context memory.
Cap context to what you actually use. --max-model-len 16384 instead of the full 256K freed enough VRAM to run without spilling to system RAM.

Self-hosting QAT models is far more often a KV-cache budgeting problem than a weights problem. The weights are tiny now; context is what fills the card.

Backend comparison

	Ollama	llama.cpp	vLLM
Setup effort	Lowest	Medium	Medium-high
Best quant	Unsloth UD-Q4_K_XL	Unsloth UD-Q4_K_XL	Google w4a16 / int8 MoE
Concurrency	Single-user	Single-user	Many requests
Speculative decoding	Limited	Yes (MTP)	Yes
Best for	Daily local chat	Tinkering, low VRAM	Teams, API serving

The license caveat

Gemma 4 is not released under an OSI-approved open-source license. It ships under Google’s Gemma Terms of Use, which permit free commercial use and redistribution but attach a use-restrictions policy you must pass through to downstream users. That’s more permissive than most “open weight” terms and fine for the vast majority of self-hosters — but if your project requires a true FOSS license (MIT/Apache), Gemma 4 doesn’t qualify. For Apache-2.0 alternatives at similar sizes, look at Qwen3.6 or Codestral 2.

When NOT to use Gemma 4 QAT

You need maximum quality and have the VRAM. The BF16 or w4a16 checkpoints edge out QAT GGUFs on the hardest tasks. QAT is a memory play, not a quality upgrade over full precision.
Your task is long-context retrieval at full 256K. On a consumer card the KV cache for 256K won’t fit alongside the weights; you’ll be capped well below the advertised window.
You require an OSI license. See the caveat above — pick an Apache/MIT model instead.
You want to fine-tune and re-quantize. QAT checkpoints are awkward to convert; start from the standard weights if training is your goal.

FAQ

Do I need Ollama 0.22 specifically? 0.22 is the first version that parses the Gemma 4 QAT GGUF metadata correctly. Older versions may load the file but misread quantization scales. Upgrade before pulling.

Why is the 26B-A4B so much faster than its size suggests? It’s a Mixture-of-Experts model. Only ~4B of its 26B parameters activate per token, so inference runs at roughly 4B-model speed while the full 26B is available for knowledge.

Can I just convert the HuggingFace checkpoint myself? You can, but you shouldn’t. Naive Q4_0 conversion of a QAT checkpoint loses significant accuracy because of a scale mismatch. Use Unsloth’s UD-Q4_K_XL GGUFs or Google’s w4a16 checkpoints.

Why does vLLM treat the MoE differently? The 26B-A4B’s expert layers are too small to survive 4-bit weight quantization, so Google didn’t publish a w4a16 version. Use int8_per_channel_weight_only online quantization for that model on vLLM.

What’s the real-world quality gap between UD-Q4_K_XL and a bad Q4_0? On the 26B-A4B reference eval, the naive Q4_0 scored around 70.2% while Unsloth’s Dynamic GGUF recovered to about 85.6% — and was smaller on disk. That gap is the whole reason this guide exists.

Sources

Recommended Gear

RTX 4060 Ti 16GB — cheapest card that fits the 12B and 26B-A4B QAT models with context headroom.
RTX 3090 — 24GB used-market value pick for the 31B dense model.
RTX 4090 — fastest single-card option for 31B at longer context.

Was this article helpful?