Jun 20, 2026

ZAYA1-8B Review 2026: Apache 2.0 Reasoning MoE on AMD

By AIFoss · 9 min read

Tags: zaya1, llm, reasoning, local-llm, selfhosted, moe, amd

TL;DR: ZAYA1-8B is an Apache 2.0 Mixture-of-Experts reasoning model from Zyphra — 8.4B total parameters, ~760M active, trained entirely on AMD Instinct MI300X. It posts frontier-class math scores for its size, but it needs Zyphra’s custom forks of vLLM or transformers to run, and there’s no official GGUF yet. Great weights, awkward to self-host today.

	ZAYA1-8B	Qwen3-4B-Thinking-2507	Gemma-4-E4B-it
Best for	Math/reasoning density per param	Drop-in Ollama reasoning	Multimodal + easy local use
Active params	~760M (8.4B total MoE)	4B dense	~4B effective
Install complexity	High (custom vLLM/transformers fork)	Low (`ollama run`)	Low (`ollama run`)
License	Apache 2.0	Apache 2.0	Gemma terms
GGUF / Ollama	None official (June 2026)	Yes	Yes
Min VRAM (bf16)	~17 GB weights, ~47 GB w/ vLLM defaults	~9 GB	~8 GB

Honest take: If you want the best math-per-parameter open weights of mid-2026 and you’re comfortable on AMD or building from a fork, ZAYA1-8B is genuinely special. If you just want a reasoning model running tonight, `ollama run qwen3:4b` beats it on convenience by a mile.

What ZAYA1-8B actually is

Zyphra released ZAYA1-8B on May 6, 2026 under the Apache 2.0 license, with weights on Hugging Face and a technical report on arXiv. The headline isn’t the size — it’s the efficiency. This is a sparse Mixture-of-Experts model with 8.4B total parameters but only about 760M active per token. The pitch is “maximum intelligence density per parameter,” and unusually for a 2026 frontier-adjacent model, it was pretrained 100% on AMD Instinct MI300X GPUs rather than NVIDIA hardware.

That AMD detail isn’t marketing fluff. Most open-weight models are trained on NVIDIA clusters, and the software stack reflects that. ZAYA1 is a proof point that a full pretraining run — including long-context extension — works end to end on AMD silicon with Pensando networking on IBM Cloud. If you care about hardware diversity in the open ecosystem, this matters.

Three architecture changes carry the model, all part of what Zyphra calls its MoE++ stack:

Compressed Convolutional Attention (CCA) — attention that operates in a compressed latent space and achieves roughly 8× KV-cache compression versus standard attention. The KV cache is the per-token memory the model holds during generation; cutting it 8× is what makes long context affordable.
An MLP-based expert router with PID-controller bias balancing, which keeps expert utilization stable instead of collapsing onto a few experts.
Learned residual scaling, a small but real contributor to training stability at this sparsity.
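To make the router bullet more concrete, here is a toy sketch of PID-style bias balancing under a generic formulation: each expert gets a bias added to its routing logits, and a PID controller moves that bias according to the gap between the expert's observed load share and the uniform target. The gains (`kp`, `ki`, `kd`), the top-1 routing, the artificial skew, and every function name are illustrative assumptions, not Zyphra's implementation or hyperparameters.

```python
import random

def route_top1(logits, bias):
    """Pick the expert with the highest biased logit (top-1 routing, assumed)."""
    scores = [l + b for l, b in zip(logits, bias)]
    return max(range(len(scores)), key=scores.__getitem__)

def pid_balance(num_experts=4, steps=300, batch=512, kp=0.5, ki=0.1, kd=0.1, seed=0):
    """Toy PID loop: steer per-expert biases until load shares approach uniform."""
    rng = random.Random(seed)
    target = 1.0 / num_experts
    # Hypothetical skew: expert 0 systematically receives higher raw logits.
    skew = [1.5] + [0.0] * (num_experts - 1)
    bias = [0.0] * num_experts
    integral = [0.0] * num_experts
    prev_err = [0.0] * num_experts
    shares = [target] * num_experts
    for _ in range(steps):
        # Route one batch of tokens with the current biases and count the load.
        counts = [0] * num_experts
        for _ in range(batch):
            logits = [rng.gauss(0.0, 1.0) + s for s in skew]
            counts[route_top1(logits, bias)] += 1
        shares = [c / batch for c in counts]
        for e in range(num_experts):
            err = target - shares[e]      # positive when the expert is underused
            integral[e] += err            # accumulated error (the I term)
            bias[e] = kp * err + ki * integral[e] + kd * (err - prev_err[e])
            prev_err[e] = err
    return shares, bias

shares, bias = pid_balance()
print("final load shares:", [round(s, 3) for s in shares])
print("final biases:     ", [round(b, 3) for b in bias])
```

Without the controller, expert 0 would take most of the tokens in this toy setup. The integral term is what removes the steady-state imbalance, which is the usual argument for a PID loop over a purely proportional correction.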

It was trained at context lengths of up to 32K tokens, with context-parallel techniques used to push effective context further (the report describes scaling to 131K with eight ranks).
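For a feel of what 8× KV-cache compression buys at those context lengths, here is a back-of-envelope sketch using the standard KV-cache size formula for ordinary attention. The layer count, KV-head count, and head dimension are placeholder assumptions, not ZAYA1-8B's real configuration (check the model's config for the actual values); only the 8× factor and the 32K/131K context lengths come from the text above.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Standard attention KV cache: keys plus values for every layer and token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Placeholder shape, NOT ZAYA1-8B's real config.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
COMPRESSION = 8  # the roughly 8x figure quoted for CCA

for ctx in (32_768, 131_072):
    full = kv_cache_bytes(ctx, LAYERS, KV_HEADS, HEAD_DIM)
    compressed = full / COMPRESSION
    print(f"{ctx:>7} tokens: {full / 2**30:.2f} GiB standard, "
          f"{compressed / 2**30:.2f} GiB at {COMPRESSION}x compression")
```

With these placeholder numbers a single 131K-token sequence drops from 16 GiB of bf16 cache to 2 GiB. The absolute figures depend entirely on the assumed shape, but the ratio is the point.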

The benchmarks — and the asterisk

ZAYA1-8B punches well above its active-parameter count on math and reasoning. The numbers Zyphra published:

Benchmark	ZAYA1-8B	Comparison
AIME’25	91.9	with Markovian RSA test-time compute
HMMT’25	89.6	vs Claude 4.5 Sonnet 88.3
GPQA-Diamond	71.0	knowledge/reasoning
AIME’26	89.1	vs Mistral-Small-4-119B 86.4
HMMT Feb’26	71.6	vs Mistral-Small-4-119B 70.6

Read those two Mistral rows again: a model with 760M active parameters edging out a 119B model on competition math. Zyphra also reports ZAYA1-8B beating Qwen3-4B-Thinking-2507 and Gemma-4-E4B-it across math and coding categories, and staying competitive with first-generation frontier reasoning models like DeepSeek-R1-0528 and Gemini 2.5 Pro.

Here’s the asterisk you need before you quote these at work: the top-line results that approach Claude 4.5 Sonnet lean on Markovian RSA, Zyphra’s test-time-compute scheme. Markovian RSA generates parallel reasoning traces and recursively aggregates them while carrying forward only a bounded ~4K-token “tail” between rounds — so you get long effective reasoning without unbounded memory growth. That’s clever, and the constant-memory property is the real engineering win. But it means those scores reflect extra inference budget, not a single greedy pass. Without that budget, ZAYA1-8B is still strong for its size, but the gap to frontier models widens. Anyone comparing it to a vanilla pass@1 number from another model is comparing apples to oranges.
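The control flow described above can be sketched in a few lines. This is a toy under stated assumptions, not Zyphra's sampler: `generate` and `aggregate` are hypothetical stand-ins for model calls, tokens are plain strings, and the only property it demonstrates is that the state carried between rounds never exceeds the tail budget.

```python
from typing import Callable, List

def markovian_rsa(
    prompt: List[str],
    generate: Callable[[List[str], int], List[str]],
    aggregate: Callable[[List[List[str]]], List[str]],
    rounds: int = 3,
    width: int = 4,
    tail_tokens: int = 4096,
) -> List[str]:
    """Toy loop: parallel traces -> aggregate -> carry forward a bounded tail."""
    carried: List[str] = []  # the only state that survives between rounds
    for _ in range(rounds):
        context = prompt + carried
        # "Parallel" traces are generated sequentially here for simplicity.
        traces = [generate(context, i) for i in range(width)]
        merged = aggregate(traces)
        # Keep only the last tail_tokens tokens, so memory stays constant.
        carried = merged[-tail_tokens:]
    return carried

def fake_generate(context: List[str], seed: int) -> List[str]:
    # Stand-in for a model call: emits 10 dummy tokens per trace.
    return [f"t{seed}_{i}" for i in range(10)]

def fake_aggregate(traces: List[List[str]]) -> List[str]:
    # Stand-in for the aggregation step: plain concatenation.
    return [tok for trace in traces for tok in trace]

tail = markovian_rsa(["question"], fake_generate, fake_aggregate,
                     rounds=5, width=4, tail_tokens=16)
print(len(tail), tail[0], tail[-1])
```

However many rounds you run, the carried state is capped at `tail_tokens`, which is why the total reasoning length can grow while per-round memory does not. The cost shows up as `rounds × width` generation calls, which is the extra inference budget behind the headline scores.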

Self-hosting reality check (read this before you download)

This is where the review gets practical, and where most “ZAYA1-8B is the new local king” posts go quiet. The custom architecture that makes the model efficient also means it does not run on stock vLLM or stock transformers as of June 2026. The supported paths are Zyphra’s own forks.

A minimal transformers-fork load looks like this:

# Install Zyphra's transformers fork (CCA + MoE++ router not in upstream yet)
pip install "transformers @ git+https://github.com/Zyphra/transformers.git"

python - <<'PY'
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Zyphra/ZAYA1-8B"
tok = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" requires the accelerate package to be installed
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

msg = [{"role": "user", "content": "Prove there are infinitely many primes."}]
# Render the chat template to token ids and move them to the model's device
ids = tok.apply_chat_template(msg, add_generation_prompt=True, return_tensors="pt").to(model.device)
out = model.generate(ids, max_new_tokens=1024)
# Decode only the newly generated tokens, skipping the prompt
print(tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True))
PY

Expected behavior on first run: a multi-GB shard download, then a chain-of-thought style answer. Confirm the exact repo path and fork URL on the model’s Hugging Face card before copying — Zyphra’s repo names have moved during the preview window.

VRAM, honestly. At bf16, 8.4B parameters is only about 17 GB of weights. But in real testing, served through vLLM on an NVIDIA RTX 6000, the process consumed roughly 47 GB once fully loaded. That’s not the model being secretly huge; it’s vLLM’s default `gpu_memory_utilization` pre-allocating most of the card for the KV cache and paged-attention pool. You can dial that down (`--gpu-memory-utilization 0.5`) and fit it in far less, but plan for a 24 GB card minimum at bf16 and don’t be surprised when vLLM grabs everything you give it. Community quantizations (BNB and MXFP4 builds) have started appearing on Hugging Face, which bring the footprint down further, but they’re unofficial.
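The arithmetic behind that paragraph is worth seeing once. The sketch below assumes a 48 GB card and uses 0.9 as the utilization default, which is upstream vLLM's documented default for `gpu_memory_utilization`; Zyphra's fork may differ, so treat both numbers as assumptions. It does not try to reproduce the ~47 GB observation exactly, only to show why the footprint tracks the card size rather than the weight size.

```python
def weights_gb(total_params, bytes_per_param=2):
    """Weight footprint in GB (1 GB = 1e9 bytes); bf16 is 2 bytes per parameter."""
    return total_params * bytes_per_param / 1e9

def vllm_budget_gb(card_gb, gpu_memory_utilization):
    """Memory vLLM is allowed to claim: a fraction of the whole card."""
    return card_gb * gpu_memory_utilization

CARD_GB = 48  # assumed card size
w = weights_gb(8.4e9)  # all 8.4B parameters are resident, not just the active ones
print(f"bf16 weights: {w:.1f} GB")
for util in (0.9, 0.5):
    budget = vllm_budget_gb(CARD_GB, util)
    print(f"utilization {util}: budget {budget:.1f} GB, "
          f"about {budget - w:.1f} GB left for KV cache and activations")
```

Note that the MoE sparsity does not help here: only ~760M parameters are active per token, but every expert still has to sit in memory.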

No Ollama, no llama.cpp — yet. There is no official GGUF from Zyphra, and the llama.cpp feature request (issue #22776) was still open with no merged implementation when I checked. CCA and the Markovian RSA sampler are non-trivial to port. If your entire workflow is ollama run, ZAYA1-8B is not ready for you this month. Watch that issue.

If you don’t own a 24 GB+ card and just want to try the weights without buying hardware, renting is the rational move: a single 48 GB cloud GPU on RunPod costs less than a coffee per hour and saves you the fork-wrangling on a fresh image. For the dual-consumer-GPU route, two used RTX 3090 cards give you 48 GB of combined VRAM (24 GB each, so the model has to be split across both cards rather than treated as one pool); our friends at runaihome.com cover those multi-GPU home-lab builds in depth.

Where it fits versus what you already run

If you’ve read our Ollama vs LM Studio vs llama.cpp comparison, you already know the convenience hierarchy: anything with a GGUF and an Ollama tag wins on time-to-first-token. ZAYA1-8B sits outside that comfort zone right now. So the question isn’t “is it better than Qwen3-4B-Thinking?” on a benchmark sheet — on math, it is — it’s “is the setup tax worth it for your workload?”

For pure math and structured reasoning where you can afford the Markovian RSA inference budget, the intelligence-per-active-parameter is class-leading, and the 8× KV-cache compression means long reasoning chains stay cheap. For general chat, coding agents, or anything you’d wire into a RAG pipeline today, a model with a stable Ollama tag and broad tooling support is the pragmatic pick until GGUF lands. The Apache 2.0 license is unambiguously good news either way — see our open-source LLM licensing guide for why “Apache 2.0” beats “open weights” for commercial self-hosting.

When NOT to use ZAYA1-8B

You want it running in 10 minutes. No official GGUF or Ollama tag means a fork install and dependency wrangling. Pick Qwen3 or Gemma 4 instead.
You’re on a strict 8–12 GB consumer GPU with no quant. Wait for stable community quantizations or rent a bigger card.
Your use case is coding agents or tool use. ZAYA1’s strengths are math and reasoning; it wasn’t pitched as a coding-agent model, and the agent tooling expects OpenAI-style endpoints that the forks don’t all expose cleanly yet.
You need vendor support or a frozen, reproducible stack. Running off active forks means breakage between commits is a real risk.
You’re benchmark-shopping for a single-pass score. The flagship numbers use test-time compute; budget accordingly.

Should you care?

Yes — but as a signal more than a daily driver. ZAYA1-8B is the clearest 2026 evidence that (a) AMD can train serious open models end to end, and (b) architecture work like CCA and Markovian RSA can extract frontier-class reasoning from under a billion active parameters. That’s a bigger deal long-term than any single leaderboard row.

For now, treat it as a research-grade download: brilliant weights, rough edges on packaging. The moment a clean GGUF and an Ollama tag appear, this jumps from “interesting” to “install it.” Until then, it’s a fork-and-vLLM project for people who enjoy that, and a “watch this space” for everyone else.

FAQ

Is ZAYA1-8B really free for commercial use? Yes. It’s released under Apache 2.0, which permits commercial use, modification, and redistribution with attribution. That’s the most permissive tier you’ll find among capable 2026 reasoning models.

Can I run ZAYA1-8B in Ollama? Not as of June 2026. There’s no official GGUF, and the llama.cpp support request (issue #22776) is still open. You currently need Zyphra’s vLLM or transformers fork. Community quantizations exist but are unofficial.

How much VRAM does ZAYA1-8B need? The bf16 weights are ~17 GB, but vLLM with default settings pre-allocated about 47 GB on an RTX 6000. Plan for a 24 GB card at minimum, lower `--gpu-memory-utilization` to fit smaller, or use a community quant.

What is Markovian RSA and why does it matter for the scores? It’s Zyphra’s test-time-compute method that aggregates parallel reasoning traces while carrying only a ~4K-token tail between rounds, keeping memory constant. The headline AIME/HMMT scores use it, so they reflect extra inference budget rather than a single greedy pass.

Does it really beat Claude 4.5 Sonnet? On HMMT’25, Zyphra reports 89.6 for ZAYA1-8B versus 88.3 for Claude 4.5 Sonnet — with Markovian RSA. On general capability it’s not a Claude replacement; the result is narrow and math-specific, which is still remarkable for 760M active parameters.

