Qwen3.6-35B-A3B Local Setup 2026: Ollama and 24GB VRAM
TL;DR: Qwen3.6-35B-A3B is a 35B Mixture-of-Experts model with only ~3B active parameters per token, so it loads like a 35B model but generates closer to the speed of a 3B one. At Q4_K_M it needs roughly 21 GB, so a single 24 GB GPU runs it with usable context. The catch: long context blows up the KV cache fast, and you have to manage that.
What you’ll have running after this guide:
- Qwen3.6-35B-A3B serving a local OpenAI-compatible API through Ollama
- Cline or Continue.dev wired to that endpoint for in-editor agentic coding
- A llama.cpp fallback with MoE CPU offload for GPUs smaller than 24 GB
Why this model is worth the setup
Open-weight coding models usually force a tradeoff: dense 30B+ models give you frontier-ish quality but want 20–40 GB of VRAM, while the 7–14B dense models fit a consumer card but trail on real repo-level work.
Qwen3.6-35B-A3B routes around that. It’s a sparse MoE from Alibaba’s Qwen team, released April 16, 2026 under the Apache 2.0 license — commercial use unrestricted. There are 35B total parameters spread across 256 experts (8 routed plus 1 shared active per token), but only about 3B parameters fire on any given forward pass. You pay roughly the inference cost of a 3B dense model while the full 35B weight pool supplies the knowledge.
The numbers Qwen published back that up:
| Benchmark / spec | Value |
|---|---|
| SWE-bench Verified | 73.4% |
| Terminal-Bench 2.0 | 51.5% |
| AIME 2026 | 92.6% |
| Context (native) | 262K tokens |
| Context (with YaRN) | up to 1M |
| License | Apache 2.0 |
The architecture mixes Gated DeltaNet linear attention with standard gated attention and the sparse MoE block, and it’s trained with Multi-Token Prediction. The model is also natively multimodal — it accepts text and images — which matters less for pure coding but is there if you want it.
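To make the "8 routed experts per token" idea concrete, here is a minimal sketch of generic top-k expert routing. It is an illustration of the general technique, not Qwen’s actual router: the random logits stand in for a learned gating layer, the shared expert is left out, and `route_token` is a hypothetical helper name.

```python
import math
import random

NUM_EXPERTS = 256   # expert pool size quoted above
TOP_K = 8           # routed experts chosen per token

def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts for one token and normalize their weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    # Softmax over the chosen logits only, so the routed weights sum to 1.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

rng = random.Random(0)  # fixed seed: stand-in for a learned router's output
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
routed = route_token(logits)
print(f"{len(routed)} of {NUM_EXPERTS} experts run for this token")
```

Only the chosen experts’ feed-forward weights are multiplied for that token, which is why compute tracks the active parameter count while memory tracks the total.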
If you’ve already set up Qwen3-Coder-Next locally, the workflow here is nearly identical, just with a smaller memory footprint.
Hardware requirements
MoE models front-load memory. Even though only ~3B parameters activate per token, all 35B weights sit in memory at load time. Quantization is what makes this fit on a single card. Here’s the realistic picture, drawn from community VRAM measurements and the GGUF file sizes on Hugging Face:
| Quant | Approx. size | Min. VRAM (short ctx) | Fits on |
|---|---|---|---|
| UD-Q4_K_XL (Unsloth dynamic) | ~19–20 GB | ~22 GB | 24 GB GPU |
| Q4_K_M | ~21 GB | ~24 GB | 24 GB GPU / 32 GB Mac |
| Q5_K_M | ~25 GB | ~28 GB | 32 GB Mac / 2× GPU |
| Q8_0 | ~37 GB | ~40 GB | 48 GB GPU / 64 GB Mac |
| BF16 | ~70 GB | ~80 GB | A100 80GB / 96 GB Mac |
For a single 24 GB card, such as an RTX 3090 or RTX 4090, Q4_K_M is the sweet spot (the RTX 5090 ships with 32 GB, so it has room to spare). Unsloth’s dynamic UD-Q4_K_XL quant trims another 1–2 GB by quantizing non-critical tensors harder, which buys you headroom for context. On Apple Silicon, a Mac with 32 GB unified memory handles Q4_K_M comfortably.
One thing to plan for: that ~21 GB is weights only. The KV cache grows with context length, and at the model’s full 262K window it can add tens of GB. On a 24 GB card you’ll run Q4_K_M with maybe 16K–32K of context, not the full 262K. More on that in the OOM section.
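The memory math above can be sketched in a few lines. The bits-per-weight values are rough averages assumed for illustration (real GGUF files mix tensor types), and `weight_gb` and `cache_headroom_gb` are hypothetical helper names. The overhead parameter defaults to zero, so real headroom will be somewhat lower once runtime buffers are counted.

```python
def weight_gb(total_params_billion, bits_per_weight):
    """Approximate weight size in GB (10^9 bytes) for a given quantization."""
    return total_params_billion * bits_per_weight / 8

def cache_headroom_gb(vram_gb, weights_gb, overhead_gb=0.0):
    """VRAM left for the KV cache; a negative result means it does not fit."""
    return vram_gb - weights_gb - overhead_gb

# Assumed average bits per weight, not exact figures for these files.
APPROX_BPW = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5, "BF16": 16.0}

for quant, bpw in APPROX_BPW.items():
    size = weight_gb(35, bpw)
    print(f"{quant:7s} ~{size:5.1f} GB weights, "
          f"{cache_headroom_gb(24, size):6.1f} GB left on a 24 GB card")
```

With these assumptions the results land close to the table: about 21 GB for Q4_K_M, leaving roughly 3 GB on a 24 GB card before any runtime overhead.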
If you don’t own a 24 GB card and don’t want to buy one, renting an RTX 4090 or A100 on RunPod costs a fraction of a dollar per hour and skips the local memory math entirely. For a deeper look at which GPU to actually buy for local LLMs, see the home lab GPU guides on runaihome.com.
Step 1: Install Ollama and pull the model
Ollama is the lowest-friction path. Install it (Linux shown; macOS and Windows have native installers):
curl -fsSL https://ollama.com/install.sh | sh
Confirm it’s running:
$ ollama --version
ollama version is 0.17.1
Then pull the model. The default tag is the Q4_K_M-class build:
ollama pull qwen3.6:35b-a3b
That download is about 24 GB, so give it a few minutes. Ollama publishes several tags if you want a specific footprint:
- qwen3.6:35b-a3b: default, ~24 GB
- qwen3.6:35b-a3b-q4_K_M: explicit Q4_K_M
- qwen3.6:35b-a3b-mxfp8: MXFP8, larger, higher quality
- qwen3.6:35b-a3b-bf16: full precision, needs 80 GB-class memory
Stick with the default unless you have a reason not to.
Step 2: Run it and check the output
Start an interactive session:
$ ollama run qwen3.6:35b-a3b
>>> Write a Python function that returns the nth Fibonacci number using memoization.
def fib(n, _cache={0: 0, 1: 1}):
    if n not in _cache:
        _cache[n] = fib(n - 1) + _cache.setdefault(n - 2, fib(n - 2))
    return _cache[n]
On a 24 GB GPU you should see generation in the ballpark of 40–70 tokens/second once the model is loaded — that MoE sparsity is the reason a “35B” model feels this quick. The first response is slower because the weights are loading into VRAM.
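To check the speed on your own hardware rather than trusting a ballpark, you can compute it from the timing fields Ollama’s native API returns with a finished response: `eval_count` (generated tokens) and `eval_duration` (nanoseconds). A minimal sketch, where the sample numbers are hypothetical and not a measurement:

```python
def tokens_per_second(response):
    """Generation speed from Ollama's eval_count and eval_duration (ns)."""
    duration_ns = response["eval_duration"]
    if duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return response["eval_count"] / duration_ns * 1e9

# Hypothetical values for illustration only.
sample = {"eval_count": 256, "eval_duration": 5_120_000_000}
print(f"{tokens_per_second(sample):.1f} tokens/s")
```

Running `ollama run` with the `--verbose` flag prints comparable timing statistics after each reply if you would rather not script it.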
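To run it as a server with the OpenAI-compatible API (what the editor integrations need), start it in its own terminal, since the command stays in the foreground. On Linux the install script normally registers Ollama as a system service, so the server may already be listening and you can skip this step:
ollama serve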
That exposes an endpoint at http://localhost:11434/v1. Quick sanity check with curl:
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6:35b-a3b",
    "messages": [{"role": "user", "content": "Say hello in one word."}]
  }'
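The same check can be scripted with nothing but the Python standard library. This is a sketch that assumes the endpoint and model tag shown above; `build_chat_request` and `chat` are hypothetical helper names, and the response parsing assumes the usual OpenAI-style `choices[0].message.content` shape.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # endpoint from the step above

def build_chat_request(prompt, model="qwen3.6:35b-a3b", base_url=BASE_URL):
    """Build (but do not send) a chat-completions request."""
    payload = {"model": model,
               "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt, timeout=120):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Say hello in one word."))` should print a short reply. The generous timeout covers the first call, when the weights are still loading.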
Sampling settings
Qwen ships recommended sampling parameters on the model card — check it rather than guessing, because the 3.x series separates “thinking” and “non-thinking” defaults. As a rule, lower temperature (around 0.6–0.7) suits coding and agentic work where you want determinism; greedy decoding (temperature 0) tends to cause repetition loops on this architecture, so avoid it. Set these in your client or in an Ollama Modelfile with PARAMETER temperature 0.7.
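If you prefer to bake the settings into a named model, a Modelfile only needs a FROM line and one PARAMETER line per setting. The sketch below generates that text. Only the temperature value comes from the paragraph above, and `render_modelfile` is a hypothetical helper; take any other parameters from the model card.

```python
def render_modelfile(base_model, params):
    """Return Modelfile text: a FROM line plus one PARAMETER line per setting."""
    lines = [f"FROM {base_model}"]
    lines += [f"PARAMETER {name} {value}" for name, value in params.items()]
    return "\n".join(lines) + "\n"

# Only temperature is taken from the text; extend with model-card values.
modelfile = render_modelfile("qwen3.6:35b-a3b", {"temperature": 0.7})
print(modelfile, end="")
```

Save the output as a file named Modelfile and register it with `ollama create qwen3.6-coding -f Modelfile`, where `qwen3.6-coding` is just an example name.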
Step 3: Wire it into your editor
The point of a local coding model is using it where you write code. Two solid open-source options, both covered in depth on our sister site for AI coding tools, aicoderscope.com:
Cline (VS Code agent): Install the Cline extension, open settings, choose “Ollama” as the API provider, set the base URL to http://localhost:11434, and pick qwen3.6:35b-a3b from the model list. Our Cline setup guide walks through the agentic config in detail.
Continue.dev (inline + chat): Add a block to ~/.continue/config.json:
{
  "models": [
    {
      "title": "Qwen3.6-35B-A3B",
      "provider": "ollama",
      "model": "qwen3.6:35b-a3b",
      "apiBase": "http://localhost:11434"
    }
  ]
}
Continue then uses it for chat, edits, and codebase questions. The Continue.dev + Ollama guide covers autocomplete config too.
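If your config.json already lists other models, hand-editing risks duplicating or clobbering entries. Here is a small sketch of merging the block above into an existing config. The "existing" config is a hypothetical stand-in, and `upsert_model` is an illustrative helper rather than anything Continue provides.

```python
import json

def upsert_model(config, entry):
    """Return a new config with entry added, replacing any same-title model."""
    kept = [m for m in config.get("models", [])
            if m.get("title") != entry["title"]]
    return {**config, "models": kept + [entry]}

QWEN_ENTRY = {
    "title": "Qwen3.6-35B-A3B",
    "provider": "ollama",
    "model": "qwen3.6:35b-a3b",
    "apiBase": "http://localhost:11434",
}

# Hypothetical existing config, standing in for your real config.json.
existing = {"models": [{"title": "Example model", "provider": "ollama",
                        "model": "example:latest"}]}
updated = upsert_model(existing, QWEN_ENTRY)
print(json.dumps(updated, indent=2))
```

Running it twice leaves a single Qwen entry, and every other key in the config is carried over untouched.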
For agentic tasks, the 262K context window is the real draw — you can feed a whole feature branch and let the model reason across files. Just remember the KV cache cost.
Step 4: The low-VRAM fallback (llama.cpp + CPU offload)
No 24 GB card? Because only ~3B parameters are active per token, you can keep the bulk of the MoE experts in system RAM and only stream the active path onto the GPU. llama.cpp supports this with --n-cpu-moe. One community writeup reported running this model at roughly 30 tokens/second on just 6 GB of VRAM by offloading MoE layers to CPU, given enough system RAM.
Build llama.cpp, grab a GGUF (Unsloth’s Qwen3.6-35B-A3B-GGUF repo), then:
./llama-server \
  -m Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  --n-cpu-moe 28 \
  --ctx-size 16384 \
  --host 0.0.0.0 --port 8080
Tune --n-cpu-moe to how many MoE layers you push to CPU: higher number = less VRAM, slower generation. You’ll want 32+ GB of system RAM for this to be comfortable. If you’re new to choosing quants, the GGUF quantization guide explains the Q4/Q5/Q8 tradeoffs.
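Finding the right offload value usually takes a few tries, so it can help to generate the command line instead of retyping it. A minimal sketch using only the flags shown above: the candidate values are arbitrary starting points rather than recommendations, and `llama_server_cmd` is a hypothetical helper.

```python
import shlex

def llama_server_cmd(model_path, n_cpu_moe, ctx_size=16384,
                     host="0.0.0.0", port=8080):
    """Build the llama-server argument list for one offload setting."""
    return [
        "./llama-server",
        "-m", model_path,
        "--n-cpu-moe", str(n_cpu_moe),
        "--ctx-size", str(ctx_size),
        "--host", host,
        "--port", str(port),
    ]

# Arbitrary candidates to try in turn; raise the value if VRAM runs out.
for n in (20, 28, 36):
    print(shlex.join(llama_server_cmd("Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf", n)))
```

Each list can be passed straight to `subprocess.run` if you want to launch the server from the script and time a test prompt per setting.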
Real problem: CUDA out of memory at long context
The most common failure isn’t loading the weights — it’s the context window. You pull the model, it runs fine on short prompts, then you paste a large file or turn on a long agentic session and Ollama dies with:
Error: CUDA error: out of memory
This is the KV cache, not the weights. At 24 GB, the ~21 GB of Q4_K_M weights leave only ~3 GB for cache, and the default context plus a big prompt overruns it. Three fixes that actually work:
- Cap the context. Set OLLAMA_CONTEXT_LENGTH to something your VRAM can hold, such as 16384 or 32768 rather than the full 262144:
  OLLAMA_CONTEXT_LENGTH=16384 ollama serve
- Quantize the KV cache. Enable K/V cache quantization so the cache itself uses less memory. Ollama only honors this setting when flash attention is on, so set both:
  OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
  This roughly halves cache memory for a small quality cost.
- Drop to UD-Q4_K_XL. Unsloth’s dynamic quant saves 1–2 GB of weight memory, which translates directly into more usable context on a 24 GB card.
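Why the first two fixes work comes down to how the cache scales: linearly with context length and with bytes per stored value. The sketch below uses a deliberately HYPOTHETICAL layer shape, so only the ratios mean anything; take the real values from the model’s config before trusting an absolute number. In a hybrid design like this one, the assumption is that only the standard attention layers hold a cache that grows with context.

```python
def kv_cache_gb(ctx_tokens, attn_layers, kv_heads, head_dim, bytes_per_value):
    """KV cache size in GB: keys and values for every caching layer and token."""
    values = 2 * attn_layers * kv_heads * head_dim * ctx_tokens  # 2 = K and V
    return values * bytes_per_value / 1e9

# HYPOTHETICAL shape, chosen only to show the scaling behavior.
SHAPE = dict(attn_layers=10, kv_heads=2, head_dim=256)
F16_BYTES = 2.0
Q8_0_BYTES = 34 / 32  # q8_0 stores 32 values plus a 2-byte scale in 34 bytes

base = kv_cache_gb(16_384, bytes_per_value=F16_BYTES, **SHAPE)
for ctx in (16_384, 32_768, 262_144):
    f16 = kv_cache_gb(ctx, bytes_per_value=F16_BYTES, **SHAPE)
    q8 = kv_cache_gb(ctx, bytes_per_value=Q8_0_BYTES, **SHAPE)
    print(f"ctx {ctx:>7}: {f16 / base:3.0f}x the 16K cache, "
          f"q8_0 is {q8 / f16:.0%} of f16")
```

The full 262144-token window costs 16 times the cache of a 16384-token one, while q8_0 brings any given cache down to a little over half its f16 size.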
If you’re still tight, the honest answer is more memory — a 32 GB Mac or a 48 GB GPU removes the cliff entirely.
When NOT to use Qwen3.6-35B-A3B
- You have a 16 GB or smaller GPU and want full speed. The CPU-offload path works but slows down. A dense model like Qwen3.6-27B at Q4, or a smaller coder, will give you better tokens/second without the juggling.
- You only need vision or single-shot chat. The MoE machinery is overkill for simple tasks; a 7–8B dense model is lighter and just as good for short prompts.
- You need the full 262K context on a single 24 GB card. You can’t have both Q4 weights and a 262K KV cache in 24 GB. That’s a 48 GB-class workload.
- You require greedy/deterministic decoding. This architecture repeats under temperature 0; if your pipeline demands deterministic output, test carefully or pick a different model.
For most developers with a 24 GB GPU who want a capable, Apache-licensed agentic coder running entirely offline, though, this is one of the best options available in mid-2026.
FAQ
Is Qwen3.6-35B-A3B free for commercial use? Yes. It’s released under Apache 2.0, which permits commercial use, modification, and redistribution without royalty.
How much VRAM do I actually need? About 21 GB for the Q4_K_M weights, so plan on a 24 GB GPU or 32 GB of unified memory for usable context. Unsloth’s UD-Q4_K_XL trims that to ~19–20 GB.
Why is a 35B model this fast? It’s a Mixture-of-Experts model. Only ~3B of the 35B parameters activate per token, so generation speed is closer to a 3B dense model even though all weights occupy memory.
Can I run it on a Mac? Yes. A Mac with 32 GB unified memory runs Q4_K_M well via Ollama or llama.cpp. Community reports describe a 20.9 GB Q4 build running on an M5 MacBook Pro.
What’s the difference between this and Qwen3.6-27B? The 27B is a dense model: slower per token but lighter on memory and friendlier to 16 GB GPUs. The 35B-A3B MoE is faster on 24 GB+ cards and stronger for agentic, long-context work.
Sources
- Qwen3.6-35B-A3B — official Qwen blog
- Qwen/Qwen3.6-35B-A3B — Hugging Face model card
- qwen3.6:35b-a3b — Ollama library
- Qwen3.6 35B A3B specifications and VRAM — APXML
- Qwen3.6 — How to Run Locally, Unsloth Documentation
- Run Qwen3.6-35B-A3B on 6GB VRAM using llama.cpp — Minyang Chen, Medium