Qwen3.6-35B-A3B Local Setup 2026: Ollama and 24GB VRAM


TL;DR: Qwen3.6-35B-A3B is a 35B Mixture-of-Experts model with only ~3B active parameters per token, so it loads like a 35B model but runs at the speed of a 3B one. At Q4_K_M it needs roughly 21 GB, so a single 24 GB GPU runs it with usable context. The catch: long context blows up the KV cache fast, and you have to manage that.

What you’ll have running after this guide:

  • Qwen3.6-35B-A3B serving a local OpenAI-compatible API through Ollama
  • Cline or Continue.dev wired to that endpoint for in-editor agentic coding
  • A llama.cpp fallback with MoE CPU offload for GPUs smaller than 24 GB

Why this model is worth the setup

Open-weight coding models usually force a tradeoff: dense 30B+ models give you frontier-ish quality but want 20–40 GB of VRAM, while the 7–14B dense models fit a consumer card but trail on real repo-level work.

Qwen3.6-35B-A3B routes around that. It’s a sparse MoE from Alibaba’s Qwen team, released April 16, 2026 under the Apache 2.0 license — commercial use unrestricted. There are 35B total parameters spread across 256 experts (8 routed plus 1 shared active per token), but only about 3B parameters fire on any given forward pass. You pay roughly the inference cost of a 3B dense model while the full 35B weight pool supplies the knowledge.
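To make the routing idea concrete, here is a toy sketch of top-k expert selection in plain Python. It is not Qwen’s actual router: the random logits, the softmax-then-top-k order, and the function name route_token are illustrative assumptions. Only the 256-expert and 8-routed figures come from the paragraph above.

```python
import math
import random

NUM_EXPERTS = 256   # routed experts per MoE layer (figure from the article)
TOP_K = 8           # routed experts active per token (figure from the article)

def route_token(logits, k=TOP_K):
    """Pick the k highest-scoring experts and renormalize their weights.

    Toy illustration only; the real router's details are assumptions here.
    """
    # Softmax over all expert logits (subtract the max for numerical stability).
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the indices of the k most probable experts.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalize so the kept weights sum to 1.
    kept = sum(probs[i] for i in chosen)
    return {i: probs[i] / kept for i in chosen}

random.seed(0)
fake_logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # made-up scores
weights = route_token(fake_logits)
print(f"{len(weights)} of {NUM_EXPERTS} routed experts used for this token")
```

Every token goes through this selection independently, which is why all experts must stay loaded even though only a handful do work on any given step.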

The numbers Qwen published back that up:

| Benchmark / spec | Value |
| --- | --- |
| SWE-bench Verified | 73.4% |
| Terminal-Bench 2.0 | 51.5% |
| AIME 2026 | 92.6% |
| Context (native) | 262K tokens |
| Context (with YaRN) | up to 1M |
| License | Apache 2.0 |

The architecture mixes Gated DeltaNet linear attention with standard gated attention and the sparse MoE block, and it’s trained with Multi-Token Prediction. The model is also natively multimodal — it accepts text and images — which matters less for pure coding but is there if you want it.

If you’ve already set up Qwen3-Coder-Next locally, the workflow here is nearly identical, just with a smaller memory footprint.

Hardware requirements

MoE models front-load memory. Even though only ~3B parameters activate per token, all 35B weights sit in memory at load time. Quantization is what makes this fit on a single card. Here’s the realistic picture, drawn from community VRAM measurements and the GGUF file sizes on Hugging Face:

| Quant | Approx. size | Min. VRAM (short ctx) | Fits on |
| --- | --- | --- | --- |
| UD-Q4_K_XL (Unsloth dynamic) | ~19–20 GB | ~22 GB | 24 GB GPU |
| Q4_K_M | ~21 GB | ~24 GB | 24 GB GPU / 32 GB Mac |
| Q5_K_M | ~25 GB | ~28 GB | 32 GB Mac / 2× GPU |
| Q8_0 | ~37 GB | ~40 GB | 48 GB GPU / 64 GB Mac |
| BF16 | ~70 GB | ~80 GB | A100 80 GB / 96 GB Mac |
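The sizes in the table follow from simple arithmetic: parameter count times average bits per weight. A rough sketch is below. The bits-per-weight values are approximate community figures for these GGUF quant types, not numbers measured on this model, so treat the output as a sanity check rather than a spec.

```python
# Approximate average bits per weight for common GGUF quant types.
# ASSUMPTION: rough community figures, not exact values for this model.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
    "BF16": 16.0,
}

def weights_gb(total_params_billion, quant):
    """Estimate weight size in GB (10^9 bytes) for a given quant type."""
    bits = BITS_PER_WEIGHT[quant]
    # Billions of parameters times bytes per parameter gives GB directly.
    return total_params_billion * bits / 8

for quant in BITS_PER_WEIGHT:
    print(f"{quant:7s} ~{weights_gb(35, quant):.0f} GB")
```

The gap between these estimates and the “Min. VRAM” column is the room you need for the KV cache and compute buffers.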

For a single 24 GB card, such as an RTX 3090 or RTX 4090, Q4_K_M is the sweet spot (the RTX 5090 ships with 32 GB, so it has extra room for context). Unsloth’s dynamic UD-Q4_K_XL quant trims another 1–2 GB by quantizing non-critical tensors harder, which buys you headroom for context. On Apple Silicon, a Mac with 32 GB unified memory handles Q4_K_M comfortably.

One thing to plan for: that ~21 GB is weights only. The KV cache grows with context length, and at the model’s full 262K window it can add tens of GB. On a 24 GB card you’ll run Q4_K_M with maybe 16K–32K of context, not the full 262K. More on that in the OOM section.
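To see why context dominates, here is a back-of-the-envelope KV cache estimate. The formula is the standard one for attention layers that cache keys and values. The layer and head counts below are hypothetical placeholders chosen only to show the scaling, so read the real values from the model’s config before trusting any number. Layers that use linear attention keep a fixed-size state instead, so only the standard-attention layers should be counted.

```python
def kv_cache_gb(ctx_tokens, n_attn_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Estimate KV cache size in GB for standard attention layers.

    Each cached token stores one key vector and one value vector per KV head
    in every full-attention layer. bytes_per_elem=2 corresponds to f16.
    """
    per_token_bytes = 2 * n_attn_layers * n_kv_heads * head_dim * bytes_per_elem
    return ctx_tokens * per_token_bytes / 1e9

# HYPOTHETICAL shape, not taken from the model card.
shape = dict(n_attn_layers=48, n_kv_heads=8, head_dim=128)

for ctx in (16_384, 32_768, 262_144):
    print(f"{ctx:>7} tokens -> ~{kv_cache_gb(ctx, **shape):.1f} GB at f16")
```

Whatever the real shape is, the cache grows linearly with context, so going from 16K to 262K multiplies it by 16.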

If you don’t own a 24 GB card and don’t want to buy one, renting an RTX 4090 or A100 on RunPod costs a fraction of a dollar per hour and skips the local memory math entirely. For a deeper look at which GPU to actually buy for local LLMs, see the home lab GPU guides on runaihome.com.

Step 1: Install Ollama and pull the model

Ollama is the lowest-friction path. Install it (Linux shown; macOS and Windows have native installers):

curl -fsSL https://ollama.com/install.sh | sh

Confirm it installed correctly:

$ ollama --version
ollama version is 0.17.1

Then pull the model. The default tag is the Q4_K_M-class build:

ollama pull qwen3.6:35b-a3b

That download is about 24 GB, so give it a few minutes. Ollama publishes several tags if you want a specific footprint:

  • qwen3.6:35b-a3b — default, ~24 GB
  • qwen3.6:35b-a3b-q4_K_M — explicit Q4_K_M
  • qwen3.6:35b-a3b-mxfp8 — MXFP8, larger, higher quality
  • qwen3.6:35b-a3b-bf16 — full precision, needs 80 GB-class memory

Stick with the default unless you have a reason not to.

Step 2: Run it and check the output

Start an interactive session:

$ ollama run qwen3.6:35b-a3b
>>> Write a Python function that returns the nth Fibonacci number using memoization.

def fib(n, _cache={0: 0, 1: 1}):
    if n not in _cache:
        _cache[n] = fib(n - 1) + _cache.setdefault(n - 2, fib(n - 2))
    return _cache[n]

On a 24 GB GPU you should see generation in the ballpark of 40–70 tokens/second once the model is loaded — that MoE sparsity is the reason a “35B” model feels this quick. The first response is slower because the weights are loading into VRAM.
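If you want a number instead of a feeling, Ollama’s native /api/generate response reports eval_count (tokens generated) and eval_duration (nanoseconds). The sketch below turns those into tokens/second. The helper names are hypothetical, and measure assumes a server is already listening on the default port.

```python
import json
import urllib.request

def tokens_per_second(response):
    """Generation speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    seconds = response["eval_duration"] / 1e9
    return response["eval_count"] / seconds

def measure(prompt, model="qwen3.6:35b-a3b", host="http://localhost:11434"):
    """Send one non-streaming request to /api/generate and return tokens/second."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return tokens_per_second(json.load(resp))

# Made-up response for illustration: 512 tokens generated in 10 seconds.
sample = {"eval_count": 512, "eval_duration": 10_000_000_000}
print(f"{tokens_per_second(sample):.1f} tokens/second")
```

Run measure twice and keep the second result, since the first call includes the time spent loading weights.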

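The editor integrations talk to Ollama’s OpenAI-compatible API, which needs the Ollama server running. The Linux installer normally registers it as a background service, and the desktop apps start it for you. If nothing is listening on port 11434, start it manually: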

ollama serve

That exposes an endpoint at http://localhost:11434/v1. Quick sanity check with curl:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
  "model": "qwen3.6:35b-a3b",
  "messages": [{"role": "user", "content": "Say hello in one word."}]
}'
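The same check from Python, using only the standard library, looks roughly like this. It assumes the endpoint and model tag shown above; build_chat_request and chat are hypothetical helper names, and the response parsing follows the usual OpenAI chat-completions shape.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"   # Ollama's OpenAI-compatible endpoint
MODEL = "qwen3.6:35b-a3b"

def build_chat_request(prompt):
    """Build (but do not send) a chat-completions request for the local server."""
    body = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]

# Usage (requires the Ollama server to be running):
#   print(chat("Say hello in one word."))
```

Because the API shape matches OpenAI’s, most OpenAI client libraries should also work if you point their base URL at the same address.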

Sampling settings

Qwen ships recommended sampling parameters on the model card. Check it rather than guessing, because the 3.x series separates “thinking” and “non-thinking” defaults. As a rule, a lower temperature (around 0.6–0.7) suits coding and agentic work where you want consistent output. Greedy decoding (temperature 0) tends to cause repetition loops on this architecture, so avoid it. Set these in your client, or bake them into an Ollama Modelfile with a line such as PARAMETER temperature 0.7.
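As a sketch of the Modelfile route, the small generator below renders a Modelfile that pins temperature (and optionally context length) on top of the base tag. The function name render_modelfile and the 0.7 and 16384 values are illustrative; take the real defaults from the model card.

```python
def render_modelfile(base_model, temperature, num_ctx=None):
    """Return Modelfile text that pins sampling settings on top of a base model."""
    if temperature <= 0:
        # Guard against the greedy-decoding setting the article warns about.
        raise ValueError("temperature must be greater than 0")
    lines = [f"FROM {base_model}", f"PARAMETER temperature {temperature}"]
    if num_ctx is not None:
        lines.append(f"PARAMETER num_ctx {num_ctx}")
    return "\n".join(lines) + "\n"

# Illustrative values only; check the model card for recommended settings.
print(render_modelfile("qwen3.6:35b-a3b", temperature=0.7, num_ctx=16384))
```

Save the output as a file named Modelfile, then register it with ollama create qwen36-coding -f Modelfile (the name qwen36-coding is just an example) and select that name in your editor.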

Step 3: Wire it into your editor

The point of a local coding model is using it where you write code. Two solid open-source options, both covered in depth on our sister site for AI coding tools, aicoderscope.com:

Cline (VS Code agent): Install the Cline extension, open settings, choose “Ollama” as the API provider, set the base URL to http://localhost:11434, and pick qwen3.6:35b-a3b from the model list. Our Cline setup guide walks through the agentic config in detail.

Continue.dev (inline + chat): Add a block to ~/.continue/config.json:

{
  "models": [
    {
      "title": "Qwen3.6-35B-A3B",
      "provider": "ollama",
      "model": "qwen3.6:35b-a3b",
      "apiBase": "http://localhost:11434"
    }
  ]
}

Continue then uses it for chat, edits, and codebase questions. The Continue.dev + Ollama guide covers autocomplete config too.
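If you already have a Continue config with other models in it, pasting by hand risks clobbering them. A minimal merge helper might look like the following. It assumes the file is plain JSON without comments, and add_model is a hypothetical name, not part of Continue.

```python
import json
from pathlib import Path

QWEN_ENTRY = {
    "title": "Qwen3.6-35B-A3B",
    "provider": "ollama",
    "model": "qwen3.6:35b-a3b",
    "apiBase": "http://localhost:11434",
}

def add_model(config_path, entry=QWEN_ENTRY):
    """Insert or replace a model entry in a Continue-style config.json."""
    path = Path(config_path).expanduser()
    # Start from the existing config if there is one, else from an empty dict.
    config = json.loads(path.read_text()) if path.exists() else {}
    models = config.setdefault("models", [])
    # Replace an entry with the same title instead of adding a duplicate.
    models[:] = [m for m in models if m.get("title") != entry["title"]]
    models.append(entry)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config

# Usage:
#   add_model("~/.continue/config.json")
```

Running it twice is safe: the second call replaces the entry rather than appending another copy.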

For agentic tasks, the 262K context window is the real draw — you can feed a whole feature branch and let the model reason across files. Just remember the KV cache cost.

Step 4: The low-VRAM fallback (llama.cpp + CPU offload)

No 24 GB card? Because only ~3B parameters are active per token, you can leave the bulk of the MoE expert weights in system RAM, where the CPU computes them, and keep attention and the other dense layers on the GPU. llama.cpp supports this with --n-cpu-moe. One community writeup reported roughly 30 tokens/second on just 6 GB of VRAM with MoE layers offloaded to CPU, given enough system RAM.

Build llama.cpp, grab a GGUF (Unsloth’s Qwen3.6-35B-A3B-GGUF repo), then:

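./llama-server \
  -m Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  --n-gpu-layers 99 \
  --n-cpu-moe 28 \
  --ctx-size 16384 \
  --host 0.0.0.0 --port 8080

--n-cpu-moe sets how many layers keep their MoE expert weights on the CPU: a higher number means less VRAM used and slower generation. --n-gpu-layers 99 asks llama.cpp to place everything else on the GPU. --host 0.0.0.0 exposes the server to your whole network, so use 127.0.0.1 if only local tools connect. You’ll want 32+ GB of system RAM for this to be comfortable. If you’re new to choosing quants, the GGUF quantization guide explains the Q4/Q5/Q8 tradeoffs.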
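For a starting value instead of pure trial and error, you can estimate how many layers’ worth of experts have to leave the GPU. The sketch below assumes expert weights are spread evenly across layers. The expert share (0.9) and layer count (40) are hypothetical placeholders rather than published figures, and pick_n_cpu_moe is a made-up helper name.

```python
import math

def pick_n_cpu_moe(weights_gb, expert_share, n_layers, vram_gb, reserve_gb=2.0):
    """Smallest number of layers whose expert weights must move to the CPU.

    weights_gb   : total size of the GGUF file
    expert_share : fraction of the weights that are MoE experts (assumed)
    n_layers     : number of transformer layers (assumed)
    vram_gb      : GPU memory available
    reserve_gb   : headroom for KV cache and compute buffers
    """
    per_layer_gb = weights_gb * expert_share / n_layers
    overflow_gb = weights_gb + reserve_gb - vram_gb
    if overflow_gb <= 0:
        return 0  # everything fits, no offload needed
    # Cap at n_layers: beyond that there are no more experts to move.
    return min(n_layers, math.ceil(overflow_gb / per_layer_gb))

# HYPOTHETICAL inputs: a 20 GB file, 90% experts, 40 layers, an 8 GB GPU.
print(pick_n_cpu_moe(20, 0.9, 40, 8))
```

Treat the result as a first guess: start there, watch VRAM usage, then lower the number until the card is nearly full.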

Real problem: CUDA out of memory at long context

The most common failure isn’t loading the weights — it’s the context window. You pull the model, it runs fine on short prompts, then you paste a large file or turn on a long agentic session and Ollama dies with:

Error: CUDA error: out of memory

This is the KV cache, not the weights. At 24 GB, the ~21 GB of Q4_K_M weights leave only ~3 GB for cache, and the default context plus a big prompt overruns it. Three fixes that actually work:

  1. Cap the context. Set OLLAMA_CONTEXT_LENGTH to something your VRAM can hold — 16384 or 32768 rather than the full 262144:
    OLLAMA_CONTEXT_LENGTH=16384 ollama serve
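  2. Quantize the KV cache. Enable K/V cache quantization so the cache itself uses less memory. Ollama only applies this setting when Flash Attention is enabled, so set both:
    OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
    Compared with the default f16 cache, q8_0 roughly halves cache memory for a small quality cost.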
  3. Drop to UD-Q4_K_XL. Unsloth’s dynamic quant saves 1–2 GB of weight memory, which translates directly into more usable context on a 24 GB card.
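If you start Ollama from a script, the first two fixes can be combined in one launcher. The following is a sketch with hypothetical helper names. It also sets OLLAMA_FLASH_ATTENTION, which Ollama’s documentation lists as a prerequisite for KV cache quantization.

```python
import os
import subprocess

# KV cache types documented by Ollama.
KV_CACHE_TYPES = {"f16", "q8_0", "q4_0"}

def ollama_env(context_length=16384, kv_cache_type="q8_0"):
    """Return a copy of the environment with memory-saving Ollama settings."""
    if kv_cache_type not in KV_CACHE_TYPES:
        raise ValueError(f"unsupported KV cache type: {kv_cache_type}")
    env = dict(os.environ)
    env["OLLAMA_CONTEXT_LENGTH"] = str(context_length)
    env["OLLAMA_KV_CACHE_TYPE"] = kv_cache_type
    env["OLLAMA_FLASH_ATTENTION"] = "1"  # needed for a quantized KV cache
    return env

def serve(**settings):
    """Start `ollama serve` with the settings above; returns the process handle."""
    return subprocess.Popen(["ollama", "serve"], env=ollama_env(**settings))

# Usage:
#   server = serve(context_length=32768, kv_cache_type="q8_0")
```

If Ollama already runs as a system service, these variables have to be set on the service itself, since a second server cannot bind the same port.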

If you’re still tight, the honest answer is more memory — a 32 GB Mac or a 48 GB GPU removes the cliff entirely.

When NOT to use Qwen3.6-35B-A3B

  • You have a 16 GB or smaller GPU and want full speed. The CPU-offload path works, but generation is slower. A dense model like Qwen3.6-27B at Q4, or a smaller coder, will give you better tokens/second without the juggling.
  • You only need vision or single-shot chat. The MoE machinery is overkill for simple tasks; a 7–8B dense model is lighter and just as good for short prompts.
  • You need the full 262K context on a single 24 GB card. You can’t have both Q4 weights and a 262K KV cache in 24 GB. That’s a 48 GB-class workload.
  • You require greedy/deterministic decoding. This architecture repeats under temperature 0; if your pipeline demands deterministic output, test carefully or pick a different model.

For most developers with a 24 GB GPU who want a capable, Apache-licensed agentic coder running entirely offline, though, this is one of the best options available in mid-2026.

FAQ

Is Qwen3.6-35B-A3B free for commercial use? Yes. It’s released under Apache 2.0, which permits commercial use, modification, and redistribution without royalty.

How much VRAM do I actually need? About 21 GB for the Q4_K_M weights, so plan on a 24 GB GPU or 32 GB of unified memory for usable context. Unsloth’s UD-Q4_K_XL trims that to ~19–20 GB.

Why is a 35B model this fast? It’s a Mixture-of-Experts model. Only ~3B of the 35B parameters activate per token, so generation speed is closer to a 3B dense model even though all weights occupy memory.

Can I run it on a Mac? Yes. A Mac with 32 GB unified memory runs Q4_K_M well via Ollama or llama.cpp. Community reports describe a 20.9 GB Q4 build running on an M5 MacBook Pro.

What’s the difference between this and Qwen3.6-27B? The 27B is a dense model: slower per token but lighter on memory and friendlier to 16 GB GPUs. The 35B-A3B MoE is faster on 24 GB+ cards and stronger for agentic, long-context work.
