Jun 9, 2026

Qwen3-Coder-Next Local Setup Guide 2026: Ollama and GGUF

By AIFoss · 11 min read

ollamallmcodingselfhostedaigguf

TL;DR: Qwen3-Coder-Next hits 70.6% on SWE-bench Verified with only ~3B active parameters per inference pass — the efficiency gap vs. dense coding models has never been wider. It runs on Ollama with a single pull command, but you need roughly 48 GB of combined RAM/VRAM for Q4_K_M. The tradeoff: inference cost close to a 7B model, benchmark performance close to the frontier.

What you’ll have running after this guide:

Qwen3-Coder-Next serving a local OpenAI-compatible API via Ollama
Cline or Continue.dev wired to that endpoint for in-editor code generation
A llama.cpp fallback path for CPU-only machines with 64+ GB RAM

Why Qwen3-Coder-Next

Most open-source coding models force a choice between capability and hardware budget. Dense 70B models need 40+ GB of VRAM at Q4_K_M. Dense 7–14B models fit on a single consumer GPU but trail the frontier by a meaningful margin on real repo-level tasks.

Qwen3-Coder-Next navigates around this constraint. It’s an 80B Mixture-of-Experts (MoE) model from Alibaba’s Qwen team (released February 2026), with only ~3B parameters active per forward pass. You pay inference cost similar to a dense 8B model, while the 80B total weight pool gives the model far more specialized knowledge to draw from. The result: 70.6% on SWE-bench Verified, above GPT-4o and within range of the best proprietary coding agents.

License: Apache 2.0 — commercial use unrestricted.

Architecture, from the QwenLM/Qwen3-Coder GitHub repo:

Property	Value
Total parameters	80B
Active per forward pass	~3B
Expert count	64 (top-2 routing per token)
Layers	48 (hybrid GatedDeltaNet + MoE)
Context length	256K native, up to 1M with Yarn
SWE-bench Verified	70.6%
License	Apache 2.0

The 256K context is the practical differentiator for agentic coding. Loading a full feature branch across 10 files stops being a concern.

Hardware Requirements

MoE models front-load the memory. You pay for all 80B weights at load time even though only ~3B activate per token. GGUF quantization cuts the weight footprint significantly — here’s what each quantization tier actually needs:

Quantization	Est. File Size	Min VRAM + RAM	Notes
FP16	~174 GB	174 GB VRAM	Multi-GPU server only
FP8	~85 GB	85 GB VRAM	Dual A100 / H100
Q4_K_M	~48 GB	48 GB combined	Recommended local quant
Q3_K_M	~36 GB	36 GB combined	24 GB GPU + 16 GB RAM
Q2_K	~24 GB	30 GB combined	Last resort, visible quality drop

“Combined” means VRAM plus system RAM through llama.cpp’s split offload. A 24 GB GPU (RTX 3090) with 32 GB of DDR4/DDR5 system RAM can run Q4_K_M — roughly 40 layers on GPU, the rest on CPU. Expect 8–15 t/s in that split config. A machine with 48+ GB of unified VRAM (dual RTX 3090, Mac Studio M4 Max 128 GB, or a single A6000 48 GB) can load the full Q4_K_M on-device at 20–40 t/s.

If your hardware doesn’t reach these numbers, RunPod has A100 80 GB instances starting at ~$2.30/hr that run Q4_K_M fully on-GPU — useful for burst tasks or initial evaluation before committing to hardware.

Option 1: Ollama (Recommended)

Ollama wraps the model as a local API server with automatic GGUF download, GPU detection, and memory splitting. The path from zero to a working endpoint is:

Install

# Linux / macOS
curl -fsSL https://ollama.com/install.sh | sh

# verify
ollama --version

On Windows, download the installer from ollama.com. Ollama runs natively with CUDA — WSL2 is not required.

Pull and Run

ollama pull qwen3-coder-next
ollama run qwen3-coder-next

The default pull grabs a Q4_K_M GGUF (~48 GB download). The first run loads the weights and opens an interactive prompt:

>>> rewrite this loop to avoid the off-by-one: for i in range(1, len(arr)): print(arr[i+1])

The index arr[i+1] goes out of bounds when i == len(arr)-1. Fix:

for i in range(len(arr)):
    print(arr[i])

The HTTP API starts automatically on http://localhost:11434 — OpenAI-compatible, no configuration needed.

Reduce Context to Save VRAM

Ollama pre-allocates KV cache for the full declared context on load. At 256K tokens, that cache alone can push a 24 GB GPU over its limit. For most coding tasks, 32K context is enough — entire files, full error logs, surrounding functions.

# one-shot override
OLLAMA_NUM_CTX=32768 ollama run qwen3-coder-next

# ask the model something
>>> what does the render() method in this file do?

For a persistent config, write a Modelfile:

FROM qwen3-coder-next

PARAMETER num_ctx 32768
PARAMETER num_gpu 999
PARAMETER temperature 0.15

ollama create qwen3-coder-32k -f Modelfile
ollama run qwen3-coder-32k

num_gpu 999 tells Ollama to offload as many layers as VRAM allows. Set it to a specific integer (e.g., num_gpu 30) if you want to cap GPU layers and control CPU spillover explicitly.

Fix: CUDA Out of Memory

If Ollama exits with CUDA error: out of memory, work through these in order:

1. Context too large. Add PARAMETER num_ctx 8192 to your Modelfile. The KV cache is the main culprit on 24 GB cards.

2. Wrong quantization. Check if a smaller quant tag is available: ollama pull qwen3-coder-next:q3_k_m. Lower quant = smaller footprint, some quality loss.

3. VRAM already occupied. Run nvidia-smi to see what’s using VRAM. Close other models, browsers with WebGL, and any AI image gen tools running in the background.

4. Need to split to CPU. Add OLLAMA_GPU_OVERHEAD=1073741824 to your environment to reserve 1 GB of VRAM headroom before Ollama starts mapping layers. This tells the runtime to be more conservative about what fits on GPU.

Option 2: llama.cpp (CPU-Only or Fine-Grained GPU Control)

llama.cpp is the right choice if you’re on a CPU-only machine with 64+ GB RAM, or if you need manual control over the GPU/CPU layer split.

Build

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON     # drop CUDA flag for CPU-only
cmake --build build --config Release -j$(nproc)

Download the GGUF

The Unsloth GGUF repo hosts Q2_K through Q8_0 variants:

pip install huggingface_hub

huggingface-cli download unsloth/Qwen3-Coder-Next-GGUF \
  --include "Qwen3-Coder-Next-Q4_K_M.gguf" \
  --local-dir ./models

Replace Q4_K_M with Q3_K_M for tighter VRAM budgets or Q5_K_M if you have the headroom and want slightly sharper code generation. See the GGUF quantization guide for a primer on the format tradeoffs.

Start the Server

./build/bin/llama-server \
  --model ./models/Qwen3-Coder-Next-Q4_K_M.gguf \
  --ctx-size 32768 \
  --n-gpu-layers 40 \
  --chat-template qwen3 \
  --port 8080

--n-gpu-layers 40 offloads 40 transformer layers to GPU; the remainder runs on CPU RAM. Tune this number to your VRAM: start at half the model’s 48 layers and work up until you see a memory error, then back off by 5. Set to 999 to attempt full GPU offload.

The server exposes /v1/chat/completions on port 8080 — the same interface Ollama uses, so any tool configured for one works with the other.

Wire It Into Your Editor

Cline (VS Code)

In VS Code, open the Cline extension settings:

API Provider: OpenAI Compatible
Base URL: http://localhost:11434/v1
Model ID: qwen3-coder-next (or qwen3-coder-32k if you used the Modelfile)
API Key: ollama (any non-empty string)

The Cline Setup Guide 2026 covers .clinerules, plan/act mode tuning, and auto-approve thresholds in detail.

Continue.dev

In .continue/config.json:

{
  "models": [
    {
      "title": "Qwen3-Coder-Next",
      "provider": "ollama",
      "model": "qwen3-coder-next",
      "contextLength": 32768
    }
  ]
}

Set contextLength explicitly — without it, Continue.dev may attempt the full 256K, which causes OOM on 24 GB cards. The Continue.dev + Ollama Setup Guide has the full configuration walkthrough including autocomplete and inline diff settings.

Aider

pip install aider-chat

aider --model ollama/qwen3-coder-next \
      --ollama-api-base http://localhost:11434

Aider’s tree-sitter integration sends only the relevant code context rather than entire files, which keeps actual token usage well below 32K for most repos even when you’ve set a 256K window. See the Aider Setup Guide 2026 for repo map configuration and .aiderignore setup.

Performance by Hardware

These figures are from community reports in the unsloth/Qwen3-Coder-Next-GGUF HuggingFace discussions and r/LocalLLaMA threads (June 2026). Actual numbers depend on CPU frequency, RAM bandwidth, and active context size.

Hardware Setup	Quant	Est. Speed	Best For
RTX 4090 24 GB only	Q3_K_M	15–25 t/s	Single-dev coding agent
RTX 3090 + 32 GB RAM	Q4_K_M	8–15 t/s	Editing, short completions
Mac Studio M4 Max 128 GB	Q4_K_M (MLX)	22–28 t/s	Best consumer single-node
96 GB RAM, CPU-only	Q4_K_M	8–12 t/s	No-GPU fallback
RunPod A100 80 GB	Q4_K_M	40–55 t/s	Burst tasks, large context

Note: MLX-optimized inference on Apple Silicon requires running via MLX rather than Ollama’s default GGUF backend. Check the Ollama MLX backend guide for the setup steps — it roughly doubles tok/s on M-series hardware.

The MoE routing means tokens-per-second scales better with multiple GPU layers than a comparable dense model, because routing decisions are cheap. Once Q4_K_M is fully on-device, the active 3B compute path is fast.

When NOT to Use Qwen3-Coder-Next

Less than 30 GB combined RAM + VRAM. Q3_K_M is the practical floor. Below that, code quality degrades in ways that matter — incomplete function bodies, wrong import paths, hallucinated library APIs. At that budget, Qwen3-Coder-30B-A3B-Instruct (ollama pull qwen3-coder:30b-a3b) is the right alternative — same MoE efficiency in an 18 GB package.

IDE autocomplete with sub-100ms latency requirements. The 80B weight load takes 15–30 seconds to initialize and memory-maps a large file on each startup. For completion-as-you-type, a dense 7B model like Qwen2.5-Coder-7B responds faster despite lower benchmark scores. Qwen3-Coder-Next shines at task-level agentic work, not keystroke-level suggestions.

Context beyond 256K. Ollama doesn’t currently expose the Yarn rope-scaling config needed for the 1M extension. For very long-context jobs, use the llama.cpp server directly with --rope-scaling yarn --rope-scale 4.

Multi-user team deployments. Ollama serializes requests — one user at a time. For a team of 3+, the throughput bottleneck becomes real. Switch to vLLM with MoE kernel support, covered in the vLLM Setup Guide 2026. For the broader team stack decision, the self-hosted AI stack for dev teams guide lays out code completion + chat + RAG together.

FAQ

Does Qwen3-Coder-Next support tool calling?

Yes — the model was agentic-trained at scale and supports OpenAI-format function calling. Ollama exposes this through /api/chat. Cline and Aider both use it automatically when the provider reports tool support. The model works as a drop-in for any agent scaffold that expects an OpenAI-compatible endpoint.

Can I run it on a single RTX 4090?

Not fully in VRAM. Q4_K_M needs ~48 GB; the 4090’s 24 GB holds roughly half the layers. Ollama splits the remainder to CPU RAM and continues working — just slower. Expect 8–18 t/s on a 4090 + 32 GB RAM split config, versus 35+ t/s on a fully GPU-resident setup.

What’s the difference between Qwen3-Coder-Next and Qwen3-Coder-480B?

Qwen3-Coder-480B-A35B is the larger MoE variant (480B total / 35B active) aimed at server deployments and inference providers. Qwen3-Coder-Next (80B / ~3B active) is the consumer-hardware variant. Both carry Apache 2.0 licenses. The 480B model isn’t realistic for single-machine local use.

Is thinking mode available?

Yes. Append /think to your prompt or set enable_thinking: true in the chat API payload. Thinking adds noticeable latency — budget 30–60 extra seconds on a 15 t/s system — but measurably improves performance on complex multi-step problems. For fast line-level edits, skip it.

How does this compare to other open-source coding models?

At 70.6% SWE-bench Verified, Qwen3-Coder-Next leads the open-source field as of its February 2026 release. Devstral Small 2 (68%, 24B) and DeepSeek-V3 are the nearest open-weight alternatives. Cloud agents (Claude Code, Cursor) still lead on complex multi-file architectural tasks — local is a cost and privacy trade-off, not a capability claim. The open-source coding agents overview benchmarks these in context.

Sources

Recommended Gear

Was this article helpful?