Jun 8, 2026

Devstral Small 2 Review 2026: 68% SWE-bench on RTX 4090

By AIFoss · 11 min read

devstralmistrallocal-llmcodingollamaselfhosted

TL;DR: Devstral Small 2 is a 24B Apache 2.0 coding model from Mistral that scores 68% on SWE-bench Verified — a serious benchmark for a local model — and runs on a single RTX 4090. If you want an open-weight coding agent that keeps code on your machine and handles multi-file edits, this is the strongest 24B option available as of June 2026. The catch: it’s purpose-built for agentic software engineering tasks, not casual code completion.

	Devstral Small 2	Devstral 2 (123B)	Claude Sonnet 4.5
Best for	Local deployment, solo devs	Team servers, max quality	Cloud API, highest accuracy
SWE-bench Verified	68.0%	72.2%	77.2%
VRAM (Q4_K_M)	~14 GB	~70 GB+	API only
License	Apache 2.0	Modified MIT	Proprietary
Context window	256K	256K	200K
API cost (input/1M)	$0.10	$0.40	~$3.00

Honest take: For local coding agents with a single consumer GPU, Devstral Small 2 is the model to run in mid-2026. It won’t match Claude Sonnet 4.5 on hard tasks, but it costs you nothing per token and keeps your code off the internet.

What Devstral Small 2 Is

Mistral released Devstral Small 2 on December 9, 2025, alongside its larger sibling Devstral 2 (123B) and the Mistral Vibe CLI. Where Devstral 2 targets multi-GPU servers, Small 2 targets single-GPU workstations.

The model is fine-tuned specifically for software engineering agent tasks: exploring codebases, editing multiple files in a single pass, and calling tools in agentic loops. It handles those tasks differently from a general-purpose chat model — it’s optimized to read file trees, understand diffs, and apply targeted edits rather than generate boilerplate from scratch.

Key specs (tested on Devstral-Small-2-24B-Instruct-2512):

Parameters: 24B
Context window: 256K tokens
License: Apache 2.0 — commercial use allowed, no revenue threshold restrictions
Released: December 9, 2025
Ollama tag: devstral-small-2

The Apache 2.0 license is meaningful here. The 123B Devstral 2 ships under a modified MIT license that restricts organizations with over $20M in monthly revenue. Small 2 has no such clause — you can deploy it commercially without legal review.

Benchmark Reality Check

68.0% on SWE-bench Verified is the headline number. Here’s what that actually means.

SWE-bench Verified tests models on real GitHub issues from popular Python repositories. A successful “resolve” means the model read the issue, edited the codebase, and passed the existing test suite — without being given the solution. It’s a meaningful proxy for agentic software engineering capability.

For reference:

GPT-4o: ~38% at launch (early 2024 snapshot)
Claude Sonnet 3.5: ~49% at launch
Devstral Small 2 (24B, local): 68.0%
Devstral 2 (123B, API/server): 72.2%
Claude Sonnet 4.5 (API, current): 77.2%

A 4.2-point gap between Small 2 and the 123B version is smaller than you’d expect given a 5x parameter difference. The large gap vs. GPT-4o and older Claude versions reflects how much Mistral specialized this model for software agent tasks. General-purpose models trained to be chatty assistants perform worse on this benchmark than a 24B model trained specifically to edit files.

The benchmark also doesn’t tell you everything. On tasks requiring deep reasoning across a large unfamiliar codebase, or multi-file refactors that span many files, you’ll notice the quality gap between 68% and 77% more clearly. For standalone functions, unit tests, and targeted bug fixes, the difference is often imperceptible.

Installation: Ollama in 3 Commands

Ollama is the fastest path to running Devstral Small 2 locally. If you don’t have Ollama installed, the full Ollama setup guide covers it on Linux, macOS, and Windows.

# Pull the model (Q4_K_M by default, ~15 GB)
ollama pull devstral-small-2

# Run interactively
ollama run devstral-small-2

# Or specify a tag explicitly
ollama pull devstral-small-2:24b-instruct-2512-q4_K_M

Quantization options and VRAM requirements:

Quantization	Size	Min VRAM	Quality
Q4_K_M (default)	~15 GB	16 GB	Good for coding tasks
Q6_K	~20 GB	22 GB	Noticeably better on complex edits
Q8_0	~26 GB	28 GB	Near-lossless
FP16	~48 GB	50 GB	Reference quality, multi-GPU only

The Q4_K_M default fits comfortably on an RTX 4090 (24 GB). A Mac Mini M4 Pro with 48 GB unified memory can run Q8_0 with headroom. If you have a 16 GB GPU, Q4_K_M fits but you’ll be tight on context — longer files will cause slowdowns.

For a deeper look at how quantization levels affect output quality for coding tasks, see the GGUF quantization guide.

Once pulled, test the model:

ollama run devstral-small-2 "Write a Python function that finds all duplicate entries in a list of dicts by a given key."

Expect output immediately — Ollama’s built-in GGUF runtime handles tool-calling setup automatically.

Use It With Aider

Aider is where Devstral Small 2 actually shines. Its architect mode is a close match for how the model was designed to work: read the codebase, plan the edit, then apply it.

Install Aider if you haven’t already — the Aider setup guide covers full configuration. Then point it at your local Ollama instance:

# Via local Ollama
aider --model ollama/devstral-small-2:latest

# With explicit context and editor model split
aider \
  --model ollama/devstral-small-2:latest \
  --architect \
  --editor-model ollama/devstral-small-2:latest

The --architect flag puts Aider in a two-step mode: the first call plans the edit, the second applies it. This maps well to how Devstral was trained — it expects to reason about a file tree before making changes.

One practical note: Devstral Small 2 generates longer “thinking” sections when given open-ended architecture questions. For targeted bug fixes (aider --message "fix the race condition in worker.py line 42"), it’s fast and accurate. For open-ended feature requests on a large codebase, give it explicit file context with aider file1.py file2.py rather than letting it figure out which files to open on its own.

Use It With Continue.dev

Continue.dev can use Devstral Small 2 via Ollama’s OpenAI-compatible API. The model’s 256K context window is an advantage here — you can index larger files as context without hitting limits.

In VS Code, open your Continue config (~/.continue/config.json) and add:

{
  "models": [
    {
      "title": "Devstral Small 2 (local)",
      "provider": "ollama",
      "model": "devstral-small-2:latest",
      "apiBase": "http://localhost:11434"
    }
  ]
}

For the agent tab in Continue 0.9+, set it as the default agent model — the tool-calling support Devstral was trained on maps directly to Continue’s agent tool loop. In the VS Code sidebar, select the model from the dropdown and switch to Agent mode.

If you’re using Continue with a team and want a shared Ollama instance, run Ollama on the server with OLLAMA_HOST=0.0.0.0 ollama serve and point the apiBase at the server IP. See the Continue.dev + Ollama guide for multi-user setup details.

Use It With Mistral Vibe CLI

Mistral Vibe is the native CLI that shipped alongside Devstral 2. It’s open-source (MIT license, available on GitHub), built specifically for Devstral, and runs in your terminal without an IDE.

Install and configure it:

# Install via pip
pip install mistral-vibe

# Point at local Ollama (no API key needed)
export OPENAI_BASE_URL="http://localhost:11434/v1"
export OPENAI_API_KEY="ollama"

# Run in your project directory
vibe

Inside the Vibe session, you get file read/write tools, grep, git operations, and shell execution — all driven by Devstral’s tool-calling capability. The --offline flag disables any telemetry and prevents outbound connections.

Vibe is the lowest-friction path if you want a full coding agent loop without integrating Aider or Continue. The trade-off is that it’s newer and less tested than Aider at multi-file refactors. For greenfield code and bug fixes in small codebases, it works well out of the box.

If you want to run Devstral on a cloud GPU rather than locally — useful for testing before committing to hardware — RunPod supports deploying Ollama or vLLM on a single A100 or H100.

When NOT to Use Devstral Small 2

Don’t use it for general-purpose chat. Devstral is specifically fine-tuned for software engineering agent tasks. For summarizing documents, answering questions, or creative writing, a general-purpose 24B model (Qwen2.5-Instruct, Mistral-Small) will perform better.

Don’t use it as inline code completion. Models like Qwen2.5-Coder or DeepSeek-Coder are optimized for fast, token-efficient completions that fill in the next line. Devstral is optimized for multi-file planning and edits — it’s slower and more deliberate than you want for autocomplete.

Don’t expect Claude-level results on hard architecture tasks. The 9-point gap between Devstral Small 2 (68%) and Claude Sonnet 4.5 (77%) compounds on complex multi-file refactors or large unfamiliar codebases. If code quality at the top end matters for your use case, either run the full 123B Devstral 2 on a server or use the Mistral API.

Don’t use it on 8 GB VRAM. The Q4_K_M build needs ~14–15 GB of VRAM. It won’t fit on an RTX 3060 (12 GB) or anything smaller unless you offload layers to CPU, which will be unacceptably slow for interactive use.

Devstral Small 2 vs. Alternatives

Model	SWE-bench	VRAM (Q4)	License	Coding specialty
Devstral Small 2 (24B)	68.0%	~14 GB	Apache 2.0	Yes — agent tasks
Qwen2.5-Coder-32B	~65%*	~20 GB	Apache 2.0	Completion + agent
DeepSeek-Coder-V2-Lite (16B)	~40%*	~10 GB	DeepSeek License	Completion
Mistral-Small-3.1 (24B)	~55%*	~14 GB	Apache 2.0	General purpose
Devstral 2 (123B)	72.2%	~70 GB	Modified MIT	Yes — agent tasks

*Approximate, from community benchmarks; verify against current leaderboards before choosing.

Devstral Small 2 is the best option when your constraint is “fit in one 24 GB GPU, run agent tasks well.” If you need completion rather than agentic editing, Qwen2.5-Coder is a better fit. If the 24 GB VRAM constraint is too tight, consider a RunPod A40 (48 GB) where you can run Q8_0 or even FP16.

FAQ

Can I fine-tune Devstral Small 2 on my own codebase?
Yes. The Apache 2.0 license allows fine-tuning without restrictions. GGUF variants are available via the Unsloth and byteshape repositories on Hugging Face. The most practical approach is LoRA fine-tuning on your codebase’s style and conventions using Unsloth — the same workflow as any other 24B model. See the Llama 3 fine-tuning guide with Unsloth for the general process.

Is Devstral Small 2 the same as the original Devstral Small?
No. The original Devstral Small (released mid-2025) is a separate, smaller model. Devstral Small 2 is the second generation, released December 2025, with a significantly higher SWE-bench score. The Ollama tag devstral-small-2 is distinct from devstral.

Does the 256K context window actually work at full length locally?
Practically speaking, at full 256K you’ll run out of VRAM before you run out of context. On a 24 GB GPU with Q4_K_M, effective context before KV-cache pressure causes slowdowns is closer to 32K–64K. The 256K spec is the theoretical maximum; useful for smaller but long files rather than loading an entire large monorepo.

Does it support function/tool calling?
Yes. Devstral was trained with Mistral’s tool-calling format. Ollama handles this transparently. If you’re calling it via the OpenAI-compatible endpoint (http://localhost:11434/v1), the tools parameter in the chat completion API works as expected.

Can I use it on macOS without an NVIDIA GPU?
Yes — Apple Silicon (M2 Pro+ with 32 GB RAM, M3 Max, M4 Pro) can run the Q4_K_M build via Ollama’s Metal backend. A Mac Mini M4 Pro with 48 GB is particularly well-suited; the Q6_K build (~20 GB) fits comfortably with room for the OS.

Sources

Recommended Gear

Was this article helpful?