Devstral Small 2 Review 2026: 68% SWE-bench on RTX 4090
TL;DR: Devstral Small 2 is a 24B Apache 2.0 coding model from Mistral that scores 68% on SWE-bench Verified — a serious benchmark for a local model — and runs on a single RTX 4090. If you want an open-weight coding agent that keeps code on your machine and handles multi-file edits, this is the strongest 24B option available as of June 2026. The catch: it’s purpose-built for agentic software engineering tasks, not casual code completion.
| Devstral Small 2 | Devstral 2 (123B) | Claude Sonnet 4.5 | |
|---|---|---|---|
| Best for | Local deployment, solo devs | Team servers, max quality | Cloud API, highest accuracy |
| SWE-bench Verified | 68.0% | 72.2% | 77.2% |
| VRAM (Q4_K_M) | ~14 GB | ~70 GB+ | API only |
| License | Apache 2.0 | Modified MIT | Proprietary |
| Context window | 256K | 256K | 200K |
| API cost (input/1M) | $0.10 | $0.40 | ~$3.00 |
Honest take: For local coding agents with a single consumer GPU, Devstral Small 2 is the model to run in mid-2026. It won’t match Claude Sonnet 4.5 on hard tasks, but it costs you nothing per token and keeps your code off the internet.
What Devstral Small 2 Is
Mistral released Devstral Small 2 on December 9, 2025, alongside its larger sibling Devstral 2 (123B) and the Mistral Vibe CLI. Where Devstral 2 targets multi-GPU servers, Small 2 targets single-GPU workstations.
The model is fine-tuned specifically for software engineering agent tasks: exploring codebases, editing multiple files in a single pass, and calling tools in agentic loops. It handles those tasks differently from a general-purpose chat model — it’s optimized to read file trees, understand diffs, and apply targeted edits rather than generate boilerplate from scratch.
Key specs (tested on Devstral-Small-2-24B-Instruct-2512):
- Parameters: 24B
- Context window: 256K tokens
- License: Apache 2.0 — commercial use allowed, no revenue threshold restrictions
- Released: December 9, 2025
- Ollama tag:
devstral-small-2
The Apache 2.0 license is meaningful here. The 123B Devstral 2 ships under a modified MIT license that restricts organizations with over $20M in monthly revenue. Small 2 has no such clause — you can deploy it commercially without legal review.
Benchmark Reality Check
68.0% on SWE-bench Verified is the headline number. Here’s what that actually means.
SWE-bench Verified tests models on real GitHub issues from popular Python repositories. A successful “resolve” means the model read the issue, edited the codebase, and passed the existing test suite — without being given the solution. It’s a meaningful proxy for agentic software engineering capability.
For reference:
- GPT-4o: ~38% at launch (early 2024 snapshot)
- Claude Sonnet 3.5: ~49% at launch
- Devstral Small 2 (24B, local): 68.0%
- Devstral 2 (123B, API/server): 72.2%
- Claude Sonnet 4.5 (API, current): 77.2%
A 4.2-point gap between Small 2 and the 123B version is smaller than you’d expect given a 5x parameter difference. The large gap vs. GPT-4o and older Claude versions reflects how much Mistral specialized this model for software agent tasks. General-purpose models trained to be chatty assistants perform worse on this benchmark than a 24B model trained specifically to edit files.
The benchmark also doesn’t tell you everything. On tasks requiring deep reasoning across a large unfamiliar codebase, or multi-file refactors that span many files, you’ll notice the quality gap between 68% and 77% more clearly. For standalone functions, unit tests, and targeted bug fixes, the difference is often imperceptible.
Installation: Ollama in 3 Commands
Ollama is the fastest path to running Devstral Small 2 locally. If you don’t have Ollama installed, the full Ollama setup guide covers it on Linux, macOS, and Windows.
# Pull the model (Q4_K_M by default, ~15 GB)
ollama pull devstral-small-2
# Run interactively
ollama run devstral-small-2
# Or specify a tag explicitly
ollama pull devstral-small-2:24b-instruct-2512-q4_K_M
Quantization options and VRAM requirements:
| Quantization | Size | Min VRAM | Quality |
|---|---|---|---|
| Q4_K_M (default) | ~15 GB | 16 GB | Good for coding tasks |
| Q6_K | ~20 GB | 22 GB | Noticeably better on complex edits |
| Q8_0 | ~26 GB | 28 GB | Near-lossless |
| FP16 | ~48 GB | 50 GB | Reference quality, multi-GPU only |
The Q4_K_M default fits comfortably on an RTX 4090 (24 GB). A Mac Mini M4 Pro with 48 GB unified memory can run Q8_0 with headroom. If you have a 16 GB GPU, Q4_K_M fits but you’ll be tight on context — longer files will cause slowdowns.
For a deeper look at how quantization levels affect output quality for coding tasks, see the GGUF quantization guide.
Once pulled, test the model:
ollama run devstral-small-2 "Write a Python function that finds all duplicate entries in a list of dicts by a given key."
Expect output immediately — Ollama’s built-in GGUF runtime handles tool-calling setup automatically.
Use It With Aider
Aider is where Devstral Small 2 actually shines. Its architect mode is a close match for how the model was designed to work: read the codebase, plan the edit, then apply it.
Install Aider if you haven’t already — the Aider setup guide covers full configuration. Then point it at your local Ollama instance:
# Via local Ollama
aider --model ollama/devstral-small-2:latest
# With explicit context and editor model split
aider \
--model ollama/devstral-small-2:latest \
--architect \
--editor-model ollama/devstral-small-2:latest
The --architect flag puts Aider in a two-step mode: the first call plans the edit, the second applies it. This maps well to how Devstral was trained — it expects to reason about a file tree before making changes.
One practical note: Devstral Small 2 generates longer “thinking” sections when given open-ended architecture questions. For targeted bug fixes (aider --message "fix the race condition in worker.py line 42"), it’s fast and accurate. For open-ended feature requests on a large codebase, give it explicit file context with aider file1.py file2.py rather than letting it figure out which files to open on its own.
Use It With Continue.dev
Continue.dev can use Devstral Small 2 via Ollama’s OpenAI-compatible API. The model’s 256K context window is an advantage here — you can index larger files as context without hitting limits.
In VS Code, open your Continue config (~/.continue/config.json) and add:
{
"models": [
{
"title": "Devstral Small 2 (local)",
"provider": "ollama",
"model": "devstral-small-2:latest",
"apiBase": "http://localhost:11434"
}
]
}
For the agent tab in Continue 0.9+, set it as the default agent model — the tool-calling support Devstral was trained on maps directly to Continue’s agent tool loop. In the VS Code sidebar, select the model from the dropdown and switch to Agent mode.
If you’re using Continue with a team and want a shared Ollama instance, run Ollama on the server with OLLAMA_HOST=0.0.0.0 ollama serve and point the apiBase at the server IP. See the Continue.dev + Ollama guide for multi-user setup details.
Use It With Mistral Vibe CLI
Mistral Vibe is the native CLI that shipped alongside Devstral 2. It’s open-source (MIT license, available on GitHub), built specifically for Devstral, and runs in your terminal without an IDE.
Install and configure it:
# Install via pip
pip install mistral-vibe
# Point at local Ollama (no API key needed)
export OPENAI_BASE_URL="http://localhost:11434/v1"
export OPENAI_API_KEY="ollama"
# Run in your project directory
vibe
Inside the Vibe session, you get file read/write tools, grep, git operations, and shell execution — all driven by Devstral’s tool-calling capability. The --offline flag disables any telemetry and prevents outbound connections.
Vibe is the lowest-friction path if you want a full coding agent loop without integrating Aider or Continue. The trade-off is that it’s newer and less tested than Aider at multi-file refactors. For greenfield code and bug fixes in small codebases, it works well out of the box.
If you want to run Devstral on a cloud GPU rather than locally — useful for testing before committing to hardware — RunPod supports deploying Ollama or vLLM on a single A100 or H100.
When NOT to Use Devstral Small 2
Don’t use it for general-purpose chat. Devstral is specifically fine-tuned for software engineering agent tasks. For summarizing documents, answering questions, or creative writing, a general-purpose 24B model (Qwen2.5-Instruct, Mistral-Small) will perform better.
Don’t use it as inline code completion. Models like Qwen2.5-Coder or DeepSeek-Coder are optimized for fast, token-efficient completions that fill in the next line. Devstral is optimized for multi-file planning and edits — it’s slower and more deliberate than you want for autocomplete.
Don’t expect Claude-level results on hard architecture tasks. The 9-point gap between Devstral Small 2 (68%) and Claude Sonnet 4.5 (77%) compounds on complex multi-file refactors or large unfamiliar codebases. If code quality at the top end matters for your use case, either run the full 123B Devstral 2 on a server or use the Mistral API.
Don’t use it on 8 GB VRAM. The Q4_K_M build needs ~14–15 GB of VRAM. It won’t fit on an RTX 3060 (12 GB) or anything smaller unless you offload layers to CPU, which will be unacceptably slow for interactive use.
Devstral Small 2 vs. Alternatives
| Model | SWE-bench | VRAM (Q4) | License | Coding specialty |
|---|---|---|---|---|
| Devstral Small 2 (24B) | 68.0% | ~14 GB | Apache 2.0 | Yes — agent tasks |
| Qwen2.5-Coder-32B | ~65%* | ~20 GB | Apache 2.0 | Completion + agent |
| DeepSeek-Coder-V2-Lite (16B) | ~40%* | ~10 GB | DeepSeek License | Completion |
| Mistral-Small-3.1 (24B) | ~55%* | ~14 GB | Apache 2.0 | General purpose |
| Devstral 2 (123B) | 72.2% | ~70 GB | Modified MIT | Yes — agent tasks |
*Approximate, from community benchmarks; verify against current leaderboards before choosing.
Devstral Small 2 is the best option when your constraint is “fit in one 24 GB GPU, run agent tasks well.” If you need completion rather than agentic editing, Qwen2.5-Coder is a better fit. If the 24 GB VRAM constraint is too tight, consider a RunPod A40 (48 GB) where you can run Q8_0 or even FP16.
FAQ
Can I fine-tune Devstral Small 2 on my own codebase?
Yes. The Apache 2.0 license allows fine-tuning without restrictions. GGUF variants are available via the Unsloth and byteshape repositories on Hugging Face. The most practical approach is LoRA fine-tuning on your codebase’s style and conventions using Unsloth — the same workflow as any other 24B model. See the Llama 3 fine-tuning guide with Unsloth for the general process.
Is Devstral Small 2 the same as the original Devstral Small?
No. The original Devstral Small (released mid-2025) is a separate, smaller model. Devstral Small 2 is the second generation, released December 2025, with a significantly higher SWE-bench score. The Ollama tag devstral-small-2 is distinct from devstral.
Does the 256K context window actually work at full length locally?
Practically speaking, at full 256K you’ll run out of VRAM before you run out of context. On a 24 GB GPU with Q4_K_M, effective context before KV-cache pressure causes slowdowns is closer to 32K–64K. The 256K spec is the theoretical maximum; useful for smaller but long files rather than loading an entire large monorepo.
Does it support function/tool calling?
Yes. Devstral was trained with Mistral’s tool-calling format. Ollama handles this transparently. If you’re calling it via the OpenAI-compatible endpoint (http://localhost:11434/v1), the tools parameter in the chat completion API works as expected.
Can I use it on macOS without an NVIDIA GPU?
Yes — Apple Silicon (M2 Pro+ with 32 GB RAM, M3 Max, M4 Pro) can run the Q4_K_M build via Ollama’s Metal backend. A Mac Mini M4 Pro with 48 GB is particularly well-suited; the Q6_K build (~20 GB) fits comfortably with room for the OS.
Sources
- Introducing Devstral 2 and Mistral Vibe CLI — Mistral AI
- Devstral Small 2 Model Card — Mistral Docs
- Mistral launches Devstral 2 including open-source, laptop-friendly version — VentureBeat
- Devstral-Small-2-24B-Instruct-2512 — Hugging Face
- Devstral Small 2 GGUF — Unsloth on Hugging Face
- devstral-small-2 — Ollama Library
- Mistral Vibe CLI — GitHub
- Devstral 2 vs. Devstral Small 2 Model Comparison — Artificial Analysis
- Devstral Small 2 Guide — AIMadeTools
Recommended Gear
Was this article helpful?
Thanks for the feedback — it helps improve future articles.
Need hands-on help?
I offer 1-on-1 technical consulting for local AI setup, GPU selection, and AI coding tool configuration — same topics covered on this site.
Book a session — $49 / hour →