Continue.dev + Ollama Setup Guide: Pair Programming Offline
The setup takes about 15 minutes. After it, every keypress in VS Code can route through a model running on your own machine — no API key, no usage bill, no code leaving your network.
What you get at the end:
- A chat panel in the editor sidebar backed by a local 7B coding model
- Tab completion that triggers as you type, with sub-400 ms latency on an 8 GB GPU
- @Codebase search that indexes your project and retrieves relevant context before answering
The two tools: Ollama handles model downloads and inference via a local REST API. Continue.dev is the IDE extension that connects your editor to that API. Versions tested: Ollama v0.24.0 (May 14, 2026) and Continue.dev v1.2.22-vscode (March 27, 2026).
Hardware reality check
CPU-only is possible for chat. It is not usable for autocomplete. A 7B model on a modern CPU generates 3–5 tokens per second, and a useful autocomplete suggestion needs 50+ tokens, so each suggestion takes roughly 10–17 seconds to arrive. That kills the flow completely.
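The wait time is plain division of suggestion length by generation speed. A minimal sketch using the figures quoted above; the function name is illustrative, and the estimate ignores prompt processing and model load time:

```python
def seconds_per_suggestion(tokens: int, tokens_per_second: float) -> float:
    """Time to generate one suggestion, ignoring prompt processing and model load."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens / tokens_per_second


# A 50-token suggestion at the CPU speeds quoted above
slow = seconds_per_suggestion(50, 3)  # about 16.7 s
fast = seconds_per_suggestion(50, 5)  # 10.0 s
print(f"{fast:.1f} to {slow:.1f} seconds per suggestion")
```

Longer suggestions scale linearly, so a 100-token completion doubles the wait.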
| Setup | RAM | GPU | What’s usable |
|---|---|---|---|
| CPU-only | 8 GB | None | Chat only, 3B models |
| CPU-only | 16 GB | None | Chat only, 7B models at 3–5 tok/s |
| GPU (8 GB VRAM) | 16 GB | RTX 3060 / 4060 or equivalent | Chat + autocomplete, 7B models |
| GPU (16 GB VRAM) | 32 GB | RTX 3090 / 4080 or equivalent | Chat + autocomplete, 13B models |
The sweet spot is an 8 GB GPU. qwen2.5-coder:7b at q4_K_M quantization is a 4.7 GB download and needs roughly that much VRAM for its weights, leaving room for the OS and the smaller autocomplete model. An RTX 4060 handles this comfortably at the budget end of the market.
If you want to test the full stack before committing to hardware, RunPod rents GPU instances by the hour — an A10 with 24 GB VRAM runs everything in this guide with headroom to spare.
Step 1: Install Ollama
Linux (one command):
curl -fsSL https://ollama.com/install.sh | sh
The installer detects NVIDIA CUDA, AMD ROCm, and Vulkan automatically. It registers a systemd service that starts Ollama on boot — no daemon management required after this.
macOS: Download the .dmg from ollama.com and drag it to Applications. Ollama runs as a menu bar app.
Windows: Download the .exe installer from ollama.com. It adds an icon to the system tray.
Verify Ollama is up:
ollama --version
# ollama version is 0.24.0
Ollama listens on http://localhost:11434 by default. Continue.dev connects to this address automatically — you don’t touch the API directly.
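If you would rather confirm the server from a script than from the CLI, Ollama's REST API exposes a version endpoint. A minimal sketch using only the Python standard library; it assumes the default address above, and the function names are illustrative:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address from this guide


def parse_version(body: bytes) -> str:
    """Extract the version string from an /api/version response body."""
    return json.loads(body)["version"]


def ollama_version(base_url: str = OLLAMA_URL, timeout: float = 3.0) -> str:
    """Ask a running Ollama server for its version. Raises OSError if unreachable."""
    with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
        return parse_version(resp.read())


if __name__ == "__main__":
    try:
        print("Ollama is up, version", ollama_version())
    except (OSError, ValueError, KeyError) as exc:
        # URLError and timeouts are OSError subclasses; the others cover a malformed reply
        print("Could not reach Ollama:", exc)
```

A connection error here usually means the service is not running or is listening on a different port.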
For a full breakdown of what Ollama can do beyond this setup, see the Ollama 2026 review.
Step 2: Pull the three models
This setup uses separate models for chat, autocomplete, and embeddings. Using one model for everything is possible but causes queuing problems — when you’re mid-keystroke and autocomplete fires, you don’t want it competing with an ongoing chat request.
# Chat and inline editing — 4.7 GB
ollama pull qwen2.5-coder:7b
# Autocomplete — fast and small, 986 MB
ollama pull qwen2.5-coder:1.5b
# Codebase embeddings — 274 MB
ollama pull nomic-embed-text
Total storage: about 6 GB. Models land in ~/.ollama/models on Linux/macOS, C:\Users\<username>\.ollama\models on Windows.
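The storage total, and a rough idea of how the three models fit on an 8 GB card, can be checked with a few lines. This is a sketch under a loud assumption: download size is used as a stand-in for memory use, which understates real VRAM usage because the context cache and runtime overhead come on top.

```python
# Download sizes in GB, as quoted in the pull commands above
MODEL_SIZES_GB = {
    "qwen2.5-coder:7b": 4.7,
    "qwen2.5-coder:1.5b": 0.986,
    "nomic-embed-text": 0.274,
}


def total_gb(sizes: dict[str, float]) -> float:
    """Sum of model sizes in GB."""
    return sum(sizes.values())


def headroom_gb(vram_gb: float, sizes: dict[str, float]) -> float:
    """VRAM left over if every model's weights were loaded at once.

    Assumption: weight size approximates memory use. Real usage is higher.
    """
    return vram_gb - total_gb(sizes)


print(f"Disk: {total_gb(MODEL_SIZES_GB):.2f} GB")                    # Disk: 5.96 GB
print(f"Headroom on 8 GB: {headroom_gb(8, MODEL_SIZES_GB):.2f} GB")  # Headroom on 8 GB: 2.04 GB
```

Treat the headroom figure as an upper bound, not a measurement.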
Why Qwen2.5-Coder specifically? The 7B model leads the HumanEval benchmark among open-source 7B-class models as of early 2026 and supports a 32K context window, enough to fit most functions and their surrounding file. The 1.5B model is small enough to stay loaded in VRAM alongside the larger one without running out of memory. DeepSeek R1 and Llama are reasonable alternatives for chat; for autocomplete, keep the model small so latency stays under 400 ms.
Verify all three pulled correctly:
ollama list
# NAME SIZE MODIFIED
# qwen2.5-coder:7b 4.7 GB ...
# qwen2.5-coder:1.5b 986 MB ...
# nomic-embed-text:latest 274 MB ...
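The same check can be scripted against Ollama's /api/tags endpoint, which returns the installed models as JSON. A minimal sketch; the helper names are illustrative, and the ":latest" normalization is an assumption based on how the listing above displays untagged names:

```python
import json
import urllib.request

REQUIRED = ["qwen2.5-coder:7b", "qwen2.5-coder:1.5b", "nomic-embed-text"]


def installed_names(tags_json: dict) -> set[str]:
    """Model names from a parsed /api/tags response."""
    return {m["name"] for m in tags_json.get("models", [])}


def missing_models(required: list[str], installed: set[str]) -> list[str]:
    """Required models that are not installed.

    A name without a tag (e.g. "nomic-embed-text") is treated as ":latest".
    """
    def normalize(name: str) -> str:
        return name if ":" in name else f"{name}:latest"

    return [r for r in required if normalize(r) not in installed]


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the model list from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=3) as resp:
        return json.load(resp)
```

With the server running, `missing_models(REQUIRED, installed_names(fetch_tags()))` returns an empty list when all three pulls succeeded.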
Step 3: Install Continue.dev
In VS Code:
- Open the Extensions panel (Ctrl+Shift+X / Cmd+Shift+X)
- Search for Continue
- Install the extension; confirm the publisher is continue.dev before installing, because impersonator extensions exist
Continue is Apache 2.0 licensed, fully open source. It also works in JetBrains (IntelliJ, PyCharm, WebStorm, GoLand) — install from Settings → Plugins if you’re on JetBrains.
After install, a Continue icon appears in the Activity Bar. On first launch it opens a setup wizard. Close it — the manual config in the next step is cleaner than what the wizard produces.
Step 4: Configure config.yaml
Open the config file: Ctrl+Shift+P → Continue: Open Config File.
On Linux/macOS it’s at ~/.continue/config.yaml. On Windows: C:\Users\<username>\.continue\config.yaml.
Replace the contents with:
name: Local Coding Assistant
version: 0.0.1
schema: v1
models:
  - name: Qwen 2.5 Coder 7B
    provider: ollama
    model: qwen2.5-coder:7b
    roles:
      - chat
      - edit
      - apply
  - name: Qwen 2.5 Coder 1.5B
    provider: ollama
    model: qwen2.5-coder:1.5b
    roles:
      - autocomplete
    autocompleteOptions:
      debounceDelay: 300
      maxPromptTokens: 400
  - name: Nomic Embed
    provider: ollama
    model: nomic-embed-text
    roles:
      - embed
context:
  - provider: code
  - provider: diff
  - provider: terminal
  - provider: problems
  - provider: folder
  - provider: codebase
Save the file. Continue reloads config on save — no restart needed.
What each section does:
- The chat model handles the sidebar panel and answers questions. edit and apply let Continue rewrite files directly (used in agent mode).
- The autocomplete model generates inline completions. debounceDelay: 300 means it waits 300 ms after your last keypress before firing, which stops it from triggering on every character. maxPromptTokens: 400 caps how much surrounding context it sends, keeping latency predictable.
- The embed model runs during @Codebase indexing. It is not used during normal chat or autocomplete; it only activates when you trigger codebase search.
The context block tells Continue which @ providers are available in the chat panel. codebase enables @Codebase indexing with the embed model.
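The effect of debounceDelay is easiest to see on a timeline of keypresses. The sketch below illustrates the general debounce technique, not Continue's actual implementation; the function name and the sample timestamps are made up for the example:

```python
def completion_fire_times(keypress_ms: list[int], debounce_ms: int = 300) -> list[int]:
    """Times (ms) at which a debounced completion request would fire.

    A request fires debounce_ms after a keypress only if no later keypress
    arrives in the meantime. keypress_ms must be sorted ascending.
    """
    fires = []
    for i, t in enumerate(keypress_ms):
        is_last = i == len(keypress_ms) - 1
        if is_last or keypress_ms[i + 1] - t >= debounce_ms:
            fires.append(t + debounce_ms)
    return fires


# Five quick keystrokes, a pause, then two more
presses = [0, 80, 170, 250, 330, 900, 980]
print(completion_fire_times(presses))  # [630, 1280]
```

Seven keystrokes produce two model requests instead of seven, which is why a lower delay feels snappier but puts more load on the GPU.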
Verifying the setup works
Chat: Click the Continue icon in the sidebar. Ask something simple: “Write a Python function that reads a CSV file and returns the number of rows.” You should see a streaming response within 3–6 seconds on a GPU, 20–40 seconds on CPU.
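To put a number on chat speed outside the editor, you can call Ollama's /api/generate endpoint directly and read the timing fields from its reply. A minimal sketch with the standard library; the helper names are illustrative, and it assumes the non-streaming reply carries eval_count (generated tokens) and eval_duration (nanoseconds) as described in Ollama's API documentation:

```python
import json
import urllib.request


def tokens_per_second(result: dict) -> float:
    """Generation speed from a non-streaming /api/generate response.

    eval_count is the number of generated tokens; eval_duration is in nanoseconds.
    """
    return result["eval_count"] * 1e9 / result["eval_duration"]


def generate(prompt: str, model: str = "qwen2.5-coder:7b",
             base_url: str = "http://localhost:11434") -> dict:
    """Send one non-streaming request to Ollama and return the parsed JSON reply."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)
```

For example, `r = generate("Write a Python function that counts the rows in a CSV file.")` followed by `print(r["response"], tokens_per_second(r))` shows both the answer and the measured speed, which you can compare against the hardware table.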
Autocomplete: Open a code file. Start typing a function. After the debounce delay, greyed-out ghost text appears. Press Tab to accept, Escape to dismiss, Ctrl+Right / Cmd+Right to accept one word at a time.
Codebase indexing: Open a project folder in VS Code. In the chat panel type @Codebase what does the auth module do?. Continue indexes the folder on first use; expect 30–120 seconds for a medium project. After that, queries use the cached index.
If nothing happens after install, run ollama ps in a terminal to confirm the service is running. Check Continue’s output panel (Ctrl+Shift+P → “Continue: View Logs”) for connection errors.
This stack vs. the alternatives
| Setup | Monthly cost | Autocomplete latency (GPU) | Code leaves machine? | Offline? |
|---|---|---|---|---|
| Continue.dev + Ollama (this guide) | $0 | 150–400 ms | Never | Yes |
| GitHub Copilot | $19/user | 300–600 ms | Yes — GitHub servers | No |
| Continue.dev + Anthropic API | ~$5–20 | 400–800 ms | Yes — Anthropic servers | No |
| Cline + local Ollama | $0 | No autocomplete | Never | Yes |
| Cursor | $20/user | 200–500 ms | Yes — Cursor servers | No |
The privacy advantage is structural: with the config above, every model request goes to localhost:11434, so no code reaches an external server unless you later add a cloud provider to config.yaml. Compare that to commercial tools where disabling telemetry is a setting you have to actively find.
On model quality: the 7B Qwen model is genuinely capable for boilerplate, function completion, and explaining existing code. It struggles with complex cross-file refactoring and obscure library APIs. The gap narrows significantly when you use @Codebase — injecting the right context compensates for a smaller model more often than you’d expect.
For a deeper comparison of Continue.dev against Cline and Aider, see the 2026 coding agent shootout.
Tuning autocomplete for your hardware
The defaults work. These adjustments help if you find latency too high or suggestions triggering too eagerly:
RTX 3060 / 4060 (8 GB VRAM): baseline setup
autocompleteOptions:
  debounceDelay: 300
  maxPromptTokens: 400
RTX 4080 / 4090 or A-series (16–24 GB VRAM): faster response
autocompleteOptions:
  debounceDelay: 150
  maxPromptTokens: 600
CPU-only: disable autocomplete, use chat only
autocompleteOptions:
  disable: true
Disabling autocomplete on CPU is not a concession: the chat panel still works well for “explain this function”, “write a test for this”, and “what’s wrong with this logic” workflows. Autocomplete is the latency-sensitive feature; chat can tolerate 20–40 second responses.
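The three presets can be summarised as a lookup on available GPU memory. This is only a sketch of the rule of thumb in this section: the function name is hypothetical, and both the 16 GB cutoff and the handling of GPUs under 8 GB are assumptions, since the guide lists only 8 GB and 16–24 GB cards.

```python
from typing import Optional


def autocomplete_options(vram_gb: Optional[float]) -> dict:
    """Suggested autocompleteOptions for a given amount of GPU memory.

    vram_gb=None means CPU-only. Cards below 16 GB (including any under
    8 GB, an assumption) get the baseline preset.
    """
    if vram_gb is None:
        return {"disable": True}
    if vram_gb >= 16:
        return {"debounceDelay": 150, "maxPromptTokens": 600}
    return {"debounceDelay": 300, "maxPromptTokens": 400}


for label, vram in [("CPU-only", None), ("RTX 4060", 8), ("RTX 4090", 24)]:
    print(label, autocomplete_options(vram))
```

The returned keys match the autocompleteOptions fields used in config.yaml, so the values can be copied over by hand.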
When NOT to use this setup
CPU-only machine. Autocomplete is unusable; see above. Chat works but feels sluggish. If you regularly ask long questions, the wait between message and response is frustrating enough that most people end up back on a cloud API.
Teams where model consistency matters. Every developer on a local setup runs a different quantization, different Ollama version, different VRAM configuration. If code review quality or suggestion consistency across the team matters, a shared API endpoint is operationally simpler. Continue.dev supports pointing the apiBase field at a remote Ollama instance — one GPU server, many developers — if you want local models without per-machine setup.
Proprietary codebases with strict air-gap requirements and no on-site GPU. This setup keeps code local, but it does require model downloads from ollama.com and continue.dev at setup time. If your environment is fully air-gapped from day one, pre-download the model files and configure Ollama with a custom model path before disconnecting from the network.
You need GPT-4-class reasoning for complex architectural work. A local 7B model is not a GPT-4 replacement. It handles 80% of daily coding assistance well. For the remaining 20% — designing systems, debugging non-obvious concurrency bugs, understanding unfamiliar large codebases — a cloud model is meaningfully better. The right answer for most developers is a hybrid: local models for fast completions and private code, a cloud API for hard questions. Continue.dev supports both in the same config, switchable from a dropdown.
Sources
- Ollama v0.24.0 release notes — GitHub
- Continue.dev v1.2.22-vscode release — GitHub
- qwen2.5-coder:7b model page — Ollama Library
- qwen2.5-coder:1.5b model page — Ollama Library
- nomic-embed-text model page — Ollama Library
- Using Ollama with Continue — Continue Docs
- Best local AI coding models for Ollama 2026 — Local AI Master
- Ollama VRAM requirements guide — LocalLLM.in