May 21, 2026

Continue.dev + Ollama Setup Guide: Pair Programming Offline

By AIFoss · 10 min read

Tags: ollama, ai, selfhosted, llm, opensource

The setup takes about 15 minutes. After it, every keypress in VS Code can route through a model running on your own machine — no API key, no usage bill, no code leaving your network.

What you get at the end:

A chat panel in the editor sidebar backed by a local 7B coding model
Tab completion that triggers as you type, with sub-400 ms latency on an 8 GB GPU
@Codebase search that indexes your project and retrieves relevant context before answering

The two tools: Ollama handles model downloads and inference via a local REST API. Continue.dev is the IDE extension that connects your editor to that API. Versions tested: Ollama v0.24.0 (May 14, 2026) and Continue.dev v1.2.22-vscode (March 27, 2026).

Hardware reality check

CPU-only is possible for chat. It is not usable for autocomplete. A 7B model on a modern CPU generates 3–5 tokens per second, and a useful autocomplete suggestion needs 50+ tokens, which means waiting 10–17 seconds for every suggestion. That kills the flow completely.
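The wait quoted above is tokens divided by throughput. The sketch below is a back-of-the-envelope estimate only: it assumes a steady decode rate and ignores prompt processing time, so real waits would likely be somewhat longer. The function name is illustrative.

```python
def seconds_per_suggestion(tokens: int, tokens_per_second: float) -> float:
    """Time to generate one suggestion at a steady decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens / tokens_per_second


# A 50-token suggestion at the CPU rates quoted above (3 and 5 tok/s).
slow = seconds_per_suggestion(50, 3)
fast = seconds_per_suggestion(50, 5)
print(f"{fast:.1f}-{slow:.1f} s per suggestion")  # prints "10.0-16.7 s per suggestion"
```

Plugging in a GPU-class rate (for example 100 tok/s, an assumed figure) brings the same suggestion down to half a second, which is why autocomplete only becomes practical with a GPU.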

Setup	RAM	GPU	What’s usable
CPU-only	8 GB	None	Chat only, 3B models
CPU-only	16 GB	None	Chat only, 7B models at 3–5 tok/s
GPU (8 GB VRAM)	16 GB	RTX 3060 / 4060 or equivalent	Chat + autocomplete, 7B models
GPU (16 GB+ VRAM)	32 GB	RTX 3090 / 4080 or equivalent	Chat + autocomplete, 13B models

The sweet spot is an 8 GB GPU. qwen2.5-coder:7b at q4_K_M quantization needs ~4.7 GB of VRAM, leaving room for the OS and the smaller autocomplete model. An RTX 4060 hits this comfortably at the budget end of the market.
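A rough way to check whether a card fits both models is to subtract the model sizes from total VRAM. This sketch assumes the download sizes used in this guide (4.7 GB and 986 MB) approximate the VRAM each model occupies. The 1 GB reserve for KV cache and the desktop is a hypothetical allowance, not a measured value.

```python
def vram_headroom_gb(total_vram_gb: float, model_sizes_gb: list[float],
                     reserve_gb: float = 1.0) -> float:
    """Free VRAM left after loading the models and holding back a reserve.

    reserve_gb is an assumed allowance for KV cache and display output.
    """
    return total_vram_gb - sum(model_sizes_gb) - reserve_gb


# 8 GB card with the 7B chat model and the 1.5B autocomplete model loaded.
headroom = vram_headroom_gb(8.0, [4.7, 0.986])
print(f"{headroom:.2f} GB spare")  # prints "1.31 GB spare"
```

A negative result suggests the models will not both stay resident, in which case Ollama may swap them in and out or fall back to the CPU.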

If you want to test the full stack before committing to hardware, RunPod rents GPU instances by the hour — an A10 with 24 GB VRAM runs everything in this guide with headroom to spare.

Step 1: Install Ollama

Linux (one command):

curl -fsSL https://ollama.com/install.sh | sh

The installer detects NVIDIA CUDA, AMD ROCm, and Vulkan automatically. It registers a systemd service that starts Ollama on boot — no daemon management required after this.

macOS: Download the .dmg from ollama.com and drag it to Applications. Ollama runs as a menu bar app.

Windows: Download the .exe installer from ollama.com. It adds an icon to the system tray.

Verify Ollama is up:

ollama --version
# ollama version 0.24.0

Ollama listens on http://localhost:11434 by default. Continue.dev connects to this address automatically — you don’t touch the API directly.

For a full breakdown of what Ollama can do beyond this setup, see the Ollama 2026 review.

Step 2: Pull the three models

This setup uses separate models for chat, autocomplete, and embeddings. Using one model for everything is possible but causes queuing problems — when you’re mid-keystroke and autocomplete fires, you don’t want it competing with an ongoing chat request.

# Chat and inline editing — 4.7 GB
ollama pull qwen2.5-coder:7b

# Autocomplete — fast and small, 986 MB
ollama pull qwen2.5-coder:1.5b

# Codebase embeddings — 274 MB
ollama pull nomic-embed-text

Total storage: about 6 GB. Models land in ~/.ollama/models on Linux/macOS, C:\Users\<username>\.ollama\models on Windows.

Why Qwen2.5-Coder specifically? The 7B model leads the HumanEval benchmark among open-source 7B-class models as of early 2026 and supports a 32K context window, enough to fit most functions and their surrounding file. The 1.5B model is small enough to stay loaded in VRAM alongside the larger one without running out of memory. DeepSeek R1 and Llama are reasonable alternatives for chat; for autocomplete, keep the model small so latency stays under 400 ms.

Verify all three pulled correctly:

ollama list
# NAME                        SIZE    MODIFIED
# qwen2.5-coder:7b            4.7 GB  ...
# qwen2.5-coder:1.5b          986 MB  ...
# nomic-embed-text:latest     274 MB  ...
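The same check can be scripted against Ollama's REST API, which lists installed models at GET /api/tags. The sketch below uses only the standard library and assumes the default address http://localhost:11434. The helper names and the REQUIRED list are illustrative, not part of either tool.

```python
import json
import urllib.request

REQUIRED = ["qwen2.5-coder:7b", "qwen2.5-coder:1.5b", "nomic-embed-text"]


def normalize(name: str) -> str:
    """Ollama reports untagged models as name:latest, so add the tag if missing."""
    return name if ":" in name else f"{name}:latest"


def missing_models(installed: list[str], required: list[str]) -> list[str]:
    """Return the required models that are not in the installed list."""
    have = {normalize(n) for n in installed}
    return [r for r in required if normalize(r) not in have]


def installed_models(base_url: str = "http://localhost:11434") -> list[str]:
    """List model names from a running Ollama server via GET /api/tags."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        payload = json.load(resp)
    return [m["name"] for m in payload.get("models", [])]
```

With Ollama running, missing_models(installed_models(), REQUIRED) should return an empty list once all three pulls have finished. The normalize step is there because nomic-embed-text is pulled without a tag but listed as nomic-embed-text:latest.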

Step 3: Install Continue.dev

In VS Code:

Open the Extensions panel (Ctrl+Shift+X / Cmd+Shift+X)
Search for Continue
Install the extension — confirm the publisher is continue.dev before installing; impersonator extensions exist

Continue is Apache 2.0 licensed, fully open source. It also works in JetBrains (IntelliJ, PyCharm, WebStorm, GoLand) — install from Settings → Plugins if you’re on JetBrains.

After install, a Continue icon appears in the Activity Bar. On first launch it opens a setup wizard. Close it — the manual config in the next step is cleaner than what the wizard produces.

Step 4: Configure config.yaml

Open the config file: Ctrl+Shift+P → Continue: Open Config File.

On Linux/macOS it’s at ~/.continue/config.yaml. On Windows: C:\Users\<username>\.continue\config.yaml.

Replace the contents with:

name: Local Coding Assistant
version: 0.0.1
schema: v1

models:
  - name: Qwen 2.5 Coder 7B
    provider: ollama
    model: qwen2.5-coder:7b
    roles:
      - chat
      - edit
      - apply

  - name: Qwen 2.5 Coder 1.5B
    provider: ollama
    model: qwen2.5-coder:1.5b
    roles:
      - autocomplete
    autocompleteOptions:
      debounceDelay: 300
      maxPromptTokens: 400

  - name: Nomic Embed
    provider: ollama
    model: nomic-embed-text
    roles:
      - embed

context:
  - provider: code
  - provider: diff
  - provider: terminal
  - provider: problems
  - provider: folder
  - provider: codebase

Save the file. Continue reloads config on save — no restart needed.

What each section does:

The chat model handles the sidebar panel and answers questions. edit and apply let Continue rewrite files directly (used in agent mode).
The autocomplete model generates inline completions. debounceDelay: 300 means it waits 300 ms after your last keypress before firing — this stops it from triggering on every character. maxPromptTokens: 400 caps how much surrounding context it sends, keeping latency predictable.
The embed model runs during @Codebase indexing. It is not used during normal chat or autocomplete — it only activates when you trigger codebase search.
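To make the debounce behaviour concrete, here is a generic sketch of the idea: a request fires only when no further keypress arrives within the delay. This illustrates debouncing in general and is not Continue's actual implementation. It assumes keypress timestamps are given in milliseconds and in ascending order.

```python
def completions_fired(keypress_ms: list[int], debounce_ms: int = 300) -> list[int]:
    """Return the times (ms) at which a completion request would fire.

    A request fires debounce_ms after a keypress only if no other keypress
    arrives inside that window.
    """
    fired = []
    for i, t in enumerate(keypress_ms):
        is_last = i == len(keypress_ms) - 1
        if is_last or keypress_ms[i + 1] - t >= debounce_ms:
            fired.append(t + debounce_ms)
    return fired


# Five quick keystrokes, a pause, then two more.
print(completions_fired([0, 80, 160, 240, 320, 1000, 1100]))  # prints [620, 1400]
```

Seven keystrokes produce two requests instead of seven, which is the point of the setting: the small model is only asked for a completion when typing pauses.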

The context block tells Continue which @ providers are available in the chat panel. codebase enables @Codebase indexing with the embed model.
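A quick sanity check on the role assignments is to confirm that every role has a model behind it. The sketch below mirrors the models list from the config as plain Python data, so it needs no YAML parser. EXPECTED_ROLES is an assumption covering only the roles this guide uses.

```python
EXPECTED_ROLES = {"chat", "edit", "apply", "autocomplete", "embed"}

# Python mirror of the models list from the config.yaml in this guide.
MODELS = [
    {"name": "Qwen 2.5 Coder 7B", "model": "qwen2.5-coder:7b",
     "roles": ["chat", "edit", "apply"]},
    {"name": "Qwen 2.5 Coder 1.5B", "model": "qwen2.5-coder:1.5b",
     "roles": ["autocomplete"]},
    {"name": "Nomic Embed", "model": "nomic-embed-text", "roles": ["embed"]},
]


def role_map(models: list[dict]) -> dict[str, list[str]]:
    """Map each role to the model tags that claim it."""
    out: dict[str, list[str]] = {}
    for m in models:
        for role in m.get("roles", []):
            out.setdefault(role, []).append(m["model"])
    return out


def uncovered_roles(models: list[dict]) -> set[str]:
    """Roles in EXPECTED_ROLES that no model is assigned to."""
    return EXPECTED_ROLES - set(role_map(models))
```

For the config in this guide, uncovered_roles(MODELS) is empty. If the embed entry were deleted, it would return {"embed"}, which is the situation where @Codebase has no local embedding model to use.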

Verifying the setup works

Chat: Click the Continue icon in the sidebar. Ask something simple: “Write a Python function that reads a CSV file and returns the number of rows.” You should see a streaming response within 3–6 seconds on a GPU, 20–40 seconds on CPU.
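If the response feels slow, it can help to measure throughput outside the editor. The sketch below posts a non-streaming request to Ollama's /api/generate endpoint and derives tokens per second from the eval_count and eval_duration fields in the response (the duration is in nanoseconds). It assumes the default address and the chat model from this guide. The function names are illustrative.

```python
import json
import urllib.request


def tokens_per_second(result: dict) -> float:
    """Decode rate from Ollama's eval_count and eval_duration (nanoseconds)."""
    return result["eval_count"] / (result["eval_duration"] / 1e9)


def generate(prompt: str, model: str = "qwen2.5-coder:7b",
             base_url: str = "http://localhost:11434") -> dict:
    """POST a non-streaming request to /api/generate and return the JSON body."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(f"{base_url}/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)
```

With Ollama running, tokens_per_second(generate("Write a Python function that counts rows in a CSV file.")) gives a number to compare against the hardware table: single digits suggest the model is running on the CPU.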

Autocomplete: Open a code file. Start typing a function. After the debounce delay, greyed-out ghost text appears. Press Tab to accept, Escape to dismiss, Ctrl+Right / Cmd+Right to accept one word at a time.

Codebase indexing: Open a project folder in VS Code. In the chat panel type @Codebase what does the auth module do?. Continue indexes the folder on first use — expect 30–120 seconds for a medium project. After that, queries use the cached index.
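The retrieval step behind embedding-based search can be sketched as ranking stored chunks by cosine similarity to the query vector. This shows the general technique only; how Continue stores and queries its index is not covered here. The file names and three-dimensional vectors below are toy stand-ins for real embeddings from nomic-embed-text.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_chunks(query_vec: list[float], chunks: dict[str, list[float]],
               k: int = 2) -> list[str]:
    """Return the k chunk names whose vectors are most similar to the query."""
    ranked = sorted(chunks, key=lambda name: cosine(query_vec, chunks[name]),
                    reverse=True)
    return ranked[:k]


# Toy vectors standing in for real embeddings (hypothetical file names).
index = {
    "auth/login.py": [0.9, 0.1, 0.0],
    "auth/tokens.py": [0.8, 0.3, 0.1],
    "ui/button.tsx": [0.0, 0.2, 0.9],
}
print(top_chunks([1.0, 0.2, 0.0], index))  # prints ['auth/login.py', 'auth/tokens.py']
```

The top-ranked chunks are what get prepended to the chat prompt, which is why a question about the auth module pulls in auth files rather than UI code.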

If nothing happens after install, run ollama ps in a terminal to confirm the service is running. Check Continue’s output panel (Ctrl+Shift+P → “Continue: View Logs”) for connection errors.

This stack vs. the alternatives

Setup	Monthly cost	Autocomplete latency (GPU)	Code leaves machine?	Offline?
Continue.dev + Ollama (this guide)	$0	150–400 ms	Never	Yes
GitHub Copilot	$19/user	300–600 ms	Yes — GitHub servers	No
Continue.dev + Anthropic API	~$5–20	400–800 ms	Yes — Anthropic servers	No
Cline + local Ollama	$0	No autocomplete	Never	Yes
Cursor	$20/user	200–500 ms	Yes — Cursor servers	No

The privacy advantage is structural: every model in the config above uses the ollama provider on localhost, so no prompt or file leaves the machine unless you deliberately add a cloud provider or point apiBase at a remote host. Compare that to commercial tools where disabling telemetry is a setting you have to actively find.

On model quality: the 7B Qwen model is genuinely capable for boilerplate, function completion, and explaining existing code. It struggles with complex cross-file refactoring and obscure library APIs. The gap narrows significantly when you use @Codebase — injecting the right context compensates for a smaller model more often than you’d expect.

For a deeper comparison of Continue.dev against Cline and Aider, see the 2026 coding agent shootout.

Tuning autocomplete for your hardware

The defaults work. These adjustments help if you find latency too high or suggestions triggering too eagerly:

RTX 3060 / 4060 (8 GB VRAM) — baseline setup:

autocompleteOptions:
  debounceDelay: 300
  maxPromptTokens: 400

RTX 4080 / 4090 or A-series (16–24 GB VRAM) — faster response:

autocompleteOptions:
  debounceDelay: 150
  maxPromptTokens: 600

CPU-only — disable autocomplete, use chat only:

autocompleteOptions:
  disable: true

Disabling autocomplete on CPU is not a concession — the chat panel still works well for “explain this function”, “write a test for this”, and “what’s wrong with this logic” workflows. Autocomplete is the latency-sensitive feature; chat can tolerate 20–30 second responses.
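The three tiers above can be expressed as a small lookup. This sketch takes the values straight from this section. The 16 GB cutoff between the two GPU tiers is an assumption, since the text does not say where cards between 8 GB and 16 GB fall.

```python
from typing import Optional


def autocomplete_options(vram_gb: Optional[float]) -> dict:
    """Pick autocompleteOptions values from the tiers in this section.

    vram_gb=None means CPU-only. The 16 GB cutoff is an assumed boundary.
    """
    if vram_gb is None:
        return {"disable": True}
    if vram_gb >= 16:
        return {"debounceDelay": 150, "maxPromptTokens": 600}
    return {"debounceDelay": 300, "maxPromptTokens": 400}


print(autocomplete_options(8))  # prints {'debounceDelay': 300, 'maxPromptTokens': 400}
```

The returned dictionary maps one-to-one onto the autocompleteOptions keys in config.yaml, so the values can be copied across by hand.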

When NOT to use this setup

CPU-only machine. Autocomplete is unusable; see above. Chat works but feels sluggish. If you regularly ask long questions, the wait between message and response is frustrating enough that most people end up back on a cloud API.

Teams where model consistency matters. Developers on local setups tend to drift apart: different quantizations, different Ollama versions, different VRAM configurations. If code review quality or suggestion consistency across the team matters, a shared API endpoint is operationally simpler. Continue.dev supports pointing the apiBase field at a remote Ollama instance (one GPU server, many developers) if you want local models without per-machine setup.

Proprietary codebases with strict air-gap requirements and no on-site GPU. This setup keeps code local, but it does need network access at setup time to download the models from ollama.com and to install the Continue extension. If your environment is air-gapped from day one, download the installers and model files on a connected machine, copy them across, and point Ollama at the copied model directory with the OLLAMA_MODELS environment variable.

You need GPT-4-class reasoning for complex architectural work. A local 7B model is not a GPT-4 replacement. It handles 80% of daily coding assistance well. For the remaining 20% — designing systems, debugging non-obvious concurrency bugs, understanding unfamiliar large codebases — a cloud model is meaningfully better. The right answer for most developers is a hybrid: local models for fast completions and private code, a cloud API for hard questions. Continue.dev supports both in the same config, switchable from a dropdown.
