Jun 7, 2026

Ollama + Open WebUI + pgvector: Sovereign RAG Stack 2026

By AIFoss · 13 min read

Tags: ollama, open-webui, pgvector, rag, selfhosted, postgresql

TL;DR: Three services, one Docker Compose file, zero data leaving your machine. This guide connects Ollama for inference, Open WebUI for the chat frontend, and PostgreSQL with pgvector for document embeddings. The trade-off vs. the simpler two-container setup: more initial config, but persistent RAG data, multi-process safety, and one database to back up.

What you’ll have running after this guide:

Ollama 0.30.x serving LLMs locally (Qwen2.5, Llama 3.3, Mistral, or any model in the library)
Open WebUI 0.9.6 with knowledge bases backed by pgvector — all retrieval stays on-device
PostgreSQL 17 + pgvector 0.8.2 storing embeddings persistently, safe for multi-worker Open WebUI deployments

Honest take: If your documents are sensitive enough that they can’t touch OpenAI or Anthropic’s APIs, this stack is the right call. If you just want local chat with no document search, the basic two-container setup is simpler — stop reading here and follow the Ollama + Open WebUI Linux setup guide instead.

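All three tools are free to self-host. Ollama is MIT licensed and pgvector uses the PostgreSQL License (BSD-like). Open WebUI switched in 2025 to its own Open WebUI License, a BSD-3-Clause-based license with a branding clause, so check its current terms before rebranding or redistributing it. No usage limits, no call-home telemetry, no per-query fees.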

Why swap the default vector database?

Open WebUI ships with ChromaDB as its vector store. It works for a single user on a single machine. The problem shows up when:

You run Open WebUI with multiple uvicorn workers — ChromaDB’s PersistentClient uses SQLite under the hood, which isn’t fork-safe. Workers inherit the same database connection and corrupt each other’s state under concurrent writes.
You restart the container and lose RAG context because the Chroma data volume wasn’t correctly mounted.
You want a single backup to cover everything — chat history, user accounts, and document embeddings — instead of backing up Chroma separately.

Switching to pgvector fixes all three. The extension runs inside the same PostgreSQL instance that this stack uses for Open WebUI’s application database (Open WebUI defaults to SQLite otherwise). One database service, one backup, no separate vector store container.

For a deeper look at how pgvector compares to Qdrant and ChromaDB at scale, see the vector database comparison.

Hardware floor

Setup	RAM	GPU	What runs
Minimum (CPU only)	16 GB	None	7B Q4_K_M at 4–8 tok/s; RAG adds 3–5s retrieval
Comfortable	16 GB	RTX 3060 12GB	7B at 28–35 tok/s; 13B at 15–22 tok/s
Recommended	32 GB	RTX 4070 12GB	14B at 40–50 tok/s; 32B Q4 at 18–25 tok/s
Heavy RAG / 70B	64 GB	RTX 4090 24GB	70B Q4_K_M at 20–30 tok/s with fast embedding

The embedding model (nomic-embed-text, 274MB) runs alongside your inference model. On an 8GB VRAM card, both compete for VRAM and you’ll see the inference model partially offloaded to CPU. 12GB+ keeps both fully on-GPU.

CPU-only setups work — expect 10–30s per response instead of 1–3s. If you occasionally need GPU scale for large document batches, RunPod rents A5000s (24GB VRAM) for under $0.30/hr without a long-term commitment.

For hardware build recommendations to pair with this stack, see the GPU server guides on runaihome.com.

Architecture

┌──────────────────────────────────────────────┐
│              Docker bridge network           │
│                                              │
│  ┌──────────────┐    ┌────────────────────┐  │
│  │    Ollama    │◄───│    Open WebUI      │  │
│  │   :11434     │    │      :8080         │  │
│  └──────────────┘    └─────────┬──────────┘  │
│                                │              │
│                  ┌─────────────▼───────────┐  │
│                  │  PostgreSQL 17          │  │
│                  │  + pgvector 0.8.2       │  │
│                  │  :5432                  │  │
│                  └─────────────────────────┘  │
└──────────────────────────────────────────────┘

Open WebUI talks to Ollama for inference and to PostgreSQL for two things: its own application data (users, sessions, settings) and the RAG vector store (embeddings). PostgreSQL handles both roles — no separate Chroma service, no additional volume to manage.

Step 1: Write the Docker Compose file

mkdir ai-stack && cd ai-stack
nano compose.yaml

Paste the following:

services:
  postgres:
    image: pgvector/pgvector:pg17
    restart: unless-stopped
    environment:
      POSTGRES_DB: openwebui
      POSTGRES_USER: openwebui
      POSTGRES_PASSWORD: changeme_strong_password
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U openwebui -d openwebui"]
      interval: 10s
      timeout: 5s
      retries: 5

  ollama:
    image: ollama/ollama:latest
    restart: unless-stopped
    ports:
      - "127.0.0.1:11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    # Uncomment for NVIDIA GPU:
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: all
    #           capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    restart: unless-stopped
    ports:
      - "3000:8080"
    depends_on:
      postgres:
        condition: service_healthy
    environment:
      OLLAMA_BASE_URL: http://ollama:11434
      DATABASE_URL: postgresql://openwebui:changeme_strong_password@postgres:5432/openwebui
      PGVECTOR_DB_URL: postgresql://openwebui:changeme_strong_password@postgres:5432/openwebui
      VECTOR_DB: pgvector
      RAG_EMBEDDING_ENGINE: ollama
      RAG_EMBEDDING_MODEL: nomic-embed-text
    volumes:
      - open_webui_data:/app/backend/data

volumes:
  postgres_data:
  ollama_data:
  open_webui_data:

Three things worth calling out before you run it:

pgvector/pgvector:pg17 ships with the vector extension pre-installed. You don’t need to run CREATE EXTENSION vector manually — Open WebUI runs that migration on first boot.

Ollama is bound to 127.0.0.1:11434 — accessible to other containers on the Docker network but not exposed to your LAN. This matters: unauthenticated Ollama instances have shown up in security research repeatedly. If you need LAN access, use a reverse proxy with auth rather than exposing port 11434 directly. See the Ollama security guide for the full explanation.

Change changeme_strong_password in all three places it appears (POSTGRES_PASSWORD, DATABASE_URL, PGVECTOR_DB_URL) before running. Use the same value in all three.
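
Keeping one password in sync across three settings is easy to get wrong by hand. Here is a minimal Python sketch, assuming you generate the values once and paste them into compose.yaml. The helper names make_env and render are hypothetical, and a hex-only password is used so the value needs no escaping inside the connection URLs:

```python
import secrets


def make_env(password=None, user="openwebui", db="openwebui",
             host="postgres", port=5432):
    """Return the three password-bearing settings built from one secret."""
    # token_hex yields only 0-9a-f, so nothing needs escaping in the URL.
    password = password or secrets.token_hex(24)
    url = f"postgresql://{user}:{password}@{host}:{port}/{db}"
    return {
        "POSTGRES_PASSWORD": password,
        "DATABASE_URL": url,
        "PGVECTOR_DB_URL": url,
    }


def render(env):
    """Format the settings as YAML-style lines for pasting."""
    return "\n".join(f"{key}: {value}" for key, value in env.items())
```

Running print(render(make_env())) prints three lines. POSTGRES_PASSWORD goes under the postgres service and the two URLs go under open-webui.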

Step 2: Start the stack and pull models

docker compose up -d

Docker pulls the three images (roughly 2.5GB total on first run), then starts the services. After 30–60 seconds:

✔ Container ai-stack-postgres-1     Healthy
✔ Container ai-stack-ollama-1       Started
✔ Container ai-stack-open-webui-1   Started

The service_healthy condition in the compose file makes Open WebUI wait for PostgreSQL to accept connections before starting. If you skip the healthcheck and start all three simultaneously, you’ll see Open WebUI crash-loop for 15–20 seconds while Postgres initializes — not a real problem, but noisy.
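
The wait-until-healthy behaviour can be pictured as a bounded retry loop. The Python sketch below is an illustration only, not how Docker implements healthchecks. The names tcp_probe and wait_until_ready are hypothetical, and a plain TCP connect is a weaker test than pg_isready, which checks that the server actually accepts connections:

```python
import socket
import time


def tcp_probe(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until_ready(probe, retries=5, interval=10.0, sleep=time.sleep):
    """Call probe() up to `retries` times, sleeping between failures.

    Returns the attempt number that succeeded, or None if none did.
    """
    for attempt in range(1, retries + 1):
        if probe():
            return attempt
        if attempt < retries:
            sleep(interval)
    return None
```

With the values from the compose file (5 retries, 10s interval), a dependent service would give up after roughly 40 seconds of waiting. From the host you could call wait_until_ready(lambda: tcp_probe("127.0.0.1", 5432)), but only if you also publish port 5432, which the compose file above does not.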

Now pull the inference and embedding models:

# Inference model — swap for any model that fits your VRAM
docker exec ai-stack-ollama-1 ollama pull qwen2.5:7b

# Embedding model Open WebUI will use for RAG
docker exec ai-stack-ollama-1 ollama pull nomic-embed-text

Why nomic-embed-text? It’s 274MB, produces 768-dimensional vectors, and scores well on MTEB English retrieval benchmarks. mxbai-embed-large (670MB, 1024-dim) is a larger alternative, but it is also trained primarily on English; for multilingual documents, look at a multilingual embedding model such as bge-m3. For minimal footprint, all-minilm (46MB) works but recall quality drops noticeably.
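
The vector dimension matters because embeddings from different models are not comparable, so documents need re-embedding after a model switch. Below is a small Python sketch for checking the dimension a model actually returns, assuming Ollama is reachable on localhost:11434 as published in the compose file. It uses Ollama’s /api/embed endpoint; the function names are hypothetical:

```python
import json
import urllib.request


def build_embed_request(text, model="nomic-embed-text",
                        base_url="http://localhost:11434"):
    """Build (url, body) for Ollama's /api/embed endpoint."""
    body = json.dumps({"model": model, "input": text}).encode("utf-8")
    return f"{base_url}/api/embed", body


def embedding_dim(response):
    """Return the vector length from a parsed /api/embed response."""
    return len(response["embeddings"][0])


def fetch_dim(text="dimension check", **kwargs):
    """Send one request to a running Ollama and report the dimension."""
    url, body = build_embed_request(text, **kwargs)
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return embedding_dim(json.load(resp))
```

Calling fetch_dim() against the running stack should report the dimension for nomic-embed-text; pass model="..." to check another model before committing to it.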

Verify both are available:

docker exec ai-stack-ollama-1 ollama list

NAME                    ID              SIZE    MODIFIED
qwen2.5:7b             845dbda0ea48    4.7 GB  2 minutes ago
nomic-embed-text:latest 0a109f422b47    274 MB  1 minute ago
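
The same check can be scripted against Ollama’s HTTP API, which helps catch tag mismatches early. This is a sketch that assumes the GET /api/tags endpoint on the published localhost port; missing_models is a hypothetical helper, not part of Ollama:

```python
import json
import urllib.request


def fetch_tags(base_url="http://localhost:11434"):
    """Return the parsed JSON from Ollama's GET /api/tags."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return json.load(resp)


def missing_models(tags, required):
    """List required models that are absent from an /api/tags response.

    A bare name such as "nomic-embed-text" is treated as ":latest",
    which is how Ollama resolves untagged names.
    """
    installed = {m["name"] for m in tags.get("models", [])}
    missing = []
    for name in required:
        full = name if ":" in name else f"{name}:latest"
        if full not in installed:
            missing.append(name)
    return missing
```

For example, missing_models(fetch_tags(), ["qwen2.5:7b", "nomic-embed-text"]) returns an empty list once both pulls have finished.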

If you’re running with NVIDIA, confirm GPU offload is working:

docker exec ai-stack-ollama-1 ollama ps
# After loading a model, look for 100% GPU in the PROCESSOR column

Step 3: Configure Open WebUI

Navigate to http://localhost:3000. Create your admin account on first load (first account created automatically gets admin).

Go to Admin Panel → Settings → Documents. Confirm or set:

Setting	Value
Embedding Model Engine	Ollama
Embedding Model	`nomic-embed-text`
Vector DB	`pgvector`
Chunk Size	1000
Chunk Overlap	200

The VECTOR_DB: pgvector environment variable sets the backend automatically; the admin panel just displays the active setting. If the field shows chroma despite the env var, the container didn’t pick up the variable. A plain docker compose restart reuses the existing container and its old environment, so recreate the container instead:

docker compose up -d --force-recreate open-webui

Click Save in the Documents panel. A spinner appears briefly as Open WebUI verifies the embedding model is reachable via Ollama. If it reports that the embedding model can’t be found or reached, see the troubleshooting section.

Step 4: Upload a document and test RAG

Start a new chat. Click the + icon below the message input → Upload files. Upload a PDF or plain-text file (a technical report or documentation PDF works well as a test).

Open WebUI will:

Extract text from the document
Split it into chunks (size 1000, overlap 200; measured in characters with the default splitter, or in tokens if you select the token splitter)
Call nomic-embed-text on Ollama to generate embeddings
Store the vectors in a pgvector-backed table in PostgreSQL
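
The splitting step is the part most worth understanding. Below is a simplified Python sketch of a sliding window with overlap. It is an illustration only: Open WebUI’s real splitter also respects separators such as paragraphs and sentences, and chunk_text is a hypothetical name:

```python
def chunk_text(text, size=1000, overlap=200):
    """Split text into windows of `size` units that share `overlap` units."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end
        start += step
    return chunks
```

With size 4 and overlap 2, the string "abcdefghij" becomes "abcd", "cdef", "efgh", "ghij": each chunk repeats the last two characters of the previous one, which is what keeps a sentence that straddles a boundary retrievable.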

Ask something specific that’s in the document. A working RAG response shows a [RAG] badge next to the model name and includes source citations at the bottom.

To confirm pgvector is actually holding the embeddings (not silently falling back to Chroma):

docker exec -it ai-stack-postgres-1 \
  psql -U openwebui -d openwebui \
  -c "SELECT relname, n_live_tup FROM pg_stat_user_tables WHERE n_live_tup > 0;"

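You should see the vector table in the list (named document_chunk in recent Open WebUI releases; the exact name may vary by version). Ordinary application tables such as user and chat show up too, so look at the vector table specifically: a positive row count there means embeddings are stored in PostgreSQL, not in a local SQLite/Chroma file. n_live_tup is a statistics estimate, so it can lag briefly behind an upload.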

Step 5: Persistent knowledge bases

For documents you want available across all chats — not just one session — use knowledge bases:

Workspace → Knowledge → + New Knowledge

Name it, upload files, and attach it to a model in that model’s settings (Workspace → Models, Knowledge section). Every conversation with that model automatically queries the knowledge base before responding.

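Knowledge base embeddings live in pgvector alongside chat history and user data, so one pg_dump captures all three together. The original uploaded files sit in the open_webui_data volume, so include that volume in your backup routine if you need the source documents as well. The database dump looks like this: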

# Full backup
docker exec ai-stack-postgres-1 \
  pg_dump -U openwebui openwebui > backup_$(date +%Y%m%d).sql

# Restore to a new instance
cat backup_20260607.sql | \
  docker exec -i new-postgres-1 psql -U openwebui openwebui

This is one of the main reasons to use pgvector over ChromaDB: a single SQL dump captures the database side of the stack, embeddings included, with no separate vector store file to remember.
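
Dated dumps pile up quickly, so a retention rule is worth adding. The Python sketch below assumes the backup_YYYYMMDD.sql naming from the command above; backup_name and to_prune are hypothetical helpers, and the 7-file retention is an arbitrary default:

```python
from datetime import date


def backup_name(day):
    """Match the backup_YYYYMMDD.sql pattern used in the shell command."""
    return f"backup_{day:%Y%m%d}.sql"


def to_prune(filenames, keep=7):
    """Return the dump files to delete, keeping the `keep` newest.

    YYYYMMDD sorts chronologically as plain text, so no date parsing
    is needed.
    """
    dumps = sorted(f for f in filenames
                   if f.startswith("backup_") and f.endswith(".sql"))
    return dumps[:-keep] if keep > 0 else dumps
```

Feed it os.listdir() of your backup directory and delete what it returns. Files that don’t match the pattern are ignored rather than removed.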

What breaks and how to fix it

Open WebUI won’t start: “database connection refused”

PostgreSQL takes 10–15 seconds to become healthy even with the healthcheck in place, especially on first run when it’s initializing the data directory. Restart Open WebUI manually:

docker compose restart open-webui
docker compose logs open-webui --tail=20

Look for “Application startup complete” in the logs — that’s the signal it connected.

“Embedding model not reachable” in admin panel

The nomic-embed-text pull didn’t complete, or there’s a name mismatch. Check:

docker exec ai-stack-ollama-1 ollama list | grep nomic

If it’s missing, re-run the pull. If the tag differs from what you entered in the admin panel (e.g., the list shows nomic-embed-text:v1.5 but the panel says nomic-embed-text), update the admin panel field, or set RAG_EMBEDDING_MODEL=nomic-embed-text:v1.5 in the compose env and recreate the container with docker compose up -d (a plain restart does not reload compose.yaml).

RAG responses cite wrong sections / low accuracy

Chunk size is the most common culprit. The default chunk size of 1000 works for prose articles, but technical PDFs with tables and figures often need smaller chunks (400–600) with an overlap of 100–150. That overlap is smaller in absolute terms than the default 200 but larger relative to the chunk (roughly 25% instead of 20%). Adjust in Admin Panel → Settings → Documents, then click Re-index All Documents to regenerate embeddings with the new settings.

GPU not being used by Ollama

The GPU deploy block is commented out by default. For NVIDIA:

Install the NVIDIA Container Toolkit
Uncomment the deploy section in compose.yaml
Run docker compose up -d --force-recreate ollama

For AMD, use image: ollama/ollama:rocm and the equivalent ROCm device mapping.

pgvector extension missing error on startup

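If Open WebUI logs show extension "vector" does not exist, the pgvector/pgvector:pg17 image didn’t initialize correctly. Tear down and rebuild the postgres volume. Removing the volume deletes everything in the database, including users and chat history, so take a pg_dump first if the instance holds anything you need: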

docker compose down
docker volume rm ai-stack_postgres_data
docker compose up -d

The clean volume forces re-initialization. The pgvector/pgvector:pg17 image ships the extension files, and Open WebUI runs CREATE EXTENSION on first boot.

Understanding RAG quality in this stack

The retrieval pipeline in this setup: query → embed via nomic-embed-text → cosine similarity search in pgvector → top-k chunks → inject into prompt context → LLM response.
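
Stripped of the database, the retrieval step is a nearest-neighbour ranking. Below is a pure-Python sketch of that pipeline under toy assumptions (two-dimensional vectors, an in-memory list instead of a table). The function names are hypothetical. In pgvector the equivalent ordering uses the <=> operator, which returns cosine distance, i.e. 1 minus the similarity computed here:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunks, k=3):
    """chunks is a list of (text, vector); return the k best (score, text)."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_prompt(question, hits):
    """Inject the retrieved chunk texts ahead of the user's question."""
    context = "\n\n".join(text for _, text in hits)
    return f"Context:\n{context}\n\nQuestion: {question}"
```

The prompt template here is a placeholder; Open WebUI uses its own configurable RAG template. The point is only that the model never sees the whole document, just the k chunks that ranked highest.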

On an RTX 3060 with qwen2.5:7b and nomic-embed-text both on GPU: embedding a typical 10-page PDF takes under 8 seconds, and RAG-augmented responses arrive in 3–8s total after the retrieval step. On CPU-only 16GB RAM: expect 30–60s for embedding at upload time and 15–25s for RAG responses.

The quality ceiling is set by two things: your chunking strategy (smaller chunks, more precision; larger chunks, more context per result) and your inference model size. For production document chat where answer quality matters, step up to a 14B model — qwen2.5:14b is the current best-value 14B for instruction following with RAG context. For more on quantization trade-offs, see the GGUF quantization guide.

When this stack is the wrong choice

Single user, personal notes — the default Open WebUI two-container setup (no external Postgres) is simpler and works fine. pgvector adds complexity you don’t need if you’re not running multiple workers and your data isn’t sensitive.

Large document collections (50k+ chunks) — pgvector’s HNSW index handles millions of vectors efficiently, but at very large scale you’ll want index tuning controls that purpose-built vector databases offer. Qdrant has a richer index configuration API for this. The full comparison covers the break-even point.

Team deployment — pgvector handles the vector store, but a team setup also needs Nginx + SSL termination, LDAP or SSO auth, and multi-user role-based access control. Open WebUI supports all of that, but the config overhead is significant. If you need multi-user team access, see the LibreChat setup guide for an alternative that’s designed for team use from the start.

Windows hosts — GPU passthrough for Ollama through Docker Desktop on Windows is unreliable. The better path on Windows is native Ollama install (which handles GPU directly) plus Open WebUI via Docker Desktop. The pgvector container still works on Docker Desktop; the pain point is specifically GPU passthrough to the Ollama container.

FAQ

Can I add a reranker on top of pgvector?

Yes. Open WebUI supports cross-encoder reranking as a second pass after pgvector returns the top-k results. Set it in Admin Panel → Settings → Documents → Reranking Model (the field may only appear once hybrid search is enabled). A reranker like ms-marco-MiniLM-L-6-v2 is a cross-encoder that Open WebUI loads locally itself rather than through Ollama, so it still runs on your machine. The pipeline becomes: pgvector similarity retrieval → reranker scoring → final ranked chunks → LLM. Expect noticeably better answer quality on ambiguous queries, at the cost of ~1–2s added latency.
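
The two-stage idea is easy to see in miniature. In this Python sketch the second-stage scorer is a toy keyword-overlap function standing in for a real cross-encoder; both function names are hypothetical, and a real reranker would score each (query, chunk) pair with a model instead:

```python
def keyword_overlap(query, text):
    """Toy relevance score: fraction of query words found in the text."""
    q_words = set(query.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words) / len(q_words) if q_words else 0.0


def rerank(query, candidates, score_fn=keyword_overlap, top_n=3):
    """Re-order first-stage candidates by a second, more precise score."""
    scored = sorted(candidates, key=lambda text: score_fn(query, text),
                    reverse=True)
    return scored[:top_n]
```

The first stage (vector search) is cheap and casts a wide net; the second stage is slower per item but only runs on the handful of candidates that survived, which is where the extra latency comes from.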

Can I migrate from ChromaDB to pgvector without losing documents?

Not automatically. Changing VECTOR_DB from chroma to pgvector abandons the ChromaDB embeddings — they stay in the Chroma volume but Open WebUI won’t read them. You’ll need to re-upload or re-index documents after the switch. Back up your Chroma volume before changing the env var, then re-index in the new backend.

What’s the right chunk size for different document types?

Prose articles: 800–1200 tokens, 150–200 overlap. Technical documentation with code: 400–600 tokens, 100 overlap (keeps code examples intact). Legal documents: 600–800 tokens, 200 overlap (longer overlap preserves clause context across chunks). Start with the defaults (1000/200) and adjust based on whether answers feel too narrow (increase chunk size) or too scattered (decrease it).
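
To see what a setting change costs in chunk count (and therefore embedding time), a quick estimate helps. The Python sketch below assumes a plain sliding window. The presets are midpoints of the ranges above, and the names PRESETS, estimate_chunks, and overlap_ratio are illustrative:

```python
import math

# Midpoints of the ranges above; the preset names are illustrative.
PRESETS = {
    "prose": (1000, 200),
    "technical": (500, 100),
    "legal": (700, 200),
}


def estimate_chunks(length, size, overlap):
    """Chunks produced by a sliding window over `length` units of text."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if length <= 0:
        return 0
    return max(1, math.ceil((length - overlap) / (size - overlap)))


def overlap_ratio(size, overlap):
    """Share of each chunk that repeats the previous chunk."""
    return overlap / size
```

For a 10,000-unit document, the prose preset yields 13 chunks and the technical preset 25, so halving the chunk size roughly doubles the number of embeddings to compute and store.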

Does pgvector 0.8.2 fix anything I should care about?

CVE-2026-3172 is a buffer overflow in parallel HNSW index builds. Whether an index build runs in parallel is decided by PostgreSQL (max_parallel_maintenance_workers and table size), not by the number of Open WebUI workers, so treat the practical risk for this stack as low but not zero. Upgrade anyway: pgvector/pgvector:pg17 ships 0.8.2 and the upgrade is free.

How do I add web search to augment RAG?

Open WebUI has native web search integration (SearXNG, Brave, Google PSE, etc.) that runs alongside document RAG. Set it up in Admin Panel → Settings → Web Search. Queries can pull from both your documents and live web results simultaneously. The Open WebUI Pipelines guide covers the full configuration.

Recommended Gear

RTX 3060 12GB — minimum GPU for comfortable 7B + embedding inference
RTX 4070 12GB — recommended for 14B models at usable speeds