Self-Hosted AI for Privacy: The Minimum Viable Stack

Every prompt you type into ChatGPT, Claude, or Gemini gets logged. By default, ChatGPT uses those conversations to improve its models (opt-out exists but requires a manual step); Gemini retains conversations for 18 months by default with human reviewer access; and paying $20/month for a consumer plan doesn’t change any of that. A 2026 federal court ruling in United States v. Heppner established that AI conversations carry no legal confidentiality protection.

If you’re doing anything sensitive — client work, proprietary code, medical information, legal analysis — the risk isn’t hypothetical. The data leaves your machine, crosses someone else’s network, and sits on infrastructure you have no control over.

This article covers the minimum viable stack to prevent that. Not the maximum privacy stack — not Qubes OS plus hardware-level isolation plus airgapped hardware — but the practical floor that an individual developer or small team can actually run on commodity hardware, today.

What the cloud AI services actually send

The risk isn’t limited to the text of your message. Most cloud AI chat interfaces also transmit:

  • System prompts and conversation context — your entire chat history for the session, plus any injected context from the application
  • Uploaded files — documents, PDFs, and images you attach for analysis
  • Clipboard pastes — code you paste is often sent verbatim with minimal sanitization
  • Browser extension context — if you use an AI browser extension, it may send the active tab’s content

Even companies with good intentions get breached. Past infrastructure bugs at OpenAI have exposed conversation metadata to other users — and when that happens, you can’t selectively revoke data that was already logged. The safest data is data that never traveled.

What “minimum viable” means here

A usable privacy stack needs to cover four categories of AI use that people actually reach for cloud services to handle:

  1. Conversational LLM — the ChatGPT-replacement use case: ask questions, analyze documents, write and review code
  2. Document RAG — feeding your own files (contracts, codebases, notes) to a model and querying them
  3. Web-augmented search — letting the model pull current information without sending your query to Google
  4. Transcription — converting voice memos and meeting recordings to text

Every layer below addresses one of those categories with an open-source, self-hosted tool that keeps data on your machine. Apart from web search, which has to reach external search engines by definition, nothing in this stack requires an internet connection to function after initial setup.

Hardware tiers

The right hardware expectation depends on model size, which drives everything else.

Tier | Hardware | Practical model range | Use case
CPU-only | 16GB RAM, modern CPU | 3B–7B (Q4) | Light daily use, code assistance
Entry GPU | 8GB VRAM (RTX 3060 8GB/4060) | 7B–8B (Q4_K_M) | Comfortable daily driver
Mid GPU | 16GB VRAM (RTX 4060 Ti 16GB/4070 Ti Super) | 13B–14B (Q4_K_M) | Faster inference, longer context
High-end | 24GB VRAM (RTX 3090/4090) | 32B (Q4_K_M) | Near-frontier quality locally

A 7B model at Q4_K_M quantization needs roughly 4–6GB VRAM. A 13B model needs 8–10GB. If you’re on CPU only, stick to 7B or smaller — inference is slower but viable for non-interactive tasks like document analysis. For GPU sizing advice, runaihome.com covers current GPU options for local AI in more depth.
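
The arithmetic behind those figures can be sketched in a few lines. This is a rough estimate, not a measurement: the 4.85 bits per weight is an approximation for Q4_K_M, and the 1GB overhead is an assumed allowance for the KV cache and runtime buffers at short context lengths. Longer contexts need more.

```python
def estimate_vram_gb(params_billion, bits_per_weight=4.85, overhead_gb=1.0):
    """Rough memory estimate for a quantized model.

    bits_per_weight=4.85 approximates Q4_K_M (assumption);
    overhead_gb is an assumed allowance for KV cache and buffers.
    """
    # Weights: parameter count times bits per weight, converted to gigabytes
    weights_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9
    return weights_gb + overhead_gb


for size in (3, 7, 13, 32):
    print(f"{size}B: ~{estimate_vram_gb(size):.1f} GB")
```

If the estimate lands within a gigabyte or so of your card's VRAM, expect Ollama to offload some layers to the CPU, which works but slows generation.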

If you need frontier-quality output (Llama 3.3 70B territory) but don’t have the hardware, RunPod rents GPU instances where you maintain full control of the API endpoint — the data still doesn’t route through OpenAI or Anthropic infrastructure.

Layer 1: LLM inference — Ollama

Version tested: v0.24.0 (May 14, 2026) | License: MIT

Ollama is the standard choice for running quantized models locally. It exposes an OpenAI-compatible REST API on localhost:11434, manages model downloads, and handles GPU/CPU offloading automatically.

Install on Linux:

curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.2:3b       # CPU-only machines
ollama pull qwen2.5-coder:7b  # 8GB VRAM
ollama pull qwen2.5:14b       # 16GB VRAM (Llama 3.1 has no 14B variant)
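
Once a model is pulled, anything on the machine can talk to it over the local API. Here is a minimal sketch using only the Python standard library and Ollama's native /api/generate endpoint. The function names and the default model tag are illustrative choices, not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"


def build_request(prompt, model="llama3.2:3b"):
    """Build a non-streaming request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, model="llama3.2:3b", timeout=120):
    """Send the prompt to the local server and return the generated text."""
    request = build_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]

# Example: print(ask("Explain what a reverse proxy does in one sentence."))
```

The request goes to 127.0.0.1 and nowhere else, which is the whole point: the prompt never crosses a network interface.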

The critical privacy configuration step is binding Ollama to localhost only — it binds there by default, but some Docker guides expose it on 0.0.0.0 for container networking. If you’re not using Docker, confirm the default:

# Check what address Ollama is actually listening on
ss -tlnp | grep 11434

If you see 0.0.0.0:11434 on a machine that has a public IP, you’ve accidentally exposed an unauthenticated LLM API to the internet. Set OLLAMA_HOST=127.0.0.1 in your environment and restart.

For a detailed walkthrough of Ollama’s full feature set, see the Ollama 2026 review. For understanding the Q4_K_M vs Q5_K_M vs Q8_0 quantization trade-offs, see the GGUF quantization guide.

Layer 2: Chat frontend — Open WebUI

Version tested: v0.9.5 (May 10, 2026) | License: Open WebUI License

Open WebUI connects to Ollama’s API and gives you a full ChatGPT-style interface: conversation history, model switching, file uploads, image generation, and a native RAG pipeline. The entire application runs in Docker and stores its database locally (SQLite by default).

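docker run -d \
  --network=host \
  -e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
  -e PORT=3000 \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

With --network=host, the container shares the host's network stack, so it can reach Ollama's localhost port directly; OLLAMA_BASE_URL points it there explicitly. Host networking ignores Docker's -p port mappings, and Open WebUI would otherwise listen on its internal default of 8080, which collides with SearXNG below, so PORT=3000 moves it. The interface is then available at http://localhost:3000. Because the app now listens on the host's own interfaces, check it with ss the same way you checked Ollama, and firewall port 3000 on any machine with a public IP. For remote access, use a reverse proxy rather than opening the port (more on that in the network security section).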

Open WebUI v0.9.5 includes a native document RAG pipeline. Upload PDFs, text files, or paste URLs, and the model can query them directly. For simple use cases, this replaces the need for a separate RAG tool. If you need multi-user RAG with more control over the vector store and embedding models, see the AnythingLLM review for a capability overview and the AnythingLLM RAG setup guide for the full walkthrough.
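
If RAG is new to you, the retrieval step is simpler than it sounds. The toy sketch below is not Open WebUI's pipeline: it uses plain term-frequency vectors as a stand-in for a real embedding model, and every function name is hypothetical. The shape is the same, though: split documents into chunks, score each chunk against the question, and paste the best matches into the prompt.

```python
import math
import re
from collections import Counter


def chunk(text, size=40):
    """Split text into chunks of roughly `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def vectorize(text):
    """Term-frequency vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = vectorize(question)
    return sorted(chunks, key=lambda c: cosine(q, vectorize(c)), reverse=True)[:k]


def build_prompt(question, chunks):
    """Assemble a prompt that confines the model to the retrieved context."""
    context = "\n---\n".join(retrieve(question, chunks))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

A real pipeline swaps vectorize for an embedding model and the sorted call for a vector store lookup, but all of it can run on the same machine as the LLM.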

For the full Open WebUI feature breakdown, the Open WebUI 2026 review covers install options, RBAC, and limitations.

Layer 3: Web-augmented search — SearXNG

License: AGPL-3.0 | Releases: rolling (date-based, check searxng/searxng releases)

When users ask about current events or anything newer than a model’s training cutoff, the usual answer is to send the query to a web search API. The privacy-preserving version is SearXNG: a self-hosted metasearch engine that aggregates results from Google, Bing, DuckDuckGo, and 70+ other sources, strips tracking parameters, and returns clean results, all from your own machine.

Docker Compose setup:

services:
  searxng:
    image: searxng/searxng:latest
    ports:
      - "127.0.0.1:8080:8080"
    volumes:
      - ./searxng:/etc/searxng:rw
    environment:
      - SEARXNG_BASE_URL=http://localhost:8080/
    restart: unless-stopped

Binding to 127.0.0.1:8080 instead of 0.0.0.0:8080 keeps the instance off the network. Open WebUI can then call the SearXNG API directly over localhost for web-augmented chat responses.
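
To see what Open WebUI gets back, you can query the instance yourself. This sketch assumes the JSON output format has been enabled under search.formats in SearXNG's settings.yml, since a default install may only serve HTML and reject format=json. The helper names are illustrative.

```python
import json
import urllib.parse
import urllib.request

SEARXNG_URL = "http://127.0.0.1:8080/search"


def build_search_url(query, base=SEARXNG_URL):
    """Build a SearXNG query URL that requests JSON output."""
    return f"{base}?{urllib.parse.urlencode({'q': query, 'format': 'json'})}"


def top_results(payload, limit=5):
    """Reduce a SearXNG JSON response to (title, url) pairs."""
    results = payload.get("results", [])[:limit]
    return [(r.get("title", ""), r.get("url", "")) for r in results]


def search(query, limit=5, timeout=30):
    """Query the local instance and return the top results."""
    with urllib.request.urlopen(build_search_url(query), timeout=timeout) as resp:
        return top_results(json.load(resp), limit)

# Example: for title, url in search("ollama release notes"): print(title, url)
```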

The privacy model here: your queries still reach external search engines, and those engines still see your IP address, but the requests carry no account association and no persistent cookies from the search provider’s side. Your searches are not tied to a logged-in profile.

Layer 4: Transcription — faster-whisper

Version tested: v1.2.1 (October 2025) | License: MIT

faster-whisper is a reimplementation of OpenAI’s Whisper model using the CTranslate2 inference engine. It’s 2–4× faster than the original on the same hardware and supports INT8 quantization on CPU, making it practical without a GPU for the small and medium model sizes.

Install and transcribe:

pip install faster-whisper

python3 - <<'EOF'
from faster_whisper import WhisperModel

# "small" runs on CPU; use "large-v3" if you have a GPU
model = WhisperModel("small", device="cpu", compute_type="int8")
segments, info = model.transcribe("meeting.mp3", beam_size=5)

for segment in segments:
    print(f"[{segment.start:.1f}s] {segment.text.strip()}")
EOF

The small model (244M parameters) runs comfortably on any modern CPU in roughly real-time for typical speech. The large-v3 model is significantly more accurate for accented speech and technical vocabulary but needs a GPU for reasonable speed. See the faster-whisper vs Whisper.cpp vs WhisperX comparison for a full benchmark breakdown.
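
Printed timestamps are fine for skimming, but meeting recordings are often more useful as subtitle files. Here is a small sketch that renders segments as SRT. It relies only on the start, end, and text fields that faster-whisper segments expose; the Segment namedtuple is a stand-in so the helper can be tried without loading a model.

```python
from collections import namedtuple


def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Render objects with .start, .end and .text as SRT subtitle text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        span = f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}"
        blocks.append(f"{index}\n{span}\n{seg.text.strip()}\n")
    return "\n".join(blocks)


# Stand-in for faster-whisper segments, which carry the same three fields
Segment = namedtuple("Segment", "start end text")
```

Passing the segments generator from model.transcribe to to_srt and writing the result to a .srt file keeps the whole workflow on the local disk.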

For transcription, the privacy implications are straightforward: meeting recordings often contain confidential business information, medical discussions, or legal conversations. Any cloud transcription service — even one with a DPA — routes audio through servers you don’t own. On-device transcription with faster-whisper means the audio never leaves your machine.

Network security: the part most guides skip

Running all four layers locally doesn’t automatically make the stack private. There are two common failure modes:

Exposed ports. Ollama defaults to localhost:11434 — but Docker Compose can inadvertently override that. If you run Ollama inside Docker and map -p 11434:11434 without binding to 127.0.0.1, the port is accessible from any IP that can reach your machine. Same applies to Open WebUI on port 3000 and SearXNG on port 8080. Always bind to 127.0.0.1 in Docker port mappings:

ports:
  - "127.0.0.1:3000:8080"  # Open WebUI: localhost only

Unprotected reverse proxy. If you need to access your stack from outside your local network (remote work, a home server you SSH into), use a reverse proxy with authentication rather than exposing ports directly. Nginx or Traefik with Let’s Encrypt TLS and HTTP Basic Auth is the minimum:

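server {
    listen 443 ssl;
    server_name ai.yourdomain.com;

    # Certificate paths assume certbot's default layout; adjust to yours
    ssl_certificate     /etc/letsencrypt/live/ai.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/ai.yourdomain.com/privkey.pem;

    auth_basic "Private";
    auth_basic_user_file /etc/nginx/.htpasswd;

    location / {
        proxy_pass http://localhost:3000;
        proxy_buffering off;           # required for LLM token streaming
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header Upgrade $http_upgrade;   # Open WebUI uses WebSockets
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 300s;
    }
}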

Nginx’s default proxy buffering breaks LLM token streaming. The proxy_buffering off directive is not optional if you want responses to appear incrementally rather than in one block once generation has finished.
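
It helps to see what a token stream looks like on the wire. Ollama's streaming API sends newline-delimited JSON, one small object per fragment, with done set to true on the last one. The sketch below consumes that format. Open WebUI's browser protocol is different, so treat this as an illustration of the pattern rather than of what passes through the proxy above.

```python
import json


def iter_tokens(lines):
    """Yield text fragments from Ollama's newline-delimited JSON stream."""
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        raw = raw.strip()
        if not raw:
            continue
        event = json.loads(raw)
        if event.get("response"):
            yield event["response"]
        if event.get("done"):
            # The final object carries stats rather than text
            break

# Example: iterate an HTTP response opened with "stream": true in the request
# body, printing each fragment with print(token, end="", flush=True).
```

Each line is only useful at the moment it arrives. A buffering proxy collects those lines and releases them together, which is why the reply appears to hang and then lands all at once.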

When this stack won’t be enough

Self-hosted AI eliminates cloud vendor data exposure, but it doesn’t solve every privacy concern:

  • Your network traffic is still visible to your ISP. Model downloads (multi-gigabyte GGUF files from Hugging Face or Ollama’s registry) are logged as traffic. A determined adversary with ISP-level visibility can infer what you’re running.
  • Logs on the host machine. Ollama, Open WebUI, and SearXNG all write logs locally. If someone has physical or root access to your machine, those logs are readable. Configure log rotation and consider encrypting the data directory.
  • The model itself may be compromised. Quantized models distributed via Hugging Face are not code-signed in any meaningful way. Verify checksums when downloading from the official model cards.
  • Multi-user deployments need proper auth. Open WebUI’s RBAC is solid for small teams, but it’s not a substitute for network-level access controls if multiple people share the same instance.
  • Multimodal input. If you enable Open WebUI’s image generation integration (Stable Diffusion via ComfyUI or Automatic1111), verify those services are also bound to localhost.
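
For the checksum point above, here is a minimal sketch of verifying a downloaded model file against a published SHA-256 value. The function names are illustrative, and the expected hash has to come from a source you trust, such as the model card.

```python
import hashlib
import hmac


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so multi-gigabyte models need not fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Compare the file's SHA-256 with the published value."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())

# Example: verify("model.gguf", "<hash copied from the model card>")
```

A matching checksum only proves the file is the one the publisher uploaded. It says nothing about whether that publisher is trustworthy, so prefer official model repositories.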

If you’re operating under HIPAA, GDPR, or CCPA requirements for patient or client data, self-hosting is necessary but not sufficient — you also need data residency controls, audit logging, and a documented retention policy. The stack above gives you the technical foundation; compliance is a separate layer on top.

Stack summary

Component | Role | License | Hardware floor
Ollama v0.24.0 | LLM inference, model management | MIT | 8GB RAM (CPU); 4GB VRAM (GPU)
Open WebUI v0.9.5 | Chat UI, RAG, document Q&A | Open WebUI License | Any (runs in browser)
SearXNG (rolling) | Local metasearch, web-augmented responses | AGPL-3.0 | ~256MB RAM
faster-whisper v1.2.1 | Audio transcription | MIT | Modern CPU (small model)
Nginx / Traefik | TLS, authentication, reverse proxy | BSD / MIT | Negligible

The full stack runs on a machine with 16GB RAM and no GPU. A GPU (8GB VRAM minimum) makes the LLM layer meaningfully faster but isn’t a requirement for most use cases.

This is the practical floor. Everything above it is optional hardening — network-level isolation, full-disk encryption, VPN tunneling — that you add based on your specific threat model. Start here, verify each service is bound to localhost, and you’ve already eliminated the most common vector: accidental data exposure via a third-party API call you didn’t know your toolchain was making.
