Jun 4, 2026

vLLM Production Setup 2026: Nginx, Auth, Multiple Models

By AIFoss · 10 min read

TL;DR: This guide turns a single-machine vLLM install into a team-facing API with authentication, nginx routing, and multiple models on separate ports. All of it runs on-prem, costs nothing beyond hardware, and is wire-compatible with the OpenAI Python SDK. The setup takes about 30 minutes and requires familiarity with Docker and nginx.

What you’ll have running after this guide:

A Docker Compose stack with vLLM v0.22.0 behind nginx, API key auth, and Prometheus monitoring
Two models served on separate ports and unified under a single nginx endpoint
An OpenAI Python client pointed at your local server, with no code changes beyond base_url and api_key

Honest take: Skip this guide if you’re running vLLM for personal use — the basic setup guide gets you serving in 10 minutes. Come back here when more than one person on your team needs API access or you need to run multiple models from one machine.

vLLM v0.22.0, Apache 2.0 license, released May 29, 2026.

What This Adds Over the Basic Setup

The basic vLLM setup gives you a vllm serve process on localhost with no auth and no process manager. That’s fine for local experimentation.

A team-facing API needs more:

Feature	Basic Setup	Production Setup (this guide)
Authentication	None — open to anyone on the network	Bearer token via `--api-key` flag
Process management	Manual (`vllm serve` in terminal)	Docker Compose with restart policies
Multiple models	One terminal per model	Separate containers, nginx-routed
SSL termination	No	nginx (cert drop-in ready)
Monitoring	None	Prometheus `/metrics` on same port
Model swapping	Kill and restart	`docker compose up -d --no-deps`

Prerequisites

Docker Engine ≥ 23.0 with Compose V2 (docker compose, not docker-compose)
NVIDIA Container Toolkit ≥ 1.14
NVIDIA driver ≥ 525, CUDA ≥ 12.1
At least 24 GB VRAM per model in FP16, with one GPU per model: the Compose file below pins the two models to GPU 0 and GPU 1. On a dual-GPU board, RTX 4090s (24 GB each) handle two 7B–8B models in INT4 simultaneously
A Hugging Face account with model access if you’re serving gated models (Llama 3)

If you don’t own the hardware yet, RunPod A100 or H100 instances are the fastest way to validate this setup before committing to a purchase. The configuration below runs unmodified on RunPod.

1. Project Layout

Create the working directory:

mkdir vllm-prod && cd vllm-prod
mkdir -p nginx/conf.d monitoring

Final structure:

vllm-prod/
├── docker-compose.yml
├── .env
├── nginx/
│   └── conf.d/
│       └── vllm.conf
└── monitoring/
    └── prometheus.yml

2. Docker Compose Stack

Create docker-compose.yml:

services:
  vllm-primary:
    image: vllm/vllm-openai:v0.22.0
    runtime: nvidia
    restart: unless-stopped
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
      - VLLM_API_KEY=${API_KEY}
    command: >
      --model meta-llama/Llama-3.1-8B-Instruct
      --api-key ${API_KEY}
      --host 0.0.0.0
      --port 8000
      --gpu-memory-utilization 0.85
      --max-model-len 32768
    volumes:
      - hf-cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
    networks:
      - vllm-net

  vllm-secondary:
    image: vllm/vllm-openai:v0.22.0
    runtime: nvidia
    restart: unless-stopped
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
      - VLLM_API_KEY=${API_KEY}
    command: >
      --model mistralai/Mistral-7B-Instruct-v0.3
      --api-key ${API_KEY}
      --host 0.0.0.0
      --port 8001
      --gpu-memory-utilization 0.85
      --max-model-len 32768
    volumes:
      - hf-cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]
    networks:
      - vllm-net

  nginx:
    image: nginx:1.27-alpine
    restart: unless-stopped
    ports:
      - "80:80"
    volumes:
      - ./nginx/conf.d:/etc/nginx/conf.d:ro
    depends_on:
      - vllm-primary
      - vllm-secondary
    networks:
      - vllm-net

  prometheus:
    image: prom/prometheus:v3.0.0
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
    networks:
      - vllm-net

volumes:
  hf-cache:

networks:
  vllm-net:
    driver: bridge

Note that vllm-primary and vllm-secondary expose no ports directly to the host — all external traffic enters through nginx on port 80. The vLLM containers are reachable only from within the vllm-net Docker network.

Create .env in the same directory:

HF_TOKEN=hf_yourtoken123
API_KEY=change-this-to-a-long-random-string

The API_KEY value is what clients send as Authorization: Bearer <key>. vLLM validates it natively via the --api-key flag — no custom middleware needed.
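
One way to produce a suitably long random value for API_KEY is Python's standard secrets module. This is a minimal sketch, not part of vLLM; the function name generate_api_key and the 32-byte length are assumptions chosen for the example.

```python
import secrets


def generate_api_key(num_bytes: int = 32) -> str:
    """Return a URL-safe random token to paste into the API_KEY line of .env."""
    # token_urlsafe draws from the OS cryptographic random source.
    return secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    # Prints a line in the same KEY=value form that .env uses.
    print(f"API_KEY={generate_api_key()}")
```

The token only contains letters, digits, "-" and "_", so it needs no quoting in .env or in an Authorization header.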

3. Nginx Config

Create nginx/conf.d/vllm.conf:

upstream llama {
    server vllm-primary:8000;
}

upstream mistral {
    server vllm-secondary:8001;
}

server {
    listen 80;
    server_name _;

    # /llama/* → Llama 3.1 8B on port 8000
    location /llama/ {
        rewrite ^/llama/(.*) /$1 break;
        proxy_pass         http://llama;
        proxy_set_header   Host $host;
        proxy_set_header   X-Real-IP $remote_addr;
        proxy_read_timeout 300s;
        proxy_buffering    off;
        proxy_http_version 1.1;
        proxy_set_header   Connection "";
    }

    # /mistral/* → Mistral 7B on port 8001
    location /mistral/ {
        rewrite ^/mistral/(.*) /$1 break;
        proxy_pass         http://mistral;
        proxy_set_header   Host $host;
        proxy_set_header   X-Real-IP $remote_addr;
        proxy_read_timeout 300s;
        proxy_buffering    off;
        proxy_http_version 1.1;
        proxy_set_header   Connection "";
    }

    # Default → primary model
    location / {
        proxy_pass         http://llama;
        proxy_set_header   Host $host;
        proxy_set_header   X-Real-IP $remote_addr;
        proxy_read_timeout 300s;
        proxy_buffering    off;
        proxy_http_version 1.1;
        proxy_set_header   Connection "";
    }
}

Two things here that aren’t obvious:

proxy_buffering off: by default nginx buffers the upstream response and forwards it only as its buffers fill or the response ends. For streaming LLM output (Server-Sent Events), that means tokens reach the user in delayed bursts, or not until generation is complete. proxy_buffering off lets tokens stream through as they arrive.

proxy_read_timeout 300s: nginx’s default is 60 seconds, measured between two successive reads from the upstream rather than over the whole response. Non-streaming requests with large prompts or a high max_tokens can go longer than that without sending a byte, and nginx then cuts them off with a 504. 300s covers most 8B model workloads; push to 600s if you’re running 70B models.
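
A quick way to check that buffering is really off is to timestamp each streamed chunk as it arrives. The helper below is a minimal sketch: timed_chunks and arrival_spread are hypothetical names, and the code works on any iterable of strings so it can be tried without a server.

```python
import time
from typing import Iterable, Iterator, List, Tuple


def timed_chunks(chunks: Iterable[str]) -> Iterator[Tuple[float, str]]:
    """Yield (seconds_since_start, chunk) for each chunk as it arrives."""
    start = time.monotonic()
    for chunk in chunks:
        yield time.monotonic() - start, chunk


def arrival_spread(timestamps: List[float]) -> float:
    """Seconds between the first and last chunk; 0.0 for fewer than two chunks."""
    if len(timestamps) < 2:
        return 0.0
    return timestamps[-1] - timestamps[0]
```

With the OpenAI client, pass stream=True to client.chat.completions.create(...) and feed the text of each chunk (chunk.choices[0].delta.content or "") into timed_chunks. A spread close to zero on a long generation suggests something in the path is still buffering.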

4. Prometheus Monitoring

Create monitoring/prometheus.yml:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: vllm-primary
    static_configs:
      - targets: ['vllm-primary:8000']
    metrics_path: /metrics

  - job_name: vllm-secondary
    static_configs:
      - targets: ['vllm-secondary:8001']
    metrics_path: /metrics

vLLM exposes Prometheus-formatted metrics at /metrics on the same port as the API — no separate process needed. Key metrics to monitor:

Metric	What it tells you
`vllm:num_requests_running`	Active inference requests — watch for sustained saturation
`vllm:gpu_cache_usage_perc`	KV cache fill percentage — above 95% consistently means reduce `--max-model-len`
`vllm:e2e_request_latency_seconds`	End-to-end latency histogram
`vllm:num_requests_waiting`	Queue depth — a rising value means the GPU is falling behind

Verify the metrics endpoint is live after startup. The vLLM ports are not published to the host, so go through nginx:

curl -s http://localhost/llama/metrics | grep vllm:num_requests_running

Expected output:

# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="meta-llama/Llama-3.1-8B-Instruct",...} 0.0
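
To read one of these values from a script rather than by eye, the text format is simple enough to parse directly. This is a minimal sketch under stated assumptions: read_metric is a hypothetical helper, it expects lines without trailing timestamps (as in the output above), and it does not handle every corner of the Prometheus exposition format.

```python
from typing import List


def read_metric(text: str, name: str) -> List[float]:
    """Return every sample value for the metric `name` in Prometheus text output."""
    values = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comments.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at "{" (start of labels) or at the first space.
        metric_name = line.split("{", 1)[0].split(" ", 1)[0]
        if metric_name != name:
            continue
        # The value is the first field after the closing "}" of the labels,
        # or after the name when there are no labels.
        rest = line.rsplit("}", 1)[-1] if "}" in line else line.split(" ", 1)[1]
        values.append(float(rest.split()[0]))
    return values
```

The text to parse can come from any HTTP client pointed at http://localhost/llama/metrics; one sample is returned per label combination, so a single-model instance normally yields a one-element list.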

Point a Grafana instance at http://localhost:9090 and import community dashboard ID 22688 for a pre-built vLLM panel.

5. Start the Stack

docker compose up -d

On first start, vLLM downloads model weights into the hf-cache volume. Llama 3.1 8B FP16 is ~16 GB; Mistral 7B is ~14 GB. Allow 5–15 minutes depending on your connection.

Watch primary startup:

docker compose logs -f vllm-primary

Ready when you see:

INFO:     Started server process [1]
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
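
Instead of watching the logs, a script can poll until the server answers. The sketch below assumes vLLM's /health endpoint is reachable through the nginx route (for example http://localhost/llama/health); the function names, the 15-minute timeout, and the 5-second interval are illustrative choices, not anything the stack requires.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers connection refused, timeouts, and error statuses such as 502.
        return False


def wait_until_ready(probe: Callable[[], bool],
                     timeout: float = 900.0,
                     interval: float = 5.0) -> bool:
    """Call `probe` every `interval` seconds until it returns True or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() + interval > deadline:
            return False
        time.sleep(interval)
```

A typical call would be wait_until_ready(lambda: http_ok("http://localhost/llama/health")). Keeping the probe as a plain callable means the same loop can wait on the secondary model or on any other check.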

6. Test With the OpenAI Python Client

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost/llama/v1",
    api_key="change-this-to-a-long-random-string",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain PagedAttention in one paragraph."}],
    max_tokens=200,
)

print(response.choices[0].message.content)

Expected output (abbreviated):

PagedAttention is a memory management technique for LLM inference that treats the
KV cache as paged virtual memory, similar to how operating system virtual memory works.
Instead of allocating a fixed contiguous block of GPU memory per request...

Switch to Mistral by changing base_url to http://localhost/mistral/v1 and model to "mistralai/Mistral-7B-Instruct-v0.3". No other changes needed — the OpenAI client is unaware of the routing.
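
If application code switches between the two models often, the base_url and model pairs can live in one small table. This is a sketch only: ROUTES and route_for are hypothetical names, and the URLs assume the nginx paths configured in section 3 on localhost.

```python
# Maps a short route name to the nginx path and the model ID served behind it.
ROUTES = {
    "llama": ("http://localhost/llama/v1", "meta-llama/Llama-3.1-8B-Instruct"),
    "mistral": ("http://localhost/mistral/v1", "mistralai/Mistral-7B-Instruct-v0.3"),
}


def route_for(name: str) -> dict:
    """Return the base_url and model to use for a named route."""
    try:
        base_url, model = ROUTES[name]
    except KeyError:
        raise ValueError(
            f"unknown route {name!r}; expected one of {sorted(ROUTES)}"
        ) from None
    return {"base_url": base_url, "model": model}
```

The base_url goes to the OpenAI(...) constructor and the model to chat.completions.create(...), so a mismatch between the two (Mistral's model ID sent to the Llama path, for instance) is caught in one place.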

Verify a bad API key returns 401 as expected:

curl -s -o /dev/null -w "%{http_code}" \
  http://localhost/v1/models \
  -H "Authorization: Bearer wrong-key"
# → 401

7. Graceful Model Swapping

To replace the secondary model without touching primary:

# 1. Edit docker-compose.yml: change --model for vllm-secondary to the new model
# 2. Restart only that service
docker compose up -d --no-deps vllm-secondary

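Docker Compose recreates only vllm-secondary. In-flight requests on vllm-primary continue uninterrupted. The secondary is unavailable for 2–5 minutes while weights load from the cache volume, or longer if the new model has to be downloaded first. If nginx keeps returning 502 for /mistral/ after the container is back, run docker compose restart nginx: nginx resolves upstream hostnames once at startup, and a recreated container can come back with a different IP.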

Common problem: the new model needs more VRAM than the previous one and hits an OOM at --gpu-memory-utilization 0.85. Reduce the value to 0.80 and restart again. The CUDA OOM error shows up in docker compose logs vllm-secondary within the first 30 seconds of startup.
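
Before swapping in a model, a back-of-envelope estimate shows whether it is likely to fit. The sketch below is a rough approximation, not how vLLM itself accounts for memory: it ignores activations, CUDA graphs, and framework overhead. The architecture numbers (about 8.03 billion parameters, 32 layers, 8 KV heads, head dimension 128) are the Llama 3.1 8B values and are assumptions to replace with the figures from the new model's config.json.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """KV cache bytes for one token: a key and a value vector in every layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def vram_estimate(vram_gib: float, utilization: float, params_billion: float,
                  dtype_bytes: int, layers: int, kv_heads: int, head_dim: int,
                  context_len: int) -> dict:
    """Rough VRAM split in GiB for one vLLM instance."""
    budget = vram_gib * utilization  # what --gpu-memory-utilization allows
    weights = params_billion * 1e9 * dtype_bytes / GIB
    per_context = (
        kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes)
        * context_len / GIB
    )
    return {
        "budget_gib": budget,
        "weights_gib": weights,
        "left_for_kv_gib": budget - weights,
        "kv_per_full_context_gib": per_context,
    }


# Llama 3.1 8B in FP16 on a 24 GiB card with the settings from this guide.
estimate = vram_estimate(vram_gib=24, utilization=0.85, params_billion=8.03,
                         dtype_bytes=2, layers=32, kv_heads=8, head_dim=128,
                         context_len=32768)
for name, value in estimate.items():
    print(f"{name}: {value:.2f}")
```

If left_for_kv_gib comes out smaller than kv_per_full_context_gib, a single request at the full --max-model-len will not fit, and lowering --max-model-len is the more direct fix.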

Adding HTTPS

The nginx config above is HTTP-only. To add TLS:

Obtain a cert (Let’s Encrypt via Certbot, or a self-signed cert for internal use)
Mount the cert into the nginx container and publish the HTTPS port: add - ./certs:/etc/nginx/certs:ro to the nginx volumes and - "443:443" to the nginx ports
Add to vllm.conf:

server {
    listen 443 ssl;
    ssl_certificate     /etc/nginx/certs/fullchain.pem;
    ssl_certificate_key /etc/nginx/certs/privkey.pem;
    ssl_protocols       TLSv1.3;
    # ... rest of location blocks unchanged
}

server {
    listen 80;
    return 301 https://$host$request_uri;
}

The vLLM containers themselves need no changes — TLS terminates at nginx.

When NOT to Use This Setup

Single user on a local machine — Docker Compose + nginx adds startup time and complexity you don’t need. Use pip install vllm and vllm serve directly. The basic setup guide has you running in 10 minutes.

Cloud inference under ~$200/month — a dedicated GPU server running below 60% utilization costs more than RunPod serverless endpoints at that usage level. The math shifts above ~80% sustained utilization.

Low-latency single-user chatbot — Ollama has lower cold-start overhead and simpler process management for a single user. See the Ollama vs vLLM comparison for a detailed breakdown of where each wins.

Windows host — the NVIDIA Container Toolkit on Windows WSL2 has known issues with GPU passthrough in Compose stacks. Run this on bare-metal Linux or a Linux VM.

FAQ

Can I run both models on a single GPU?
Not recommended. vLLM’s --gpu-memory-utilization is per-process — two instances on the same GPU compete for VRAM and will likely OOM each other. Use device_ids: ["0"] for primary and device_ids: ["1"] for secondary to pin each to its own GPU. If you only have one GPU, serve one model at a time.

How do I rotate the API key without downtime?
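Not with this stack as written: vLLM reads VLLM_API_KEY only at startup and doesn’t hot-reload it. Update .env, then run docker compose up -d; Compose recreates the two vLLM containers because their configuration changed. Each one is unavailable until its model has finished loading, so schedule the rotation for a quiet period.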

The OpenAI client gets a connection timeout on long responses. What’s wrong?
Your client has its own timeout separate from nginx’s proxy_read_timeout. In the Python SDK: client = OpenAI(..., timeout=300.0). Match it to whatever nginx timeout you’ve set.

Can I use tensor parallelism across multiple GPUs for a single model?
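Yes. Add --tensor-parallel-size 2 (or however many GPUs) to the vLLM command and update device_ids to list all GPU indices. Also set ipc: host (or a larger shm_size) on the service, because vLLM relies on shared memory for tensor-parallel inference. Remove the secondary service from Compose if you’re dedicating all GPUs to one model.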

How do I check what model a vLLM instance is serving?
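curl http://localhost/llama/v1/models -H "Authorization: Bearer <your-key>" returns a JSON list with the model ID. The request goes through nginx because the vLLM ports are not published to the host; use /mistral/v1/models for the secondary.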
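
The same check is easy to script. This sketch assumes the response follows the OpenAI list format, a JSON object with a "data" array whose entries carry an "id"; served_model_ids is a hypothetical helper name, and the function takes the response body as a string so it stays independent of the HTTP client.

```python
import json
from typing import List


def served_model_ids(body: str) -> List[str]:
    """Extract the model IDs from a /v1/models JSON response body."""
    payload = json.loads(body)
    # Entries without an "id" are skipped rather than raising.
    return [entry["id"] for entry in payload.get("data", []) if "id" in entry]
```

Comparing the result against the model name in docker-compose.yml is a cheap post-deploy check after a model swap.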

Recommended Gear

RTX 4090 — 24 GB VRAM, handles two 7B–8B models in INT4 on a dual-GPU workstation
