vLLM Production Setup 2026: Nginx, Auth, Multiple Models
TL;DR: This guide turns a single-machine vLLM install into a team-facing API with authentication, nginx routing, and multiple models on separate ports. All of it runs on-prem, costs nothing beyond hardware, and is wire-compatible with the OpenAI Python SDK. The setup takes about 30 minutes and requires familiarity with Docker and nginx.
What you’ll have running after this guide:
- A Docker Compose stack with vLLM v0.22.0 behind nginx, API key auth, and Prometheus monitoring
- Two models served on separate ports and unified under a single nginx endpoint
- An OpenAI Python client pointed at your local server, working without any code changes
Honest take: Skip this guide if you’re running vLLM for personal use — the basic setup guide gets you serving in 10 minutes. Come back here when more than one person on your team needs API access or you need to run multiple models from one machine.
vLLM v0.22.0, Apache 2.0 license, released May 29, 2026.
What This Adds Over the Basic Setup
The basic vLLM setup gives you a vllm serve process on localhost with no auth and no process manager. That’s fine for local experimentation.
A team-facing API needs more:
| Feature | Basic Setup | Production Setup (this guide) |
|---|---|---|
| Authentication | None — open to anyone on the network | Bearer token via --api-key flag |
| Process management | Manual (vllm serve in terminal) | Docker Compose with restart policies |
| Multiple models | One terminal per model | Separate containers, nginx-routed |
| SSL termination | No | nginx (cert drop-in ready) |
| Monitoring | None | Prometheus /metrics on same port |
| Model swapping | Kill and restart | docker compose up -d --no-deps |
Prerequisites
- Docker Engine ≥ 23.0 with Compose V2 (docker compose, not docker-compose)
- NVIDIA Container Toolkit ≥ 1.14
- NVIDIA driver ≥ 525, CUDA ≥ 12.1
- At least 24 GB VRAM per model in FP16, with one GPU per model; a dual-GPU board with two RTX 4090s (24 GB each) handles two 7B–8B models in INT4 simultaneously, one per GPU
- A Hugging Face account with model access if you’re serving gated models (Llama 3)
If you don’t own the hardware yet, RunPod A100 or H100 instances are the fastest way to validate this setup before committing to a purchase. The configuration below runs unmodified on RunPod.
1. Project Layout
Create the working directory:
mkdir vllm-prod && cd vllm-prod
mkdir -p nginx/conf.d monitoring
Final structure:
vllm-prod/
├── docker-compose.yml
├── .env
├── nginx/
│   └── conf.d/
│       └── vllm.conf
└── monitoring/
    └── prometheus.yml
2. Docker Compose Stack
Create docker-compose.yml:
version: "3.9"
services:
  vllm-primary:
    image: vllm/vllm-openai:v0.22.0
    runtime: nvidia
    restart: unless-stopped
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
      - VLLM_API_KEY=${API_KEY}
    command: >
      --model meta-llama/Llama-3.1-8B-Instruct
      --api-key ${API_KEY}
      --host 0.0.0.0
      --port 8000
      --gpu-memory-utilization 0.85
      --max-model-len 32768
    volumes:
      - hf-cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
    networks:
      - vllm-net
  vllm-secondary:
    image: vllm/vllm-openai:v0.22.0
    runtime: nvidia
    restart: unless-stopped
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
      - VLLM_API_KEY=${API_KEY}
    command: >
      --model mistralai/Mistral-7B-Instruct-v0.3
      --api-key ${API_KEY}
      --host 0.0.0.0
      --port 8001
      --gpu-memory-utilization 0.85
      --max-model-len 32768
    volumes:
      - hf-cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]
    networks:
      - vllm-net
  nginx:
    image: nginx:1.27-alpine
    restart: unless-stopped
    ports:
      - "80:80"
    volumes:
      - ./nginx/conf.d:/etc/nginx/conf.d:ro
    depends_on:
      - vllm-primary
      - vllm-secondary
    networks:
      - vllm-net
  prometheus:
    image: prom/prometheus:v3.0.0
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
    networks:
      - vllm-net
volumes:
  hf-cache:
networks:
  vllm-net:
    driver: bridge
Note that vllm-primary and vllm-secondary expose no ports directly to the host — all external traffic enters through nginx on port 80. The vLLM containers are reachable only from within the vllm-net Docker network.
Create .env in the same directory:
HF_TOKEN=hf_yourtoken123
API_KEY=change-this-to-a-long-random-string
The API_KEY value is what clients send as Authorization: Bearer <key>. vLLM validates it natively via the --api-key flag — no custom middleware needed.
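The placeholder value in .env should be replaced before anyone else gets access. A minimal sketch for generating a key with Python's standard secrets module; the 32-byte length and the sk-local- prefix are assumptions chosen for illustration, not something vLLM requires:

```python
import secrets


def generate_api_key(num_bytes: int = 32, prefix: str = "sk-local-") -> str:
    """Return a URL-safe random key suitable for the API_KEY line in .env.

    The prefix is a hypothetical naming convention and purely cosmetic;
    clients must send the full string as their Bearer token.
    """
    return prefix + secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    # Paste the printed value after API_KEY= in .env
    print(generate_api_key())
```

Because the key is interpolated into the Compose command line, a URL-safe alphabet avoids characters that would need shell or YAML quoting.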
3. Nginx Config
Create nginx/conf.d/vllm.conf:
upstream llama {
    server vllm-primary:8000;
}
upstream mistral {
    server vllm-secondary:8001;
}
server {
    listen 80;
    server_name _;
    # /llama/* → Llama 3.1 8B on port 8000
    location /llama/ {
        rewrite ^/llama/(.*) /$1 break;
        proxy_pass http://llama;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_read_timeout 300s;
        proxy_buffering off;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
    # /mistral/* → Mistral 7B on port 8001
    location /mistral/ {
        rewrite ^/mistral/(.*) /$1 break;
        proxy_pass http://mistral;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_read_timeout 300s;
        proxy_buffering off;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
    # Default → primary model
    location / {
        proxy_pass http://llama;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_read_timeout 300s;
        proxy_buffering off;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}
Two things here that aren’t obvious:
proxy_buffering off: with buffering on (the nginx default), nginx collects upstream output in its own buffers before forwarding it to the client. For streaming LLM output (Server-Sent Events), that means tokens arrive in delayed bursts, or not until generation is complete. proxy_buffering off lets tokens stream through as they arrive.
proxy_read_timeout 300s — nginx’s default is 60 seconds. Long-generation requests (large prompts, high max_tokens) will exceed this and get dropped mid-response. 300s covers most 8B model workloads; push to 600s if you’re running 70B models.
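To make the streaming point concrete, here is a minimal sketch of how a client consumes an OpenAI-style Server-Sent Events stream line by line. The function name iter_stream_text and the sample lines are illustrative assumptions; the sample stands in for what the server would send through nginx, so the snippet runs without a live stack:

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style SSE lines as they arrive."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


if __name__ == "__main__":
    # Hypothetical sample stream; a real one comes from the HTTP response body.
    sample = [
        'data: {"choices": [{"delta": {"role": "assistant"}}]}',
        "",
        'data: {"choices": [{"delta": {"content": "Paged"}}]}',
        "",
        'data: {"choices": [{"delta": {"content": "Attention"}}]}',
        "",
        "data: [DONE]",
    ]
    for fragment in iter_stream_text(sample):
        print(fragment, end="", flush=True)
    print()
```

Each fragment can only be yielded once its line has reached the client, which is why any buffering between vLLM and the client shows up directly as delay.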
4. Prometheus Monitoring
Create monitoring/prometheus.yml:
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: vllm-primary
    static_configs:
      - targets: ['vllm-primary:8000']
    metrics_path: /metrics
  - job_name: vllm-secondary
    static_configs:
      - targets: ['vllm-secondary:8001']
    metrics_path: /metrics
vLLM exposes Prometheus-formatted metrics at /metrics on the same port as the API — no separate process needed. Key metrics to monitor:
| Metric | What it tells you |
|---|---|
| vllm:num_requests_running | Active inference requests — watch for sustained saturation |
| vllm:gpu_cache_usage_perc | KV cache fill percentage — above 95% consistently means reduce --max-model-len |
| vllm:e2e_request_latency_seconds | End-to-end latency histogram |
| vllm:num_requests_waiting | Queue depth — a rising value means the GPU is falling behind |
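Verify the metrics endpoint is live after startup. The vLLM containers publish no ports to the host, so go through nginx, whose default route forwards to the primary:
curl -s http://localhost/metrics | grep vllm:num_requests_running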
Expected output:
# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="meta-llama/Llama-3.1-8B-Instruct",...} 0.0
Point a Grafana instance at http://localhost:9090 and import community dashboard ID 22688 for a pre-built vLLM panel.
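For a quick check without Grafana, the metrics text can be parsed directly. The sketch below is one possible approach, not part of vLLM: read_gauges and is_falling_behind are hypothetical helper names, the threshold of 5 waiting requests is an arbitrary assumption, and the sample values are made up for illustration.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
# Simplification: assumes label values contain no "}" character.
_SAMPLE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{[^}]*\})?\s+(\S+)")


def read_gauges(metrics_text, names):
    """Return {metric_name: value} for the requested names, summed across label sets."""
    values = {}
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue  # HELP and TYPE comment lines
        match = _SAMPLE.match(line)
        if match and match.group(1) in names:
            name = match.group(1)
            values[name] = values.get(name, 0.0) + float(match.group(2))
    return values


def is_falling_behind(gauges, max_waiting=5.0):
    """True when the queue depth exceeds a (hypothetical) alert threshold."""
    return gauges.get("vllm:num_requests_waiting", 0.0) > max_waiting


if __name__ == "__main__":
    # Made-up sample in Prometheus text format; real text comes from /metrics.
    sample = (
        "# TYPE vllm:num_requests_running gauge\n"
        'vllm:num_requests_running{model_name="meta-llama/Llama-3.1-8B-Instruct"} 4.0\n'
        'vllm:num_requests_waiting{model_name="meta-llama/Llama-3.1-8B-Instruct"} 9.0\n'
    )
    gauges = read_gauges(
        sample, {"vllm:num_requests_running", "vllm:num_requests_waiting"}
    )
    print(gauges, "falling behind:", is_falling_behind(gauges))
```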
5. Start the Stack
docker compose up -d
On first start, vLLM downloads model weights into the hf-cache volume. Llama 3.1 8B FP16 is ~16 GB; Mistral 7B is ~14 GB. Allow 5–15 minutes depending on your connection.
Watch primary startup:
docker compose logs -f vllm-primary
Ready when you see:
INFO: Started server process [1]
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
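Instead of watching logs, a script can poll until the server answers. This is a minimal sketch using only the standard library; wait_until_ready and http_ok are hypothetical helper names, and the 15-minute default timeout simply mirrors the download estimate above. It assumes vLLM's /health endpoint is reachable through the nginx default route at http://localhost/health.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET to url answers with HTTP 200, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # connection refused, 502 from nginx, timeout, etc.


def wait_until_ready(
    check: Callable[[], bool],
    timeout_s: float = 900.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call check() repeatedly until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Example call (requires the stack to be starting or running):
# wait_until_ready(lambda: http_ok("http://localhost/health"))
```

Passing the check as a callable keeps the polling loop independent of HTTP, so the same loop can wait on the secondary model by pointing it at the /mistral/ route.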
6. Test With the OpenAI Python Client
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost/llama/v1",
    api_key="change-this-to-a-long-random-string",
)
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain PagedAttention in one paragraph."}],
    max_tokens=200,
)
print(response.choices[0].message.content)
Expected output (abbreviated):
PagedAttention is a memory management technique for LLM inference that treats the
KV cache as paged virtual memory, similar to how operating system virtual memory works.
Instead of allocating a fixed contiguous block of GPU memory per request...
Switch to Mistral by changing base_url to http://localhost/mistral/v1 and model to "mistralai/Mistral-7B-Instruct-v0.3". No other changes needed — the OpenAI client is unaware of the routing.
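Since the model name and the nginx path must change together, a small lookup keeps them in sync. A minimal sketch, assuming the two routes from the nginx config in this guide; ROUTES and base_url_for are hypothetical names, not part of the OpenAI SDK or vLLM:

```python
# Model ID -> nginx path prefix, mirroring the location blocks in vllm.conf.
ROUTES = {
    "meta-llama/Llama-3.1-8B-Instruct": "/llama/v1",
    "mistralai/Mistral-7B-Instruct-v0.3": "/mistral/v1",
}


def base_url_for(model: str, host: str = "http://localhost") -> str:
    """Return the base_url to pass to OpenAI(...) for the given model ID."""
    try:
        return host.rstrip("/") + ROUTES[model]
    except KeyError:
        raise ValueError(f"no nginx route configured for model {model!r}") from None


if __name__ == "__main__":
    for model_id in ROUTES:
        print(model_id, "->", base_url_for(model_id))
```

A caller would then construct the client with base_url=base_url_for(model) and pass the same model string to chat.completions.create.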
Verify a bad API key returns 401 as expected:
curl -s -o /dev/null -w "%{http_code}" \
http://localhost/v1/models \
-H "Authorization: Bearer wrong-key"
# → 401
7. Graceful Model Swapping
To replace the secondary model without touching primary:
# 1. Edit docker-compose.yml: change --model for vllm-secondary to the new model
# 2. Restart only that service
docker compose up -d --no-deps vllm-secondary
Docker Compose restarts only vllm-secondary. In-flight requests on vllm-primary continue uninterrupted. The secondary is unavailable for 2–5 minutes while weights load from the cache volume (faster if the model was already downloaded).
Common problem: if the new model needs more VRAM than the previous one, startup can fail with a CUDA OOM error, visible in docker compose logs vllm-secondary within the first 30 seconds. Lower --gpu-memory-utilization from 0.85 to 0.80 and restart; if the model still does not fit, reduce --max-model-len to shrink the KV cache.
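Whether a model is likely to fit can be estimated before restarting. The sketch below is a rough back-of-the-envelope calculation under stated assumptions: it counts only weights plus the KV cache for one full-length sequence, ignores activations and framework overhead, and uses Llama 3.1 8B's published shape (32 layers, 8 KV heads, head dimension 128) with roughly 15 GiB of FP16 weights. The function names are hypothetical.

```python
GIB = 2**30


def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache per token: one K and one V vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def estimate_fit(total_vram_gib, utilization, weights_gib, max_model_len, per_token_bytes):
    """Return (fits, headroom_gib) for one sequence of max_model_len tokens."""
    budget_gib = total_vram_gib * utilization
    kv_gib = max_model_len * per_token_bytes / GIB
    headroom_gib = budget_gib - weights_gib - kv_gib
    return headroom_gib >= 0, headroom_gib


if __name__ == "__main__":
    # Assumed values: 24 GiB card, ~15 GiB of FP16 weights, settings from this guide.
    per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
    fits, headroom = estimate_fit(24, 0.85, 15.0, 32768, per_token)
    print(f"KV cache per token: {per_token} bytes")
    print(f"fits: {fits}, headroom: {headroom:.2f} GiB")
```

The estimate is optimistic by design, so a small positive headroom should be read as a warning rather than a guarantee.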
Adding HTTPS
The nginx config above is HTTP-only. To add TLS:
- Obtain a cert (Let’s Encrypt via Certbot, or a self-signed cert for internal use)
- Mount the cert into the nginx container: add - ./certs:/etc/nginx/certs:ro to the nginx volumes
- Publish the TLS port: add - "443:443" to the nginx ports in docker-compose.yml
- Add to vllm.conf:
server {
    listen 443 ssl;
    ssl_certificate /etc/nginx/certs/fullchain.pem;
    ssl_certificate_key /etc/nginx/certs/privkey.pem;
    ssl_protocols TLSv1.3;
    # ... rest of location blocks unchanged
}
server {
    listen 80;
    return 301 https://$host$request_uri;
}
The vLLM containers themselves need no changes — TLS terminates at nginx.
When NOT to Use This Setup
Single user on a local machine — Docker Compose + nginx adds startup time and complexity you don’t need. Use pip install vllm and vllm serve directly. The basic setup guide has you running in 10 minutes.
Cloud inference under ~$200/month — a dedicated GPU server running below 60% utilization costs more than RunPod serverless endpoints at that usage level. The math shifts above ~80% sustained utilization.
Low-latency single-user chatbot — Ollama has lower cold-start overhead and simpler process management for a single user. See the Ollama vs vLLM comparison for a detailed breakdown of where each wins.
Windows host — the NVIDIA Container Toolkit on Windows WSL2 has known issues with GPU passthrough in Compose stacks. Run this on bare-metal Linux or a Linux VM.
FAQ
Can I run both models on a single GPU?
Not recommended. vLLM’s --gpu-memory-utilization is per-process — two instances on the same GPU compete for VRAM and will likely OOM each other. Use device_ids: ["0"] for primary and device_ids: ["1"] for secondary to pin each to its own GPU. If you only have one GPU, serve one model at a time.
How do I rotate the API key without downtime?
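vLLM reads VLLM_API_KEY only at startup and does not hot-reload it, so rotating the key in this setup involves a short outage. Update .env, then run docker compose up -d; Compose recreates the vLLM containers whose configuration changed. Each model is unavailable while its container restarts and reloads weights from the cache volume, so plan for the same window as a model swap rather than zero downtime.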
The OpenAI client gets a connection timeout on long responses. What’s wrong?
Your client has its own timeout separate from nginx’s proxy_read_timeout. In the Python SDK: client = OpenAI(..., timeout=300.0). Match it to whatever nginx timeout you’ve set.
Can I use tensor parallelism across multiple GPUs for a single model?
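Yes. Add --tensor-parallel-size 2 (or however many GPUs) to the vLLM command, update device_ids to list all GPU indices, and give the container access to host shared memory (ipc: host, or a larger shm_size), which the tensor-parallel worker processes use to exchange data. Remove the secondary service from Compose if you’re dedicating all GPUs to one model.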
How do I check what model a vLLM instance is serving?
curl http://localhost/llama/v1/models -H "Authorization: Bearer <your-key>" returns a JSON list with the model ID. Use /mistral/v1/models for the secondary; the vLLM ports themselves are not published to the host.