vLLM Setup Guide 2026: Serve Any LLM via OpenAI API

TL;DR: By the end of this guide you’ll be serving any Hugging Face model through a local OpenAI-compatible API endpoint. vLLM v0.21.0 handles the heavy lifting — PagedAttention, continuous batching, multi-GPU tensor parallelism — once you give it the right flags. The bottleneck is VRAM, not setup complexity.

What you’ll have running after this guide:

  • A vLLM server on http://localhost:8000/v1 accepting standard OpenAI API requests
  • A tested endpoint you can point any OpenAI Python client, LangChain app, or chat frontend at
  • Optional: Docker-based deployment with API-key authentication for team use

Honest take: pip install vllm is genuinely that simple. The decision points are which GPU flags to set for your hardware and whether to go bare-metal or Docker. Both paths are below.

vLLM is open-source under the Apache 2.0 license — no usage restrictions for commercial or internal deployments.


Prerequisites

Before you start, check your hardware against what vLLM can actually serve:

| Setup | GPU VRAM | System RAM | What runs well |
|---|---|---|---|
| Entry point | 16 GB (e.g. RTX 4080) | 32 GB | Llama 3.2 3B (FP16), Mistral 7B (quantized) |
| Comfortable | 24 GB (e.g. RTX 3090 or RTX 4090) | 64 GB | Llama 3.1 8B (FP16), Qwen2.5 14B (quantized) |
| Multi-user API | 48 GB+ (2× RTX 4090) | 128 GB | Llama 3.1 70B (INT4), DeepSeek V3-lite |
| Cloud alternative | RunPod A100 or H100 on-demand | | |

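If your GPU has less than 16 GB of VRAM, you’ll be limited to smaller models or quantized variants. That’s fine: pick a model whose weights fit and set --max-model-len conservatively (covered in Step 4). Note that --dtype float16 does not save memory over bfloat16, since both use 2 bytes per parameter.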
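
As a rough way to read the table above, weight memory is approximately the parameter count times the bytes per parameter. The sketch below applies that rule of thumb. The helper name and the rounded parameter counts are illustrative assumptions, not part of vLLM, and real usage is higher once KV cache and CUDA overhead are added.

```python
# Rule-of-thumb weight memory: parameters x bytes per parameter.
# Hypothetical sizing helper; not part of vLLM.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "bfloat16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_gb(num_params: float, dtype: str) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes), excluding KV cache."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    # Parameter counts are rounded for illustration.
    for name, params, dtype in [
        ("3B model, FP16", 3e9, "float16"),
        ("8B model, FP16", 8e9, "float16"),
        ("70B model, INT4", 70e9, "int4"),
        ("70B model, BF16", 70e9, "bfloat16"),
    ]:
        print(f"{name}: ~{weight_gb(params, dtype):.0f} GB of weights")
```

Compare the result against your VRAM multiplied by --gpu-memory-utilization, and leave room for the KV cache on top.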

Software requirements:

  • Linux — Ubuntu 22.04 or 24.04 recommended. macOS and Windows are not supported by vLLM.
  • Python 3.10–3.14 (Python 3.12 is the sweet spot; tested and well-supported)
  • NVIDIA GPU with CUDA 12.4 and driver 550+. Verify: nvidia-smi
  • At least 50 GB free disk space for model weights

Not on Linux? Use Ollama instead — it supports macOS and Windows natively with near-identical API compatibility. See Ollama vs vLLM 2026 for the full comparison.


Step 1: Install vLLM

Use a fresh Python virtual environment. vLLM ships pre-compiled CUDA kernels, and those kernels are tied to a specific PyTorch and CUDA version. Installing vLLM into an existing environment tends to cause version conflicts that are hard to debug.

python3 -m venv ~/.venvs/vllm
source ~/.venvs/vllm/bin/activate

pip install vllm

The wheel bundles PyTorch 2.11 and pre-compiled CUDA kernels. Expect a 5–10 minute install on a fresh environment — most of the time is download, not compilation.

Verify the install:

python -c "import vllm; print(vllm.__version__)"
# 0.21.0

If you’d rather use conda to manage the environment:

conda create -n vllm-env python=3.12 -y
conda activate vllm-env
pip install vllm   # still use pip, not conda install

The conda environment for isolation is fine; the package itself must come from pip because the conda-forge build lags behind and often has NCCL conflicts with multi-GPU setups.


Step 2: Serve your first model

The vllm serve command starts an HTTP server. Pass any Hugging Face model ID:

vllm serve meta-llama/Llama-3.2-3B-Instruct

On first run, the model downloads to ~/.cache/huggingface/hub/. Subsequent runs load from cache — fast.

The server listens on port 8000 by default, so the API is reachable at http://localhost:8000/v1, which is the base URL to give any OpenAI client. You’ll see startup output like this when it’s ready:

INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

For gated models (the meta-llama repos, including the Llama 3.2 model above, Llama 3.1, and Llama 3.3), you need a Hugging Face access token. Get one at huggingface.co/settings/tokens, accept the model’s terms on the model page, then:

export HF_TOKEN=hf_your_token_here
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct
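
Loading weights can take a while, so scripts that launch the server usually need to wait before sending requests. One possible approach, sketched below, is to poll the /v1/models endpoint until it answers. The function name, timeouts, and the injectable opener parameter are assumptions made for illustration, and the sketch assumes the server runs without --api-key.

```python
import json
import time
import urllib.error
import urllib.request


def wait_for_server(base_url="http://localhost:8000/v1", timeout_s=600.0,
                    poll_s=2.0, opener=urllib.request.urlopen):
    """Poll GET {base_url}/models until it answers; return the served model IDs.

    `opener` is injectable so the function can be exercised without a server.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with opener(f"{base_url}/models", timeout=5) as resp:
                payload = json.load(resp)
            return [model["id"] for model in payload["data"]]
        except (urllib.error.URLError, ConnectionError, TimeoutError):
            # Not accepting connections yet (for example, still loading weights).
            if time.monotonic() >= deadline:
                raise TimeoutError(
                    f"no response from {base_url} after {timeout_s}s"
                ) from None
            time.sleep(poll_s)
```

Calling wait_for_server() right after starting vllm serve in another process returns the list of model IDs once the endpoint is up.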

Step 3: Test the endpoint

The /v1/chat/completions endpoint is OpenAI-compatible. Test with curl before touching any application code:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "What is PagedAttention?"}],
    "max_tokens": 200
  }'

The model field must match exactly what you passed to vllm serve — including the full HuggingFace path. Check what the server sees as its loaded model:

curl http://localhost:8000/v1/models

With the OpenAI Python client:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-used",  # any non-empty string works when --api-key is not set
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize vLLM in two sentences."}],
    max_tokens=200,
)
print(response.choices[0].message.content)

Streaming works identically to the OpenAI SDK — pass stream=True and iterate response. vLLM also supports the /v1/completions (legacy text completion) and /v1/embeddings endpoints.
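
A minimal streaming sketch, assuming the server from Step 2 is running without --api-key and the openai package is installed. The helper names collect_stream and stream_chat are hypothetical; only the stream=True call and the chunk.choices[0].delta.content field come from the OpenAI SDK.

```python
def collect_stream(chunks, on_delta=None):
    """Concatenate text deltas from a chat-completions stream.

    `on_delta`, if given, is called with each text fragment as it arrives.
    """
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks carry no choices (e.g. usage-only chunks)
        delta = chunk.choices[0].delta.content
        if delta:  # role-only and final chunks have no content
            parts.append(delta)
            if on_delta is not None:
                on_delta(delta)
    return "".join(parts)


def stream_chat(prompt, model="meta-llama/Meta-Llama-3.1-8B-Instruct",
                base_url="http://localhost:8000/v1"):
    """Stream one reply from a running server, printing tokens as they arrive."""
    # Imported lazily so collect_stream can be used without the SDK installed.
    from openai import OpenAI

    client = OpenAI(base_url=base_url, api_key="not-used")
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200,
        stream=True,
    )
    return collect_stream(stream, on_delta=lambda text: print(text, end="", flush=True))
```

stream_chat("Summarize vLLM in two sentences.") prints the reply incrementally and returns the full text.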


Step 4: Key configuration flags

The defaults work out of the box for most single-user setups. These flags matter when you hit limits or serve real traffic.

GPU memory utilization

vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.90

Default is 0.90 (90% of VRAM reserved for vLLM). Raise to 0.95 if a model barely fits. Lower to 0.80 if other processes share the GPU. The remaining headroom prevents OOM errors from CUDA context overhead.

Data type

vllm serve ... --dtype float16

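auto (default) follows the dtype stored in the model’s config, which is bfloat16 for most recent models. float16 and bfloat16 both use 2 bytes per parameter, so switching between them does not reduce memory usage; set --dtype float16 mainly on pre-Ampere GPUs (V100, T4, RTX 20-series), which lack native bfloat16 support. float32 is almost never what you want; it doubles VRAM usage.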

Context window cap

vllm serve ... --max-model-len 8192

vLLM defaults to the model’s maximum context length (often 128k for modern models). At startup it checks that the KV cache left over after loading the weights can hold at least one sequence of that length, and it refuses to start if it cannot, even if your requests only ever use 2k tokens. Setting --max-model-len 8192 lowers that requirement, so the model starts on GPUs where the full context would not fit and the KV cache is shared among shorter sequences, which allows more concurrent requests. Use the longest context your actual use case needs, not the model’s theoretical maximum.
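
The effect of the context cap on KV cache size can be estimated with simple arithmetic: each token stores one key and one value vector per layer. The sketch below uses the standard formula with the Llama 3.1 8B architecture values (32 layers, 8 KV heads, head dimension 128) at 2 bytes per element. The function name is hypothetical, and actual allocation in vLLM is done in blocks with some overhead, so treat the numbers as estimates.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Llama 3.1 8B: 32 layers, 8 KV heads (grouped-query attention), head dim 128.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

for context_len in (8_192, 32_768, 131_072):
    gib = per_token * context_len / 2**30
    print(f"{context_len:>7} tokens -> {gib:.0f} GiB for one full-length sequence")
```

Under these assumptions a single 128k-token sequence needs about 16 GiB of KV cache, while an 8k-token sequence needs about 1 GiB, which is why capping the context matters so much on a 24 GB card.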

Host and port

vllm serve ... --host 0.0.0.0 --port 8080

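--host 0.0.0.0 makes the server listen on all network interfaces, which Docker containers and remote team access require. The startup log in Step 2 shows the server already does this when --host is not set, so pass --host 127.0.0.1 if you want it reachable from localhost only.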

API key authentication

vllm serve ... --api-key your-secret-key-here

Once set, all requests must include:

Authorization: Bearer your-secret-key-here

The OpenAI Python client handles this automatically when you pass api_key="your-secret-key-here" to the OpenAI() constructor.
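
For clients that do not use the OpenAI SDK, the header has to be set by hand. The sketch below builds the same chat request with only the Python standard library. The function names are hypothetical, and the key and model values are the placeholders used in this guide.

```python
import json
import urllib.request


def build_chat_request(prompt, api_key, model,
                       base_url="http://localhost:8000/v1"):
    """Build a POST to /chat/completions carrying the Bearer token."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 200,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

send(build_chat_request("Hello", "your-secret-key-here", "meta-llama/Meta-Llama-3.1-8B-Instruct")) returns the reply when the key matches the one passed to --api-key.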


Step 5: Multi-GPU tensor parallelism

For models larger than a single GPU’s VRAM, vLLM distributes the model across GPUs via tensor parallelism. The --tensor-parallel-size value must divide evenly into the model’s attention head count — for most models, 2, 4, or 8 GPUs work.

vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --dtype bfloat16 \
  --max-model-len 32768

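At 2 bytes per parameter, the bfloat16 weights of Llama 3.1 70B take roughly 140 GB, so this command needs two 80 GB-class GPUs (A100 or H100), and even then leaves little room for KV cache. Two RTX 4090s (48 GB combined VRAM) can only hold a 4-bit quantized 70B variant, as listed in the prerequisites table. Check GPU visibility first: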

nvidia-smi

Multi-GPU containers need extra shared memory for NCCL communication — handled in the Docker step below. If you’re running bare-metal, vLLM manages this automatically.
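
The divisibility rule and the per-GPU weight share can be checked before launching. The sketch below is a simplification that applies only the attention-head rule stated above (vLLM may enforce further constraints), and both helper names are hypothetical. The 64-head figure is the Llama 3.1 70B attention head count.

```python
def valid_tp_sizes(num_attention_heads, num_gpus):
    """Tensor-parallel sizes up to num_gpus that divide the head count evenly."""
    return [n for n in range(1, num_gpus + 1) if num_attention_heads % n == 0]


def weight_gb_per_gpu(num_params, bytes_per_param, tp_size):
    """Approximate weight share per GPU in GB, ignoring KV cache and overhead."""
    return num_params * bytes_per_param / tp_size / 1e9


if __name__ == "__main__":
    # Llama 3.1 70B has 64 attention heads; 70e9 parameters is a rounded figure.
    for tp in valid_tp_sizes(num_attention_heads=64, num_gpus=8):
        bf16 = weight_gb_per_gpu(70e9, 2.0, tp)
        int4 = weight_gb_per_gpu(70e9, 0.5, tp)
        print(f"TP={tp}: ~{bf16:.1f} GB/GPU in bfloat16, ~{int4:.1f} GB/GPU at 4-bit")
```

Pick the smallest tensor-parallel size whose per-GPU share fits comfortably under each card's VRAM, since the KV cache needs space as well.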


Step 6: Docker deployment

For repeatable, production-grade setups, the official vLLM image removes all host Python environment concerns:

docker run \
  --gpus all \
  --shm-size 16g \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  -e HF_TOKEN=$HF_TOKEN \
  vllm/vllm-openai:latest \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --api-key my-secret-key \
    --gpu-memory-utilization 0.90

Key flags explained:

  • --gpus all — passes all host GPUs into the container
  • --shm-size 16g — shared memory for NCCL; required for multi-GPU, harmless for single-GPU
  • -v ~/.cache/huggingface:/root/.cache/huggingface — mounts your model cache so you don’t re-download on every container restart
  • Arguments after the image name (--model ...) are passed directly to vllm serve

For multi-GPU Docker:

docker run \
  --gpus all \
  --shm-size 32g \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  -e HF_TOKEN=$HF_TOKEN \
  vllm/vllm-openai:latest \
    --model meta-llama/Meta-Llama-3.1-70B-Instruct \
    --tensor-parallel-size 2 \
    --max-model-len 32768 \
    --api-key my-secret-key

Common errors and fixes

torch.cuda.OutOfMemoryError on startup

The model doesn’t fit in VRAM at the requested dtype and context length. Try in order:

  1. Lower --max-model-len to reduce KV cache pre-allocation
  2. Lower --gpu-memory-utilization to 0.85 if other processes compete for the GPU
  3. Use a quantized model variant (GPTQ 4-bit or AWQ) — these are available on HuggingFace for most popular models
  4. Switch to a smaller model, or add a second GPU

ValueError: The model's max seq len X is larger than the maximum number of tokens that can be stored in KV cache

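Same root cause. The error also reports how many tokens the KV cache can hold; set --max-model-len at or below that number, or free up KV cache space with a higher --gpu-memory-utilization or a quantized model.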

Model not found from API

The model field in your request must match the exact HuggingFace ID passed to vllm serve. Check:

curl http://localhost:8000/v1/models

Slow first response, fast afterward

Normal. vLLM warms up CUDA graphs on the first inference pass. Expect 10–30 seconds for the initial request depending on model size; subsequent requests are fast.


When NOT to use vLLM

A few situations where vLLM is the wrong choice:

Single-user personal use. If you’re the only user, sending one request at a time, Ollama is simpler and works on macOS and Windows. vLLM’s concurrency optimizations idle at zero benefit without concurrent requests.

Apple Silicon or Windows. vLLM doesn’t support Metal or DirectML. Use Ollama or LM Studio on those platforms.

You want a model catalog or chat UI. vLLM is a headless API server. No GUI for browsing or downloading models. Pair it with Open WebUI if you want a frontend — see the vLLM review for that configuration.

Models under 7B for experimentation. The setup overhead isn’t justified for quick experiments. Ollama handles those in under two minutes.


vLLM vs alternatives: setup comparison

| | vLLM | Ollama | HF TGI |
|---|---|---|---|
| Install | pip install vllm | curl \| sh (single binary) | Docker only |
| Model source | HuggingFace Hub + GGUF | Ollama library + GGUF | HuggingFace Hub only |
| OpenAI API compat | Native | Native | Yes |
| Multi-GPU | Tensor + pipeline parallelism | Layer splitting across GPUs | Tensor parallelism (sharding) |
| macOS / Windows | No | Yes | No |
| GPU memory efficiency | Best (PagedAttention) | Good | Good |
| Setup time | ~15 min | ~2 min | ~20 min |
| Best for | Multi-user API serving | Single-user local use | HuggingFace-first teams |

Frequently Asked Questions

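Does vLLM work without a GPU? Yes, vLLM has a CPU backend, but it is installed separately from the default CUDA wheel, and the method has changed across releases, so follow the CPU installation page in the vLLM docs. Expect 5–15 tokens/second on a modern multi-core CPU with a 7B model. Usable for testing pipelines; not suitable for production or interactive use.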

Can I load multiple models at once? Not within a single process. Each vllm serve instance loads one model. For multi-model setups, run multiple instances on different ports and route requests with nginx or a lightweight reverse proxy. vLLM’s deployment docs cover the nginx routing pattern.
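
The routing idea itself is small: map the model field of each request to the instance that serves it. The sketch below shows that lookup in Python as an illustration of the logic, not as a replacement for nginx. The port assignments and the names BACKENDS and backend_for are assumptions.

```python
# Hypothetical layout: one `vllm serve` process per model, each on its own port.
BACKENDS = {
    "meta-llama/Llama-3.2-3B-Instruct": "http://localhost:8000/v1",
    "meta-llama/Meta-Llama-3.1-8B-Instruct": "http://localhost:8001/v1",
}


def backend_for(request_body):
    """Return the base URL of the instance serving the requested model."""
    model = request_body.get("model")
    try:
        return BACKENDS[model]
    except KeyError:
        raise ValueError(f"no vLLM instance serves model {model!r}") from None
```

A proxy would call backend_for on the parsed JSON body and forward the request unchanged to the returned base URL.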

What’s the difference between vllm serve and using vLLM in Python directly? vllm serve starts an HTTP server you call via API. The Python LLM and AsyncLLMEngine classes let you embed vLLM directly in a Python process — useful when you’re building a backend and don’t want a separate server process. The API compatibility is the same either way.
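
A minimal sketch of the embedded route, assuming vLLM is installed and a GPU is available. It uses the LLM and SamplingParams classes from the vllm package. The wrapper function, its defaults, and the sampling values are illustrative choices.

```python
def generate_offline(prompts, model="meta-llama/Llama-3.2-3B-Instruct",
                     max_model_len=8192):
    """Run a batch of prompts in-process, with no HTTP server involved."""
    # Imported inside the function because importing vllm is slow and
    # requires the package and a supported GPU to be present.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model, max_model_len=max_model_len)
    params = SamplingParams(temperature=0.7, max_tokens=200)
    outputs = llm.generate(prompts, params)
    # Each result holds one or more completions; take the first one's text.
    return [out.outputs[0].text for out in outputs]
```

Note that generate takes raw prompt strings and does not apply the model's chat template, so instruct models generally behave better with properly formatted prompts.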

Does vLLM support GGUF models? Yes; GGUF loading has been available since the v0.5 series. Pass the HuggingFace ID of a GGUF repo or a local .gguf file path to vllm serve. Performance is slightly lower than native FP16/FP8 formats because the optimized CUDA kernels target native formats, but the difference is modest for most models.

Is the OpenAI API compatibility complete enough for LangChain and LlamaIndex? The endpoints both frameworks use — /v1/chat/completions, /v1/completions, /v1/embeddings — are fully compatible. Tool calling and streaming work. The Assistants API and file upload endpoints are not implemented. For RAG and agent workflows via LangChain or LlamaIndex, vLLM is a drop-in replacement.

