vLLM Review 2026: Production LLM Inference at Scale
vLLM is the inference engine you reach for when Ollama stops being enough. Built at UC Berkeley’s Sky Computing Lab and now stewarded under the Linux Foundation’s PyTorch Foundation umbrella, it’s open-source (Apache 2.0), opinionated about performance, and deliberately harder to set up than the alternatives. That trade-off is the whole point.
If you’re serving LLMs to more than a handful of concurrent users, vLLM’s two core innovations — PagedAttention and continuous batching — change the math considerably. If you’re running a model locally for yourself, it’s overkill.
This review covers v0.21.0, released May 15, 2026, on Linux with NVIDIA hardware.
What vLLM does differently
Every LLM inference engine has to solve the same problem: the KV cache. As the model generates tokens, it needs to store key and value tensors for all previous tokens in the context window. That storage eats VRAM fast, and how it’s managed determines how many concurrent requests you can handle.
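To put a number on "eats VRAM fast", here is a rough sizing sketch. It assumes the published Llama 3 8B shape (32 layers, 8 KV heads, head dimension 128) and FP16 cache entries; the function name is illustrative and not part of vLLM.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies: one key and one value
    vector per layer per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Assumed Llama 3 8B shape: 32 layers, 8 KV heads (grouped-query attention), head_dim 128.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
per_request_8k = per_token * 8192  # one sequence that fills an 8192-token context

print(per_token // 1024, "KiB per token")                       # 128 KiB per token
print(per_request_8k / 2**30, "GiB per 8192-token sequence")    # 1.0 GiB per 8192-token sequence
```

At roughly 1 GiB per full-length sequence, a few dozen concurrent requests can need more cache than the weights themselves.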
Traditional serving allocates VRAM statically — you reserve a fixed block per request and fill it as generation proceeds. The waste is significant: you’re holding memory for the maximum possible context even when actual generation is using 10% of it.
PagedAttention solves this by borrowing the OS virtual memory idea. The KV cache is divided into fixed-size pages, and only the pages actually needed by active tokens are allocated. Memory fragmentation drops to near-zero, and the same VRAM supports far more concurrent sequences.
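The difference between static and paged allocation can be sketched with a toy allocator. This is an illustration of the idea only, not vLLM's implementation: the 16-token block size, the class name, and the sequence lengths are all assumptions.

```python
import math

BLOCK_TOKENS = 16  # tokens per KV block; an assumed value for illustration

def static_slots(seq_lens, max_len):
    """Static allocation: every request reserves room for max_len tokens."""
    return len(seq_lens) * max_len

def paged_slots(seq_lens, block_tokens=BLOCK_TOKENS):
    """Paged allocation: each request holds only the blocks its tokens occupy."""
    return sum(math.ceil(n / block_tokens) * block_tokens for n in seq_lens)

class BlockTable:
    """Maps each sequence's logical block index to a physical block id."""

    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))
        self.tables = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id, position, block_tokens=BLOCK_TOKENS):
        """Return (physical block, offset) for a token; positions must arrive in order."""
        table = self.tables.setdefault(seq_id, [])
        if position // block_tokens >= len(table):  # existing blocks are full
            table.append(self.free.pop())           # any free block will do
        return table[position // block_tokens], position % block_tokens

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))

seq_lens = [100, 800, 35, 2000]              # hypothetical in-flight sequence lengths
print(static_slots(seq_lens, max_len=8192))  # 32768 token slots reserved
print(paged_slots(seq_lens))                 # 2960 token slots actually held
```

Because a sequence's blocks do not need to be contiguous, the only waste is the unfilled tail of each sequence's last block.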
Continuous batching is the scheduling counterpart. Traditional batched inference waits for a full batch to complete before accepting new requests. Continuous batching lets new requests slot into the batch the moment a slot opens — mid-generation. Tail latency shrinks; GPU utilization rises.
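A small simulation shows why this matters when output lengths vary. It counts decode steps only and ignores prefill cost; the request lengths and batch size are made up for illustration, and the scheduler is a deliberately simplified stand-in for what vLLM does.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes; nothing joins mid-batch."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))

def continuous_batching_steps(lengths, batch_size):
    """A waiting request takes over a slot as soon as any running request finishes."""
    waiting = deque(lengths)
    running = []  # remaining decode steps for each active request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step for every active request; drop the ones that just finished.
        running = [r - 1 for r in running if r > 1]
        steps += 1
    return steps

lengths = [10, 200, 15, 12, 180, 8, 20, 9]  # hypothetical output lengths in tokens
print(static_batching_steps(lengths, batch_size=4))      # 380
print(continuous_batching_steps(lengths, batch_size=4))  # 200
```

In the static case the short requests finish early and their slots sit idle behind the 200-token request; in the continuous case those slots are reused immediately.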
These aren’t academic improvements. They’re why vLLM benchmarks at around 187 tokens/second on Llama 3 8B under 8 concurrent users, versus Ollama’s 82 tokens/second in the same scenario. At peak throughput with multiple concurrent requests, the gap widens further — roughly 793 tok/s versus 41 tok/s according to third-party benchmarks from Markaicode and SitePoint (see Sources).
Installation
vLLM runs on Linux with NVIDIA GPUs as its primary target. The simple path:
pip install vllm
That handles CUDA 12.4 on Linux. The wheel bundles PyTorch 2.11 and all dependencies — use a fresh virtual environment to avoid version conflicts.
System requirements:
- Python ≥3.10 and <3.15 (Python 3.14 added in v0.21.0)
- Linux (Ubuntu 20.04+ recommended)
- NVIDIA GPU with CUDA 12.4; CUDA 13.0 for newer Blackwell features
- VRAM appropriate to the model you’re serving (see table below)
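A quick pre-flight check for the first two requirements can be sketched as follows. The helper names are hypothetical, and the version bounds are simply the ones listed above.

```python
import sys

def python_supported(version_info=sys.version_info):
    """True if the interpreter is in the range listed above: >=3.10 and <3.15."""
    return (3, 10) <= tuple(version_info[:2]) < (3, 15)

def platform_is_linux(platform=sys.platform):
    """True on Linux, including WSL2, which reports itself as Linux."""
    return platform.startswith("linux")

if __name__ == "__main__":
    print("Python version OK:", python_supported())
    print("Linux platform:", platform_is_linux())
```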
AMD ROCm support exists but trails the NVIDIA path — context-length limitations on AMD GPUs were still being worked through as of April 2026 (64k-token wall on certain configurations). Windows received native support in 2026 but requires CUDA 13.0 and RTX 6000 Ada or newer; WSL2 remains the more reliable Windows path for older hardware.
If you want to skip local driver setup entirely, RunPod provides vLLM-ready GPU instances with pre-installed CUDA environments. Useful for evaluating vLLM on production hardware before committing to a server build.
Serving your first model
The quickstart launches an OpenAI-compatible API server:
vllm serve meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 8000
Or with explicit configuration:
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 8192
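--gpu-memory-utilization 0.9 caps vLLM’s total footprint at 90% of the GPU’s VRAM. Model weights and activation memory come out of that budget first, and whatever remains becomes the KV cache pool. The last 10% is left for the CUDA context and anything else running on the card. Tune this downward if you’re hitting OOM errors.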
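As a rough budget sketch: the fraction caps vLLM's whole footprint, weights are subtracted from it, and the remainder is what the KV cache can use. The card size, weight size, and overhead figure below are assumptions for illustration, and the function is not a vLLM API.

```python
def kv_cache_budget_gib(total_vram_gib, gpu_memory_utilization, weights_gib, overhead_gib=1.0):
    """KV cache space left inside vLLM's memory cap after weights and runtime overhead.

    overhead_gib (activations, CUDA graphs) is an assumed placeholder, not a measurement.
    """
    budget = total_vram_gib * gpu_memory_utilization - weights_gib - overhead_gib
    return max(budget, 0.0)

# Hypothetical 24 GiB card serving an 8B model in FP16 (about 15 GiB of weights).
budget = kv_cache_budget_gib(total_vram_gib=24, gpu_memory_utilization=0.9, weights_gib=15)

# 128 KiB per token assumes the Llama 3 8B shape: 2 * 32 layers * 8 KV heads * 128 dims * 2 bytes.
bytes_per_token = 2 * 32 * 8 * 128 * 2
cacheable_tokens = int(budget * 2**30 // bytes_per_token)

print(f"{budget:.1f} GiB for KV cache")       # 5.6 GiB for KV cache
print(cacheable_tokens, "tokens across all sequences")
```

Lowering the fraction shrinks the cache pool, not the weights, so the cost of a lower setting is fewer concurrent sequences or shorter contexts.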
Once running, the server accepts standard OpenAI API calls:
from openai import OpenAI
# vLLM does not check the API key unless the server was started with one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the name passed to vllm serve
    messages=[{"role": "user", "content": "Explain PagedAttention in two sentences."}],
)
print(response.choices[0].message.content)
Any application already written for the OpenAI API works with vLLM by changing the base URL. That drop-in compatibility is the main reason teams choose it for self-hosted API infrastructure.
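The compatibility is at the HTTP level, so no SDK is required. A minimal standard-library sketch follows; it assumes the server from the quickstart is listening on localhost:8000, and the helper names and model name are illustrative.

```python
import json
import urllib.request

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # must match the name passed to vllm serve

def build_chat_request(base_url, model, prompt, api_key="not-needed"):
    """Build (but do not send) a POST to the OpenAI-style chat completions route."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )

def send(request, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]

req = build_chat_request("http://localhost:8000/v1", MODEL, "Hello")
print(req.full_url)  # http://localhost:8000/v1/chat/completions
# print(send(req))   # uncomment once the server is running
```

Pointing an existing application at vLLM amounts to changing the first argument of `build_chat_request`.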
For 70B models across multiple GPUs, tensor parallelism is one flag away:
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --host 0.0.0.0 \
  --port 8000
vLLM requires identical GPUs for tensor parallelism — all cards in the group need matching VRAM and compute capability.
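Before picking a --tensor-parallel-size, it helps to check two things: that the model's attention heads split evenly across the group, and how much weight memory lands on each card. The sketch below assumes a 70B Llama shape with 64 attention heads and about 140 GB of FP16 weights; the function is hypothetical, not part of vLLM.

```python
def tensor_parallel_plan(weights_gb, num_attention_heads, tp_size):
    """Per-GPU weight share for a tensor-parallel group.

    Raises ValueError when the attention heads cannot be split evenly,
    which is the usual reason a tensor-parallel size is rejected.
    """
    if num_attention_heads % tp_size != 0:
        raise ValueError(
            f"{num_attention_heads} heads cannot be split across {tp_size} GPUs")
    return weights_gb / tp_size

# Assumed 70B Llama shape: 64 attention heads, ~140 GB of FP16 weights.
print(tensor_parallel_plan(140, 64, 4))  # 35.0 GB of weights per GPU
print(tensor_parallel_plan(140, 64, 2))  # 70.0 GB of weights per GPU
```

The per-GPU figure covers weights only; each card also needs room for its share of the KV cache.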
VRAM requirements by model size
| Model size | FP16 (unquantized) | FP8 quantization | Example GPU |
|---|---|---|---|
| 7B | ~14 GB | ~8 GB | RTX 3080 10GB with FP8 |
| 13B | ~26 GB | ~14 GB | RTX 3090 24GB with FP8 |
| 34B | ~68 GB | ~36 GB | A100 80GB or 2× A100 40GB |
| 70B | ~140 GB | ~76 GB | 2× H100 80GB or 4× A100 40GB |
FP8 quantization (--quantization fp8) roughly halves VRAM requirements with minimal quality loss — it’s the first flag to add when you’re memory-constrained on consumer hardware. For a broader look at quantization tradeoffs across GGUF, AWQ, and FP8 formats, the GGUF quantization guide has the specifics.
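The FP16 column follows directly from parameter count times bytes per parameter. A minimal sketch of that arithmetic, counting weights only:

```python
def weight_gb(params_billion, bytes_per_param):
    """Raw weight footprint in GB (10^9 bytes): parameter count times bytes per parameter."""
    return params_billion * bytes_per_param

for size in (7, 13, 34, 70):
    fp16 = weight_gb(size, 2)  # FP16 stores 2 bytes per parameter
    fp8 = weight_gb(size, 1)   # FP8 stores 1 byte per parameter
    print(f"{size}B: FP16 ~{fp16} GB, FP8 ~{fp8} GB (weights only)")
```

The raw FP8 results (7, 13, 34, 70 GB) come out slightly below the table's FP8 column, which presumably includes some runtime headroom. Neither figure includes the KV cache.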
Performance: where the advantage shows up
vLLM’s edge grows with concurrency. At a single user, the gap over Ollama is modest, about 15% (71 versus 62 tok/s), partly because Ollama’s Q4_K_M quantization uses less memory and can run faster on memory-limited consumer hardware. The architecture difference becomes clear at scale:
| Scenario | vLLM (FP16) | Ollama (Q4_K_M) |
|---|---|---|
| 1 concurrent user, Llama 3 8B | ~71 tok/s | ~62 tok/s |
| 8 concurrent users | ~187 tok/s | ~82 tok/s |
| Peak sustained throughput | ~793 tok/s | ~41 tok/s |
Benchmarks from Markaicode and SitePoint 2026 testing; see Sources for links.
The single-user numbers are close because vLLM’s batching machinery has nothing to batch. Its advantage is in keeping GPU utilization high across many simultaneous requests — Ollama’s throughput flattens almost immediately under concurrent load, while vLLM’s scales smoothly.
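Working out the ratios from the table makes the scaling pattern explicit. This is plain arithmetic on the quoted third-party figures, not a new measurement.

```python
# (vLLM tok/s, Ollama tok/s), copied from the benchmark table above
benchmarks = {
    "1 concurrent user": (71, 62),
    "8 concurrent users": (187, 82),
    "peak sustained": (793, 41),
}

def speedup(vllm_tps, ollama_tps):
    """How many times faster vLLM is than Ollama in a given scenario."""
    return vllm_tps / ollama_tps

for scenario, (vllm_tps, ollama_tps) in benchmarks.items():
    print(f"{scenario}: {speedup(vllm_tps, ollama_tps):.2f}x")
# 1 concurrent user: 1.15x
# 8 concurrent users: 2.28x
# peak sustained: 19.34x
```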
Against TensorRT-LLM (NVIDIA’s own engine, which is open-source but tied to NVIDIA GPUs), vLLM trades a few percent of peak throughput for dramatically simpler setup and a model-agnostic architecture. TGI (Hugging Face’s Text Generation Inference) occupies the same niche as vLLM and is worth comparing if you’re already deep in the Hugging Face ecosystem.
For a detailed hardware-by-hardware Ollama vs vLLM breakdown, the Ollama vs vLLM comparison has the numbers.
Supported models in v0.21.0
vLLM’s model support covers most of the architectures that matter:
- Llama family: Llama 3.x, Llama 4
- Mistral/Mixtral: Mistral 7B, Mixtral 8x7B, Mixtral 8x22B
- Qwen: Qwen 2.5, Qwen 3.5, Qwen-VL vision-language variants
- DeepSeek: DeepSeek V3, V4, R1 (with MLA attention support)
- Gemma: Gemma 2, Gemma 3
- Phi: Microsoft Phi-3, Phi-4
- Vision-language models: LLaVA, InternVL, Moondream3 (added in v0.21.0)
New architectures from Hugging Face Transformers typically land in vLLM within weeks of release. The project moves fast: v0.21.0 shipped 367 commits from 202 contributors, and v0.20.0, released less than three weeks earlier on April 27, had 752 commits from 320.
What changed in v0.21.0
May 2026 brought a significant feature update. The headline additions:
Speculative decoding with thinking budget support: reduces latency for reasoning models like DeepSeek R1 by drafting candidate tokens with a smaller model and verifying them in batches with the full model. Because it speeds up the decode phase rather than prefill, it helps with per-token latency and total completion time on long chain-of-thought outputs, not with time-to-first-token.
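The draft-and-verify loop can be sketched in its simplest greedy form. This is a toy with stand-in "models" (plain functions over integer tokens), not vLLM's implementation, and it does not model the thinking-budget feature. The property it demonstrates is that the output is identical to decoding with the target model alone, while the target runs fewer passes.

```python
def greedy_decode(next_token, prompt, num_tokens):
    """Baseline: one target-model call per generated token."""
    out = list(prompt)
    for _ in range(num_tokens):
        out.append(next_token(out))
    return out[len(prompt):]

def speculative_decode(target_next, draft_next, prompt, num_tokens, k=4):
    """Greedy speculative decoding: draft k tokens, keep the prefix the target agrees with."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < num_tokens:
        # 1. The cheap draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2. The target checks all k+1 positions; a real engine does this in one batched pass.
        target_passes += 1
        accepted = []
        for i in range(k + 1):
            expected = target_next(out + accepted)
            accepted.append(expected)  # the target's own token is always kept
            if i == k or draft[i] != expected:
                break                  # first disagreement (or bonus token) ends the round
        out.extend(accepted)
    return out[len(prompt):len(prompt) + num_tokens], target_passes

def target_next(ctx):
    """Stand-in for the large model: a deterministic toy function."""
    return (ctx[-1] * 7 + len(ctx)) % 101

def draft_next(ctx):
    """Stand-in for the small model: agrees with the target except at every fifth position."""
    guess = target_next(ctx)
    return guess if len(ctx) % 5 else (guess + 1) % 101

tokens, passes = speculative_decode(target_next, draft_next, [1, 2, 3], num_tokens=20)
print(tokens == greedy_decode(target_next, [1, 2, 3], 20))  # True: same output as the target alone
print(passes, "target passes instead of 20")
```

The saving depends entirely on how often the draft model agrees with the target; a poorly matched draft model wastes work instead of saving it.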
KV Offload + Hybrid Memory Allocator — offloads KV cache pages to CPU RAM when VRAM is under pressure. Throughput drops when pages are paged out, but effective context window extends beyond what VRAM alone would support. The trade-off is acceptable for long-document workloads with bursty concurrency.
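The general shape of a two-tier cache can be sketched with a toy store that spills its least recently used pages to CPU memory. The LRU policy, class name, and capacities here are assumptions for illustration; the source does not describe which eviction policy vLLM's allocator uses.

```python
from collections import OrderedDict

class TwoTierKVStore:
    """Toy two-tier page store: the GPU tier holds hot pages, CPU RAM takes the overflow."""

    def __init__(self, gpu_capacity):
        self.gpu_capacity = gpu_capacity
        self.gpu = OrderedDict()  # page_id -> data, ordered from oldest to newest use
        self.cpu = {}
        self.swap_ins = 0         # how many times a page had to be copied back

    def put(self, page_id, data):
        self.gpu[page_id] = data
        self.gpu.move_to_end(page_id)
        self._evict()

    def get(self, page_id):
        if page_id not in self.gpu:  # page was offloaded: copy it back (the slow path)
            self.gpu[page_id] = self.cpu.pop(page_id)
            self.swap_ins += 1
        self.gpu.move_to_end(page_id)
        self._evict()
        return self.gpu[page_id]

    def _evict(self):
        while len(self.gpu) > self.gpu_capacity:
            victim, data = self.gpu.popitem(last=False)  # least recently used page
            self.cpu[victim] = data

store = TwoTierKVStore(gpu_capacity=2)
for page in ("a", "b", "c"):
    store.put(page, f"kv-{page}")
print(sorted(store.cpu))  # ['a']: the oldest page was offloaded
store.get("a")            # touching it swaps it back in and pushes 'b' out
print(store.swap_ins)     # 1
```

Each swap-in stands for a transfer over PCIe, which is where the throughput drop comes from.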
C++20 build requirement — breaking change for source builds. Pre-built wheels aren’t affected, but if you’re compiling from source (e.g., for custom CUDA versions), your toolchain needs to be current.
v0.20.0 (April 27) laid the groundwork: FlashAttention 4 as default MLA prefill, CUDA 13.0 as default, PyTorch 2.11 upgrade, and initial DeepSeek V4 support. If you’re still on a CUDA 12.x driver stack, v0.20.x behavior still applies until you upgrade it.
When NOT to use vLLM
Single-user local setups. Running models on a personal machine for coding, writing, or research? Ollama is the better fit. vLLM’s setup overhead and Linux/NVIDIA requirements are unnecessary friction when you’re the only user. The Ollama review covers that use case.
Windows without WSL on older hardware. Native Windows support landed in 2026 but requires CUDA 13.0 and an RTX 6000 Ada or newer. On a 3080 or 4090, WSL2 still works but adds indirection that complicates debugging and resource monitoring.
Teams without DevOps capacity. vLLM is a server you operate. You’re responsible for monitoring memory pressure, handling OOM crashes, managing model updates, and tuning parameters for new models. That’s fine for production engineering teams; it’s friction for a two-person startup that needs something running by Thursday.
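Part of that operating burden is watching the server's queue. vLLM's OpenAI-compatible server exposes Prometheus-format metrics at /metrics; the sketch below parses that text format and flags a growing backlog. The sample values and the threshold are made up, and metric names can change between vLLM versions, so treat the names as something to verify against your own server's output.

```python
def parse_prometheus(text):
    """Parse Prometheus text exposition lines into {metric_name: value}, ignoring labels."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        values[name] = float(value)
    return values

def backlog_too_deep(metrics, max_waiting=10):
    """True when more requests are queued than the (assumed) threshold allows."""
    return metrics.get("vllm:num_requests_waiting", 0.0) > max_waiting

# Made-up sample in the shape a /metrics response takes.
sample = """
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 3.0
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="demo"} 12.0
"""

metrics = parse_prometheus(sample)
print(metrics["vllm:num_requests_running"])  # 3.0
print(backlog_too_deep(metrics))             # True
```

In practice you would scrape the endpoint with Prometheus itself; the point is that this monitoring is yours to set up.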
Consumer hardware with 8-12GB VRAM. vLLM’s native quantization (FP8, AWQ) targets server GPUs. On a 3080 10GB trying to serve a 7B model to yourself, Ollama’s GGUF Q4_K_M loading is simpler and gets comparable single-user throughput. vLLM’s edge only shows up at concurrency you’re unlikely to generate alone.
AMD GPU users. ROCm support is functional but incomplete — context-length bugs on AMD hardware, weaker documentation, and a slower feature cadence compared to the NVIDIA path. Check the vLLM issue tracker before committing to AMD for production use.
vLLM on cloud hardware
If you want production-grade serving without managing bare-metal, cloud GPU rental changes the calculus. Running vLLM on RunPod gives you an H100 80GB with a clean CUDA environment and none of the driver setup. At current GPU rental rates, this is viable for small API products, private inference for teams of 5-50, or model benchmarking before hardware procurement.
For the hardware side — whether to build a multi-GPU server or rent — runaihome.com covers GPU server build costs and cloud rental tradeoffs in detail.
The verdict
vLLM is the right tool when you’re building something multi-user: an internal API for a team, a customer-facing inference endpoint, a research cluster. Its architecture is genuinely well-designed for that job, and v0.21.0 continues adding features (speculative decoding, KV offload) that used to require commercial alternatives.
For everything else — personal use, low-VRAM hardware, GUI-first workflows, Windows desktops without WSL — there are better-matched tools. The difficulty of vLLM setup isn’t a flaw; it’s a signal about who it’s built for.
If you’re deciding where vLLM fits in a broader self-hosted stack alongside RAG pipelines and chat UIs, the open-source AI stack guide covers how inference engines, frontends, and document retrieval components fit together.
Sources
- vLLM GitHub Releases — vllm-project/vllm
- vLLM GPU Installation — official docs (stable)
- vllm on PyPI — version history and install notes
- ollama vs vLLM Throughput Benchmark 2026 — Markaicode
- Ollama vs vLLM: Performance Benchmark 2026 — SitePoint
- Ollama vs vLLM: A Deep Dive into Performance Benchmarking — Red Hat Developer
- vLLM OpenAI-Compatible Server — official docs
- GPU Requirements Cheat Sheet 2026 — Spheron Blog
- vLLM Multi-GPU Setup Guide — Will It Run AI
- Breaking Context Limits on AMD GPUs: Patching vLLM — dasroot.net