Jun 25, 2026

NVIDIA Cosmos 3 Nano Self-Hosting Guide 2026: vLLM Setup

By AIFoss · 10 min read

nvidiacosmosvllmselfhostedai

TL;DR: Cosmos 3 Nano is the first fully open omnimodal world model — one 16B model that takes text, images, video, audio, and action trajectories in, and produces any of those back out. It self-hosts through a single vLLM-Omni Docker image. The catch is hardware: NVIDIA built it for a 96GB workstation card, not a gaming GPU.

	Cosmos 3 Nano	A standard 16B LLM	Cosmos 3 Super
Best for	Robotics, world simulation, action reasoning	Text chat, RAG, coding	Larger physical-AI workloads
Modalities	Text + image + video + audio + action	Text (sometimes vision)	Same as Nano, higher fidelity
Realistic VRAM	48GB+ (NVIDIA targets 96GB)	12–16GB at 4-bit	80GB+
License	OpenMDW-1.1 (commercial OK, attribution)	Varies	OpenMDW-1.1

Honest take: If you build physical AI — robots, autonomous systems, synthetic training video — this is the most important open release of the year. If you wanted a 10GB drop-in for your RTX 4090, this is not that model. Read the VRAM section before you docker pull.

NVIDIA released Cosmos 3 at GTC Taipei / Computex in late May 2026, with the full technical report following on June 22, 2026. Cosmos 3 Nano (nvidia/Cosmos3-Nano) is the small member of the family — “small” being relative, since it still packs 16B parameters and a full generative pipeline.

What you’ll have running after this guide:

A Cosmos 3 Nano server on http://localhost:8000/v1 behind an OpenAI-compatible API
A clear picture of whether your GPU can actually run it, and what to rent if it can’t
A working understanding of the OpenMDW-1.1 license so you know what you can ship commercially

What “omnimodal physical AI model” actually means

Most local models you run are language models. You feed them text, you get text. Vision-language models add image input. Cosmos 3 Nano goes further: it is a world model. It reasons about physical environments and can generate them.

The architecture splits 16B parameters into two halves: an 8B reasoner and an 8B generator. The reasoner understands a scene and predicts what happens next; the generator produces the output — which can be text, an image, a video clip, ambient sound, or a sequence of robot actions. NVIDIA’s pitch is that this collapses physical-AI training and evaluation cycles from months to days, because you can simulate and reason about the physical world in a single model instead of stitching together a perception stack, a planner, and a renderer.

That makes it useful for a narrow but real set of jobs: robotics inference, autonomous-system dataset generation, smart-space perception, and synthetic training video for environments you can’t safely or cheaply film. It is not a general-purpose chatbot. Pointing it at “summarize this PDF” is using a forklift to carry a coffee cup.

The VRAM reality (read this first)

The queue brief that inspired this article guessed “~10GB at Q4, fits an RTX 4090.” That guess is wrong, and it’s worth correcting because it’s the single biggest reason a self-hoster will hit a wall.

Cosmos 3 Nano ships with BF16 weights. Sixteen billion parameters in BF16 is roughly 32GB of weights alone — before you account for the generator’s diffusion pipeline, which needs substantial activation memory to produce video and images. NVIDIA’s own model documentation says Nano is “optimized for efficient inference and designed to run on workstation-grade compute like the RTX PRO 6000 GPU.” The RTX PRO 6000 Blackwell carries 96GB of GDDR7.

Here’s the honest hardware ladder:

Hardware	VRAM	Cosmos 3 Nano outcome
RTX 4090 / 5090	24–32GB	Tight to impossible for the full omni pipeline; reasoner-only experiments may fit
2× RTX 4090 / RTX 6000 Ada	48GB	Workable with care; expect to tune batch and resolution
RTX PRO 6000 Blackwell	96GB	NVIDIA’s target — runs comfortably single-GPU
Cloud GPU	—	An H100 or RTX PRO 6000 on RunPod avoids the buy-in

If you are evaluating whether to buy a workstation card for this, our sister site has the hardware breakdown at runaihome.com. For a one-off test, renting beats buying — Cosmos 3 Nano is a “try it on cloud GPU first” model, not a “spin it up on the gaming rig” model. There is no public 4-bit GGUF that makes this fit on consumer hardware as of June 2026; the supported path is BF16 through vLLM-Omni.

Prerequisites

Before you pull the image, check the boxes NVIDIA actually requires:

GPU architecture: Ampere, Hopper, or Blackwell. Older cards (Turing, Pascal) are not supported.
OS: Linux. There is no native Windows path; use WSL or a Linux box.
CUDA: 12.8 or 13, with a matching driver. Mismatched driver/toolkit versions are the most common silent failure.
Docker with the NVIDIA Container Toolkit (--runtime nvidia).
Disk: budget ~40GB for the model weights and image cache.

A quick sanity check before anything else:

$ nvidia-smi --query-gpu=name,memory.total --format=csv
name, memory.total [MiB]
NVIDIA RTX PRO 6000 Blackwell, 97887 MiB

If that command errors or shows a card with under 48GB, stop here and reread the VRAM section.

Deploy with vLLM-Omni Docker

NVIDIA maintains vllm/vllm-omni:cosmos3 as the all-in-one deployment image. It bundles the omni serving path so you don’t assemble the diffusion pipeline by hand. This is the supported deployment, straight from the Cosmos GitHub repo:

docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v "$(pwd):/workspace" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-omni:cosmos3 \
  vllm serve nvidia/Cosmos3-Nano \
  --omni \
  --model-class-name Cosmos3OmniDiffusersPipeline \
  --allowed-local-media-path / \
  --port 8000 \
  --init-timeout 1800

A few flags matter:

--omni switches vLLM into omnimodal mode — without it, the server tries to load Cosmos as a plain LLM and fails.
--model-class-name Cosmos3OmniDiffusersPipeline selects the generator pipeline. Typos here produce a class-not-found error at startup.
--allowed-local-media-path / lets the server read local image/video files you reference in requests. Tighten this to a specific directory in production.
--init-timeout 1800 gives the container 30 minutes to download weights and warm up. On a first run, weight download dominates; don’t kill it early.

When it’s ready you’ll see vLLM’s standard startup line:

INFO:     Started server process [1]
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

The endpoint speaks the OpenAI API, so any client you already use against vLLM points at it with no code changes — you just send richer multimodal payloads.

A problem you’ll actually hit: the init timeout and IPC

The two failures that bite first-time users are both in the command above, and both are easy to misdiagnose.

First, the container appears to hang for several minutes on startup. It isn’t hung — it’s downloading ~32GB of BF16 weights to ~/.cache/huggingface. The --init-timeout 1800 flag exists precisely because the default timeout is too short for this model. If you removed it or set it low, vLLM kills itself mid-download and you see a cryptic “initialization exceeded timeout” with no obvious cause. Keep the 30-minute window on the first run; subsequent runs read from cache and start in under a minute.

Second, omnimodal generation crashes with a shared-memory error if you drop --ipc=host. The diffusion generator uses PyTorch DataLoader workers that need more /dev/shm than Docker’s 64MB default. The symptom is a Bus error or RuntimeError: DataLoader worker killed partway through the first video generation. --ipc=host (or --shm-size=8g) fixes it. This one is easy to blame on the GPU when it’s really a container config issue.

The OpenMDW-1.1 license: what you can actually ship

Cosmos 3 Nano is released under the OpenMDW-1.1 license, a Linux Foundation license for model weights and data. For self-hosters and commercial teams, this is genuinely permissive — closer to Apache 2.0 than to a restrictive “community license” — but it is not Apache 2.0, and the difference is a real obligation.

What OpenMDW-1.1 permits:

Commercial use
Modification and fine-tuning
Redistribution
Building and distributing derivative models

The one constraint that matters: products built on Cosmos must display “Built on NVIDIA Cosmos” somewhere visible — a website footer, an app’s About page, or product documentation. That’s lighter than Llama’s monthly-active-user clause or MiniMax’s non-commercial terms, but heavier than Apache’s “do whatever, just keep the notice file.” If you’re shipping a commercial robotics product, factor the attribution into your UI review.

If license clarity drives your model choices, our open-source LLM licensing guide walks through how OpenMDW compares to MIT, Apache, and the Llama Community License. NVIDIA also offers custom licensing through cosmos-license@nvidia.com if the attribution requirement is a dealbreaker.

When NOT to self-host Cosmos 3 Nano

Be honest with yourself about the use case before committing a GPU server:

You want a chatbot or coding assistant. Cosmos is a physical-AI world model. A 14B Qwen or a Devstral will serve text tasks far better and run on a fraction of the VRAM.
You only have a consumer GPU. Without a 48GB+ card (and ideally the 96GB RTX PRO 6000), you’ll spend more time fighting OOM errors than building. Rent cloud GPU instead.
You need a quick proof of concept. The first-run download and warmup is slow, and the omni pipeline has more moving parts than a text LLM. build.nvidia.com lets you try Cosmos 3 in a browser with no GPU — start there before you self-host.
Your task doesn’t touch the physical world. Document RAG, summarization, classification — none of these benefit from a world model. You’re paying a heavy modality tax for capability you won’t use.

Cosmos 3 Nano is a specialist tool. Used for what it’s built for — robotics, simulation, synthetic physical data — it’s the most capable open option available. Used as a general model, it’s an expensive mismatch.

FAQ

Is Cosmos 3 Nano really free for commercial use? Yes, under OpenMDW-1.1 — commercial use, modification, and redistribution are all allowed. The only requirement is a visible “Built on NVIDIA Cosmos” attribution in your product.

Can I run it on an RTX 4090? Not realistically for the full omnimodal pipeline. The model ships in BF16 (~32GB of weights) plus generator overhead, and NVIDIA targets the 96GB RTX PRO 6000. A 24GB card can’t hold it; even 48GB is tight. Rent cloud GPU for testing.

What’s the difference between Cosmos 3 Nano and Super? Nano is the 16B model optimized for efficient inference. Super is the 64B model for higher-fidelity, larger physical-AI workloads. Both use OpenMDW-1.1 and the same vLLM-Omni deployment path.

Do I need vLLM-Omni, or will regular vLLM work? Use vLLM-Omni for Cosmos 3 Nano. The vllm/vllm-omni:cosmos3 image and the --omni flag enable the diffusion generator pipeline that standard vLLM doesn’t load. The model is also supported through NVIDIA’s Cosmos framework, PyTorch, and Hugging Face Diffusers.

What modalities can it actually output? Text, images, video, ambient sound, and action sequences — from any combination of those same modalities as input, including action trajectories.

Sources

Was this article helpful?