Jun 24, 2026

EXO Framework Setup Guide 2026: Pool Devices for Big LLMs

By AIFoss · 10 min read

TL;DR: EXO pools the memory of several devices into one cluster so you can run models bigger than any single machine holds. In mid-2026 it’s a genuinely good Apple Silicon tool and a rough one for NVIDIA on Linux, where it still defaults to CPU. Set expectations accordingly.

	EXO	llama.cpp RPC	vLLM + Ray
Best for	Mixing Macs/devices over LAN	Splitting one model across a few nodes	Multi-GPU production serving
Setup effort	Low (auto-discovery)	Medium (manual node list)	High (cluster config)
GPU support	Metal (MLX) on macOS; CPU by default on Linux, NVIDIA via fork	Full CUDA/Metal	Full CUDA
The catch	Network latency tax; Linux GPU is roadmap	You wire up every node	Needs real GPUs, not laptops

Honest take: If you have two or more Apple Silicon Macs sitting around, EXO is the fastest way to run a 70B+ model across them. If you have NVIDIA cards on Linux, use vLLM or llama.cpp RPC instead — EXO isn’t there yet.

What EXO actually is

EXO (the exo-explore/exo project) connects every device on your network into a single AI cluster. The pitch is simple: you probably don’t own one machine with 128GB of unified memory, but you might own three machines with 48GB each. EXO shards a model across them so the cluster can hold what no single node can.

It’s licensed Apache 2.0, which matters: you can use it commercially without the license asterisks attached to “open weights” model releases. The repo is active (latest tagged release v1.0.71, April 23, 2026, with commits landing through late June 2026), and it’s a full rewrite of the original project, which is now archived under exo-explore/ex-exo. If you find an old tutorial referencing the archived repo, ignore it.

Two architectural choices make EXO different from a typical inference server:

No master-worker. Devices connect peer-to-peer. There’s no head node to babysit — any device that’s reachable on the network can join the ring and contribute memory.
Ring memory-weighted partitioning. EXO splits the model into layers and assigns each device a number of layers proportional to its memory. A 64GB Mac Studio carries more of the model than a 16GB MacBook Air, automatically.
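
The memory-weighted split described above can be sketched in a few lines. This is a minimal illustration of the idea, not EXO's actual partitioner: the function name `partition_layers`, the device names, and the rounding rule are all assumptions made for the example.

```python
def partition_layers(memory_gb: dict[str, float], num_layers: int) -> dict[str, range]:
    """Assign each device a contiguous range of layers proportional to its memory.

    Hypothetical sketch: devices are visited in insertion order, as if
    arranged in a ring, and boundaries are placed at cumulative memory shares.
    """
    if not memory_gb or num_layers <= 0:
        raise ValueError("need at least one device and one layer")
    if any(mem <= 0 for mem in memory_gb.values()):
        raise ValueError("device memory must be positive")

    total = sum(memory_gb.values())
    names = list(memory_gb)
    assignments: dict[str, range] = {}
    start = 0
    cumulative = 0.0
    for i, name in enumerate(names):
        cumulative += memory_gb[name]
        # The last device always ends at num_layers so no layer is dropped.
        if i == len(names) - 1:
            end = num_layers
        else:
            end = round(num_layers * cumulative / total)
        assignments[name] = range(start, end)
        start = end
    return assignments


# Example inputs are illustrative: a 64GB and a 16GB device, 80-layer model.
plan = partition_layers({"mac-studio": 64, "macbook-air": 16}, num_layers=80)
for device, layers in plan.items():
    print(device, f"layers {layers.start}-{layers.stop - 1}", f"({len(layers)} total)")
```

With these inputs the larger machine gets layers 0-63 and the smaller one gets layers 64-79, a 4:1 split that mirrors the 64GB:16GB memory ratio.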

Devices discover each other with no manual config. Start EXO on two machines on the same LAN and they find each other. The cluster exposes a web UI and API at http://localhost:52415, and the API speaks three dialects: OpenAI Chat Completions, Anthropic’s Claude Messages format, and the Ollama API. That last one is the headline for self-hosters — anything you’ve already wired to talk to Ollama can point at EXO instead with a URL change.

The hardware reality check (read this before you buy anything)

Most EXO write-ups skip the single most important sentence in the documentation. Here it is, verbatim from the README:

“On macOS, exo uses the GPU. On Linux, exo currently runs on CPU. We are working on extending hardware accelerator support.”

Read that twice. On macOS, EXO uses the Metal GPU through Apple’s MLX framework — this is the path the project optimizes for. On Linux, the default backend (tinygrad) runs on CPU. GPU acceleration on Linux is a roadmap item, not a shipped feature in the upstream project.

That single fact reshapes the whole “build a home cluster” story. The viral benchmarks you’ve seen, such as pooling three RTX 3090s for a frontier model, are not something stock EXO on Linux delivers out of the box today. The showcase runs are on Apple Silicon: published demos pool 4× M3 Ultra Mac Studios to run Qwen3-235B, DeepSeek v3.1, and Kimi K2-class models. That’s where EXO is real in 2026.

If you have NVIDIA hardware, there’s a path, but it’s a fork-and-fiddle path — covered below.

Installing EXO

EXO is not a pip install package anymore — the v1.0 rewrite builds a Node.js dashboard and runs through the uv Python toolchain. You need uv, node, and rust installed first.

macOS — the easy path:

brew install --cask exo

That installs the prebuilt app. Prefer source? Clone and run it:

git clone https://github.com/exo-explore/exo
cd exo/dashboard && npm install && npm run build && cd ..
uv run exo

Linux:

git clone https://github.com/exo-explore/exo
cd exo/dashboard && npm install && npm run build && cd ..
uv run exo

Either way, you should see the node come up and print the dashboard URL:

$ uv run exo
exo node started
dashboard + API: http://localhost:52415
discovering peers on local network...

Open http://localhost:52415 in a browser and you get a chat UI plus a topology view of the cluster.

A real problem you’ll hit: Python version. EXO is happiest on Python 3.12. Installs on 3.13 have failed for users on Apple Silicon (tracked in GitHub issue #446 and the tinygrad version-incompatibility issue #867). If uv run exo dies during dependency resolution, pin the interpreter:

uv venv --python 3.12
uv run exo

Building a cluster

This is where EXO earns its keep. Run the same uv run exo command on a second machine on the same network. No config file, no IP list, no head node. The two nodes discover each other and the topology view in the dashboard updates to show both, with their combined memory.

To run a model, request it through the API. Because EXO speaks the OpenAI format, the call is ordinary:

curl http://localhost:52415/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
    "messages": [{"role": "user", "content": "Explain ring partitioning in one sentence."}]
  }'

EXO downloads the weights from Hugging Face on first use and shards them across the ring based on each device’s available memory. You don’t pick which layers go where — the partitioner does. Pull a bigger model than any single node can hold, and EXO spreads it; that’s the entire point.
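
The same request can be made from Python with only the standard library. This is a sketch under assumptions: the helper names `build_chat_request` and `chat` are made up for the example, the URL and model are the ones from the curl call above, and the response parsing assumes EXO returns the usual OpenAI Chat Completions shape (`choices[0].message.content`).

```python
import json
import urllib.request

# Endpoint taken from the curl example; adjust the host if EXO runs elsewhere.
EXO_URL = "http://localhost:52415/v1/chat/completions"


def build_chat_request(model: str, prompt: str, url: str = EXO_URL) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, prompt: str, timeout: float = 600.0) -> str:
    """Send the request and return the assistant's reply text.

    The generous timeout is an assumption: the first call may include
    downloading the weights from Hugging Face.
    """
    request = build_chat_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumes the standard OpenAI Chat Completions response layout.
    return body["choices"][0]["message"]["content"]
```

Calling `chat("mlx-community/Llama-3.2-1B-Instruct-4bit", "Explain ring partitioning in one sentence.")` against a running node should return the reply as a string. Splitting request construction from sending keeps the payload easy to inspect without a live cluster.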

On the interconnect: EXO ships day-0 support for RDMA over Thunderbolt 5, which the project claims cuts inter-device latency dramatically versus Wi-Fi. If you’re chaining Macs, a Thunderbolt bridge or 10GbE is a meaningful upgrade over Wi-Fi 6 — the network hop is the tax you pay for pooling, so the faster the link, the less you lose.

What performance actually looks like

Honest numbers matter more than hype. These are community-reported figures, not official benchmarks, so treat them as ballpark:

Setup	Model	Throughput (reported)
Single M2 Ultra 192GB	Llama 3.1 70B	~12–18 tok/s
2× M3 Max over Wi-Fi 6	Llama 3.1 70B	~6–10 tok/s
4× M3 Ultra Mac Studio	Qwen3-235B / DeepSeek v3.1	demoed, usable for chat

The pattern is the lesson: a single machine that can hold the model is faster than two that split it, because there’s no network hop. Pooling doesn’t make inference faster; it makes inference possible for models that wouldn’t otherwise fit. You reach for EXO when running a model that doesn’t fit on one box matters more than losing a few tokens per second. If a single machine already fits your model, a plain Ollama or MLX setup will be faster; see our Ollama MLX backend guide for the single-Mac path.
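
A toy model makes the network tax concrete. It assumes single-stream generation where every token passes through all layers in sequence, so splitting the model leaves total compute time unchanged and only adds a fixed delay per network hop. The function name and every number below are illustrative assumptions, not measurements of EXO.

```python
def tokens_per_second(compute_ms_per_token: float, hops: int, hop_latency_ms: float) -> float:
    """Estimate throughput when each token pays compute time plus per-hop latency.

    Toy model only: ignores bandwidth limits, batching, and overlap between
    compute and communication.
    """
    if compute_ms_per_token <= 0 or hops < 0 or hop_latency_ms < 0:
        raise ValueError("compute time must be positive; hops and latency non-negative")
    per_token_ms = compute_ms_per_token + hops * hop_latency_ms
    return 1000.0 / per_token_ms


# Illustrative inputs, not benchmarks.
single_box = tokens_per_second(compute_ms_per_token=100, hops=0, hop_latency_ms=0)
slow_link = tokens_per_second(compute_ms_per_token=100, hops=2, hop_latency_ms=50)
fast_link = tokens_per_second(compute_ms_per_token=100, hops=2, hop_latency_ms=1)
print(f"single machine: {single_box:.1f} tok/s")
print(f"two nodes, slow link: {slow_link:.1f} tok/s")
print(f"two nodes, fast link: {fast_link:.1f} tok/s")
```

With 100 ms of compute per token, zero hops give 10 tok/s, while two hops at 50 ms each halve that to 5 tok/s. Dropping the hop latency to 1 ms brings the pooled figure most of the way back to the single-machine number, which is why the interconnect matters so much.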

NVIDIA on Linux: the fork situation

If you searched for EXO because you want to pool consumer NVIDIA GPUs, here’s the unvarnished state in mid-2026:

Upstream EXO on Linux defaults to tinygrad on CPU. GPU users have hit the “GPU detected but showing 0.0 TFLOPS” wall (issue #821).
A community fork, Scottcjn/exo-cuda, restores NVIDIA CUDA inference through tinygrad and reports confirmed runs on Tesla V100 and M40 cards. You’ll need the NVIDIA driver, CUDA toolkit, and cuDNN installed.
Another fork, ArgentAIOS/nxo, targets NVIDIA/Linux and DGX Spark clusters specifically.

These forks work, but you’re trusting a third-party rewrite of the inference path, tracking a moving upstream, and accepting that mixing MLX and tinygrad backends in one heterogeneous cluster has documented rough edges. For a stable multi-GPU NVIDIA setup on Linux, you’re better served by vLLM with tensor parallelism or llama.cpp’s RPC mode. If you only need the big model occasionally, renting a multi-GPU box from RunPod is likely cheaper than buying cards that EXO can’t yet drive natively, and it sidesteps the fork maintenance entirely.

EXO vs the alternatives

Need	Use this	Why
Pool 2+ Apple Silicon Macs	EXO	Auto-discovery, MLX GPU, ring partitioning
Split one model across mixed nodes	llama.cpp RPC	Mature, CPU+GPU, but manual node config
Max throughput on NVIDIA GPUs	vLLM + Ray	Production-grade tensor/pipeline parallel
Single machine that fits the model	Ollama / LM Studio	Simpler, faster, no network hop

EXO’s edge is the zero-config Apple Silicon cluster. Nothing else makes “three Macs, one model, no setup” as painless. Its weakness is everything outside that lane.

When NOT to use EXO

You run NVIDIA GPUs on Linux and want them used. Stock EXO won’t, and the fork route is maintenance overhead. Pick vLLM or llama.cpp RPC.
Your model already fits on one machine. Pooling only adds latency. Run it locally.
You need production reliability. EXO is a fast-moving research-grade project with breaking changes between versions and Python-version landmines. It’s for tinkering and home labs, not an SLA-backed service.
Your only link is congested Wi-Fi. The network hop dominates throughput. Without Thunderbolt or wired Ethernet, expect single-digit tokens per second on large models.

FAQ

Is EXO free and open source? Yes. EXO is Apache 2.0 licensed, which permits commercial use without the restrictions some “open weight” model licenses carry.

Can I mix a Mac and a Windows PC in one EXO cluster? EXO is designed to pool heterogeneous devices, and the partitioner handles uneven memory. In practice the smooth path today is Apple Silicon. Mixing a CPU-bound Linux/Windows node with GPU-accelerated Macs works, but the slowest node and the network link cap your throughput.

Does EXO support an Ollama-compatible API? Yes — the cluster API at http://localhost:52415 exposes OpenAI Chat Completions, Anthropic Messages, and Ollama-compatible endpoints, so existing tooling can point at EXO with a URL change.

Why is my Linux GPU showing 0.0 TFLOPS in EXO? Upstream EXO defaults to CPU on Linux. NVIDIA acceleration requires a community fork like exo-cuda plus the CUDA toolkit and cuDNN. This is a known limitation, not a misconfiguration on your end.

What hardware do I actually need for a 70B model? Roughly 40GB+ of pooled memory at 4-bit quantization. Two 24GB+ Macs, or one large-unified-memory machine, will do it. For the underlying hardware math on memory and bandwidth, see the runaihome.com hardware guides.
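
The 40GB figure follows from simple arithmetic: parameters times bits per weight, divided by 8 to get bytes, plus some headroom. The sketch below shows that calculation. The function names are made up for the example, and the 15% overhead for KV cache and runtime buffers is an assumed round number, not a figure from EXO.

```python
def estimate_memory_gb(params_billion: float, bits_per_weight: float,
                       overhead_fraction: float = 0.15) -> float:
    """Rough memory needed to hold a quantized model, in decimal GB.

    overhead_fraction is an assumption covering KV cache and runtime
    buffers; real usage depends on context length and backend.
    """
    # 1e9 parameters * (bits / 8) bytes each = that many GB.
    weight_gb = params_billion * bits_per_weight / 8
    return weight_gb * (1 + overhead_fraction)


def fits_in_pool(device_memory_gb: list[float], required_gb: float) -> bool:
    """Check whether the summed device memory covers the requirement."""
    return sum(device_memory_gb) >= required_gb


needed = estimate_memory_gb(params_billion=70, bits_per_weight=4)
print(f"70B at 4-bit needs roughly {needed:.0f} GB")
print("two 24GB Macs:", fits_in_pool([24, 24], needed))
print("two 16GB Macs:", fits_in_pool([16, 16], needed))
```

The weights alone come to 35 GB for a 70B model at 4 bits, and the assumed headroom pushes that to about 40 GB. Treat the pool check as optimistic, since the operating system and other apps also need memory on every node.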

Recommended Gear

Mac Studio M3 Ultra — large unified memory makes it the natural EXO cluster node
RTX 3090 — 24GB cards if you go the community-fork NVIDIA route on Linux
