EXO Framework Setup Guide 2026: Pool Devices for Big LLMs
TL;DR: EXO pools the memory of several devices into one cluster so you can run models bigger than any single machine holds. In mid-2026 it’s a genuinely good Apple Silicon tool and a rough one for NVIDIA on Linux, where it still defaults to CPU. Set expectations accordingly.
| | EXO | llama.cpp RPC | vLLM + Ray |
|---|---|---|---|
| Best for | Mixing Macs/devices over LAN | Splitting one model across a few nodes | Multi-GPU production serving |
| Setup effort | Low (auto-discovery) | Medium (manual node list) | High (cluster config) |
| GPU on Linux | CPU by default; NVIDIA via fork | Full CUDA/Metal | Full CUDA |
| The catch | Network latency tax; Linux GPU is roadmap | You wire up every node | Needs real GPUs, not laptops |
Honest take: If you have two or more Apple Silicon Macs sitting around, EXO is the fastest way to run a 70B+ model across them. If you have NVIDIA cards on Linux, use vLLM or llama.cpp RPC instead — EXO isn’t there yet.
What EXO actually is
EXO (the exo-explore/exo project) connects every device on your network into a single AI cluster. The pitch is simple: you probably don’t own one machine with 128GB of unified memory, but you might own three machines with 48GB each. EXO shards a model across them so the cluster can hold what no single node can.
It’s licensed Apache 2.0, which matters: you can use it commercially without the license asterisks attached to “open weights” model releases. The repo is active (latest tagged release v1.0.71, April 23, 2026, with commits landing through late June 2026), and it’s a full rewrite of the original project, which is now archived under exo-explore/ex-exo. If you find an old tutorial referencing the archived repo, ignore it.
Two architectural choices make EXO different from a typical inference server:
- No master-worker. Devices connect peer-to-peer. There’s no head node to babysit — any device that’s reachable on the network can join the ring and contribute memory.
- Ring memory-weighted partitioning. EXO splits the model into layers and assigns each device a number of layers proportional to its memory. A 64GB Mac Studio carries more of the model than a 16GB MacBook Air, automatically.
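To make the memory-weighted split concrete, here is a minimal Python sketch of the idea. It is not EXO's actual partitioner: the function name `partition_layers`, the largest-remainder rounding, and the device names are assumptions made for illustration. The 80-layer count matches a Llama-class 70B model.

```python
from typing import Dict


def partition_layers(memory_gb: Dict[str, float], n_layers: int) -> Dict[str, int]:
    """Split n_layers across devices in proportion to each device's memory.

    Illustrative only: EXO's real partitioner may round or order differently.
    """
    total = sum(memory_gb.values())
    if total <= 0:
        raise ValueError("cluster has no memory")
    # Ideal (fractional) share of layers for each device.
    ideal = {name: n_layers * mem / total for name, mem in memory_gb.items()}
    # Start from the floor of each share...
    layers = {name: int(share) for name, share in ideal.items()}
    # ...then give the leftover layers to the largest fractional remainders.
    leftover = n_layers - sum(layers.values())
    by_remainder = sorted(ideal, key=lambda name: ideal[name] - layers[name], reverse=True)
    for name in by_remainder[:leftover]:
        layers[name] += 1
    return layers


# Hypothetical three-device cluster, memory in GB.
cluster = {"mac-studio": 64, "macbook-pro": 48, "macbook-air": 16}
print(partition_layers(cluster, 80))
# {'mac-studio': 40, 'macbook-pro': 30, 'macbook-air': 10}
```

The 64GB machine ends up with four times as many layers as the 16GB one, which is the behavior the bullet above describes.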
Devices discover each other with no manual config. Start EXO on two machines on the same LAN and they find each other. The cluster exposes a web UI and API at http://localhost:52415, and the API speaks three dialects: OpenAI Chat Completions, Anthropic’s Claude Messages format, and the Ollama API. That last one is the headline for self-hosters — anything you’ve already wired to talk to Ollama can point at EXO instead with a URL change.
The hardware reality check (read this before you buy anything)
Most EXO write-ups skip the single most important sentence in the documentation. Here it is, verbatim from the README:
“On macOS, exo uses the GPU. On Linux, exo currently runs on CPU. We are working on extending hardware accelerator support.”
Read that twice. On macOS, EXO uses the Metal GPU through Apple’s MLX framework — this is the path the project optimizes for. On Linux, the default backend (tinygrad) runs on CPU. GPU acceleration on Linux is a roadmap item, not a shipped feature in the upstream project.
That single fact reshapes the whole “build a home cluster” story. The viral benchmarks you’ve seen — pooling three RTX 3090s for a frontier model — are not something stock EXO on Linux delivers out of the box today. The maintainers’ own showcase runs are Apple Silicon: community demos pool 4× M3 Ultra Mac Studios to run Qwen3-235B, DeepSeek v3.1, and Kimi K2-class models. That’s where EXO is real in 2026.
If you have NVIDIA hardware, there’s a path, but it’s a fork-and-fiddle path — covered below.
Installing EXO
EXO is no longer a pip-installable package: the v1.0 rewrite builds a Node.js dashboard and runs through the uv Python toolchain. You need uv, Node.js (with npm), and Rust installed first.
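A quick preflight check can save a failed build. The sketch below only looks for executables on your PATH; it is not part of EXO. Treating `cargo` as the marker for a Rust install is an assumption, and `missing_tools` is a hypothetical helper name.

```python
import shutil
from typing import Callable, List, Optional

# Executables the source install relies on. "cargo" stands in for the Rust
# toolchain here; that mapping is an assumption of this sketch.
REQUIRED_TOOLS = ["uv", "node", "npm", "cargo"]


def missing_tools(
    tools: List[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


missing = missing_tools(REQUIRED_TOOLS)
if missing:
    print("Install these before building EXO:", ", ".join(missing))
else:
    print("All prerequisites found.")
```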
macOS — the easy path:
brew install --cask exo
That installs the prebuilt app. Prefer source? Clone and run it:
git clone https://github.com/exo-explore/exo
cd exo/dashboard && npm install && npm run build && cd ..
uv run exo
Linux:
git clone https://github.com/exo-explore/exo
cd exo/dashboard && npm install && npm run build && cd ..
uv run exo
Either way, you should see the node come up and print the dashboard URL:
$ uv run exo
exo node started
dashboard + API: http://localhost:52415
discovering peers on local network...
Open http://localhost:52415 in a browser and you get a chat UI plus a topology view of the cluster.
A real problem you’ll hit: Python version. EXO is happiest on Python 3.12. Installs on 3.13 have failed for users on Apple Silicon (tracked in GitHub issue #446 and the tinygrad version-incompatibility issue #867). If uv run exo dies during dependency resolution, pin the interpreter:
uv venv --python 3.12
uv run exo
Building a cluster
This is where EXO earns its keep. Run the same uv run exo command on a second machine on the same network. No config file, no IP list, no head node. The two nodes discover each other and the topology view in the dashboard updates to show both, with their combined memory.
To run a model, request it through the API. Because EXO speaks the OpenAI format, the call is ordinary:
curl http://localhost:52415/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
    "messages": [{"role": "user", "content": "Explain ring partitioning in one sentence."}]
  }'
EXO downloads the weights from Hugging Face on first use and shards them across the ring based on each device’s available memory. You don’t pick which layers go where — the partitioner does. Pull a bigger model than any single node can hold, and EXO spreads it; that’s the entire point.
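If you would rather call the cluster from a script, the same request can be made with Python's standard library. This is a minimal sketch: the URL and payload come from the curl example above, while the helper names `build_chat_request` and `ask`, the timeout, and the assumption that the reply follows the standard OpenAI Chat Completions shape are mine.

```python
import json
import urllib.request

# Endpoint taken from the curl example; adjust the host if you call a remote node.
EXO_URL = "http://localhost:52415/v1/chat/completions"


def build_chat_request(model: str, prompt: str, url: str = EXO_URL) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request without sending it."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model: str, prompt: str, timeout: float = 600.0) -> str:
    """Send the request and return the assistant's text.

    The long timeout is a guess: the first call may include a model download.
    """
    with urllib.request.urlopen(build_chat_request(model, prompt), timeout=timeout) as resp:
        reply = json.load(resp)
    # Standard OpenAI Chat Completions response shape (assumed to hold for EXO).
    return reply["choices"][0]["message"]["content"]
```

With a node running, `print(ask("mlx-community/Llama-3.2-1B-Instruct-4bit", "Explain ring partitioning in one sentence."))` mirrors the curl call.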
On the interconnect: EXO ships day-0 support for RDMA over Thunderbolt 5, which the project claims cuts inter-device latency dramatically versus Wi-Fi. If you’re chaining Macs, a Thunderbolt bridge or 10GbE is a meaningful upgrade over Wi-Fi 6 — the network hop is the tax you pay for pooling, so the faster the link, the less you lose.
What performance actually looks like
Honest numbers matter more than hype. These are community-reported figures, not official benchmarks, so treat them as ballpark:
| Setup | Model | Throughput (reported) |
|---|---|---|
| Single M2 Ultra 192GB | Llama 3.1 70B | ~12–18 tok/s |
| 2× M3 Max over Wi-Fi 6 | Llama 3.1 70B | ~6–10 tok/s |
| 4× M3 Ultra Mac Studio | Qwen3-235B / DeepSeek v3.1 | demoed, usable for chat |
The pattern is the lesson: a single machine that can hold the model is faster than two that split it, because there’s no network hop. Pooling doesn’t make inference faster; it makes inference possible for models that wouldn’t otherwise fit. You reach for EXO when “doesn’t fit on one box” outweighs “a few tokens per second slower.” If your model already fits on a single machine, a plain Ollama or MLX setup will be faster; see our Ollama MLX backend guide for the single-Mac path.
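A back-of-envelope model shows why the network hop costs you. The sketch below is a rough upper bound, not a benchmark. It assumes token generation is limited by memory bandwidth, that every token passes through each shard in sequence, and that a ring of n devices adds n hops per token. The function name and all input numbers are illustrative assumptions.

```python
from typing import List, Tuple


def decode_tokens_per_sec(shards: List[Tuple[float, float]], hop_latency_s: float = 0.0) -> float:
    """Rough upper bound on decode speed for a model split into sequential shards.

    shards: (shard_size_gb, memory_bandwidth_gb_per_s) for each device, in ring order.
    hop_latency_s: assumed latency of each device-to-device hop.
    """
    # Each token reads every shard's weights once (bandwidth-bound assumption).
    compute_s = sum(size / bandwidth for size, bandwidth in shards)
    # A ring of n devices has n hops per token; a single device has none.
    hops = len(shards) if len(shards) > 1 else 0
    return 1.0 / (compute_s + hops * hop_latency_s)


# Illustrative inputs: a 35 GB model on one 800 GB/s machine...
single = decode_tokens_per_sec([(35.0, 800.0)])
# ...versus the same model split across two 400 GB/s machines with 5 ms hops.
pooled = decode_tokens_per_sec([(17.5, 400.0), (17.5, 400.0)], hop_latency_s=0.005)
print(round(single, 1), round(pooled, 1))  # 22.9 10.3
```

Real throughput lands below these ceilings, but the ordering matches the table: splitting the model never beats a single machine that can hold it.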
NVIDIA on Linux: the fork situation
If you searched for EXO because you want to pool consumer NVIDIA GPUs, here’s the unvarnished state in mid-2026:
- Upstream EXO on Linux defaults to tinygrad on CPU. GPU users have hit the “GPU detected but showing 0.0 TFLOPS” wall (issue #821).
- A community fork, Scottcjn/exo-cuda, restores NVIDIA CUDA inference through tinygrad and reports confirmed runs on Tesla V100 and M40 cards. You’ll need the NVIDIA driver, CUDA toolkit, and cuDNN installed.
- Another fork, ArgentAIOS/nxo, targets NVIDIA/Linux and DGX Spark clusters specifically.
These forks work, but you’re trusting a third-party rewrite of the inference path, tracking a moving upstream, and accepting that mixing MLX and tinygrad backends in one heterogeneous cluster has documented rough edges. For a stable multi-GPU NVIDIA setup on Linux, you’re better served by vLLM with tensor parallelism or llama.cpp’s RPC mode. If you only need the big model occasionally, renting a multi-GPU box from RunPod is cheaper than buying cards that EXO can’t yet drive natively, and it sidesteps the fork maintenance entirely.
EXO vs the alternatives
| Need | Use this | Why |
|---|---|---|
| Pool 2+ Apple Silicon Macs | EXO | Auto-discovery, MLX GPU, ring partitioning |
| Split one model across mixed nodes | llama.cpp RPC | Mature, CPU+GPU, but manual node config |
| Max throughput on NVIDIA GPUs | vLLM + Ray | Production-grade tensor/pipeline parallel |
| Single machine that fits the model | Ollama / LM Studio | Simpler, faster, no network hop |
EXO’s edge is the zero-config Apple Silicon cluster. Nothing else makes “three Macs, one model, no setup” as painless. Its weakness is everything outside that lane.
When NOT to use EXO
- You run NVIDIA GPUs on Linux and want them used. Stock EXO won’t, and the fork route is maintenance overhead. Pick vLLM or llama.cpp RPC.
- Your model already fits on one machine. Pooling only adds latency. Run it locally.
- You need production reliability. EXO is a fast-moving research-grade project with breaking changes between versions and Python-version landmines. It’s for tinkering and home labs, not an SLA-backed service.
- Your only link is congested Wi-Fi. The network hop dominates throughput. Without Thunderbolt or wired Ethernet, expect single-digit tokens per second on large models.
FAQ
Is EXO free and open source? Yes. EXO is Apache 2.0 licensed, which permits commercial use without the restrictions some “open weight” model licenses carry.
Can I mix a Mac and a Windows PC in one EXO cluster? EXO is designed to pool heterogeneous devices, and the partitioner handles uneven memory. In practice the smooth path today is Apple Silicon. Mixing a CPU-bound Linux/Windows node with GPU-accelerated Macs works, but the slowest node and the network link cap your throughput.
Does EXO support an Ollama-compatible API? Yes. The cluster API at http://localhost:52415 exposes OpenAI Chat Completions, Anthropic Messages, and Ollama-compatible endpoints, so existing tooling can point at EXO with a URL change.
Why is my Linux GPU showing 0.0 TFLOPS in EXO? Upstream EXO defaults to CPU on Linux. NVIDIA acceleration requires a community fork like exo-cuda plus the CUDA toolkit and cuDNN. This is a known limitation, not a misconfiguration on your end.
What hardware do I actually need for a 70B model? Roughly 40GB+ of pooled memory at 4-bit quantization. Two 24GB+ Macs, or one large-unified-memory machine, will do it. For the underlying hardware math on memory and bandwidth, see the runaihome.com hardware guides.
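The 40GB figure follows from simple arithmetic, sketched below. Weight size is parameters times bits per weight divided by eight. The 5 GB overhead for KV cache and runtime is a placeholder chosen to match the rough 40GB figure above, not a measured value, and the function names are hypothetical.

```python
from typing import List


def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def required_gb(params_billions: float, bits_per_weight: float, overhead_gb: float = 5.0) -> float:
    """Weights plus an assumed allowance for KV cache and runtime overhead."""
    return weights_gb(params_billions, bits_per_weight) + overhead_gb


def cluster_fits(device_memory_gb: List[float], params_billions: float,
                 bits_per_weight: float, overhead_gb: float = 5.0) -> bool:
    """True if the pooled memory covers the estimate. Ignores what the OS itself uses."""
    return sum(device_memory_gb) >= required_gb(params_billions, bits_per_weight, overhead_gb)


print(weights_gb(70, 4))              # 35.0
print(required_gb(70, 4))             # 40.0
print(cluster_fits([24, 24], 70, 4))  # True
print(cluster_fits([16, 16], 70, 4))  # False
```

Because the check ignores memory the operating system and other apps need, treat a narrow pass as a warning rather than a guarantee.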
Sources
- exo-explore/exo — official GitHub repository (README, hardware support, install)
- exo-explore/exo — Getting Started (DeepWiki)
- Scottcjn/exo-cuda — NVIDIA CUDA fork via tinygrad
- exo GitHub issue #446 — Python 3.13 install failures
- exo GitHub issue #821 — GPU detected but 0.0 TFLOPS
- Self-Hosting Llama-70B on Apple Silicon with Exo and MLX (Medium)
Recommended Gear
- Mac Studio M3 Ultra — large unified memory makes it the natural EXO cluster node
- RTX 3090 — 24GB cards if you go the community-fork NVIDIA route on Linux