AMD Lemonade Review 2026: GPU, NPU, and Multi-Modal
TL;DR: Lemonade v10.6 is AMD’s open-source LLM server that adds NPU prefill acceleration, image gen, and speech to one OpenAI-compatible endpoint. NPU acceleration works only on Ryzen AI 300/400 chips — on other hardware, Ollama’s ecosystem is wider. AMD Ryzen AI users should pick Lemonade; everyone else should consider Ollama first.
| Lemonade v10.6 | Ollama v0.6 | LocalAI | |
|---|---|---|---|
| Best for | AMD GPU + NPU hybrid, multi-modal | Cross-platform, broadest ecosystem | OpenAI API proxy, any hardware |
| Install | winget or Snap | curl one-liner | Docker Compose |
| Hardware | AMD RDNA3+, NVIDIA, Apple M, CPU | Any GPU | Any hardware |
| Model formats | GGUF, ONNX, FLM, SafeTensors | GGUF (Ollama manifest) | GGUF, OpenVINO, more |
| Multi-modal | LLM + image gen + Whisper + TTS | LLM + vision models | LLM + Whisper + SD |
| The catch | NPU only on Ryzen AI 300/400 | No NPU acceleration | High setup complexity |
Honest take: On a Ryzen AI 300-series machine, Lemonade is the better daily driver — it uses hardware that Ollama leaves idle and bundles image gen plus speech in one package. On Nvidia hardware or wherever you need maximum integration coverage, stick with Ollama.
What Lemonade Is and Why AMD Built It
Ollama solved cross-platform local LLM deployment cleanly. But it left AMD NPU owners with idle hardware — the dedicated AI accelerators in Ryzen AI chips sat unused because Ollama has no FastFlowLM backend.
Lemonade is AMD’s answer. Released under Apache 2.0 and available at github.com/lemonade-sdk/lemonade, it bundles:
- An OpenAI-compatible HTTP API at
http://localhost:13305/v1 - llama.cpp with Vulkan backend for AMD and NVIDIA GPUs
- FastFlowLM for XDNA2 NPU acceleration on Ryzen AI chips
- Stable Diffusion image generation
- Whisper speech-to-text
- Kokoro text-to-speech
- A model manager with one-command downloads from Hugging Face
The core design difference from Ollama is hardware-tier splitting. On a Ryzen AI 300-series chip, prompt processing (prefill) goes to the NPU while token generation (decode) goes to the iGPU. This is not marketing — the NPU has better compute throughput for dense matrix math during prefill, and the iGPU has better memory bandwidth for sequential token generation. The result is lower Time to First Token on long system prompts and agentic chains.
Current version: v10.6.0 (released May 21, 2026). Linux NPU support shipped with Lemonade 10.0 in March 2026 via the FastFlowLM runtime.
Hardware Compatibility
| Platform | Backend | Notes |
|---|---|---|
| AMD Ryzen AI 300/400 (XDNA2) | FastFlowLM NPU + Vulkan iGPU | Strix Halo supports up to 128 GB unified memory |
| AMD Radeon discrete (RDNA2/3/4) | llama.cpp + Vulkan | Standard VRAM limits; add 2–4 GB overhead |
| NVIDIA (Turing–Blackwell) | llama.cpp + Vulkan or CUDA | CUDA backend available since v10+ |
| Apple Silicon (M1–M4) | Metal via llama.cpp | Unified memory; M4 Max competitive at large models |
| x86_64 CPU | llama.cpp CPU | Small models only; no hardware acceleration |
NPU acceleration requires Ryzen AI 300-series or 400-series specifically — the XDNA2 architecture. Earlier Ryzen AI chips (7000, 8000, 200-series) have NPUs that no current runtime supports for LLM inference. On those systems, Lemonade falls back to Vulkan on the GPU, which is functionally the same as running Ollama.
Supported Linux distros: Ubuntu 24.04+, Fedora 43+, Debian Trixie+, Arch. Docker and Snap packages are available. For hardware context on AMD GPU builds, see runaihome.com for current RDNA4 GPU benchmarks and build guides.
Installation
Windows
winget install AMD.LemonadeServer
This installs the server and a Tauri desktop app (system-tray GUI for model downloads and server management). Alternatively, grab the .msi from the GitHub releases page. After install, the server starts automatically on port 13305.
Linux (Ubuntu 24.04+)
# Snap — works across Ubuntu 24.04+, Fedora 43+, Arch
sudo snap install lemonade
# Docker
docker run -d --gpus all -p 13305:13305 lemonadesdk/lemonade:latest
For NPU support on Linux, you need the XDNA driver and FastFlowLM runtime installed separately — the Lemonade docs cover the dependency chain. It is more involved than the Windows path. For most Linux users without a Ryzen AI 300/400 chip, the Snap install with Vulkan fallback is the practical path.
Verify the server is running
curl http://localhost:13305/v1/models
Expected output on a fresh install with no models downloaded:
{"object":"list","data":[]}
Running Your First Model
lemonade run Gemma-4-E2B-it-GGUF
This pulls the model from Hugging Face (if not cached) and starts a chat session in your terminal. The model manager uses Hugging Face slug format — you can also import any custom GGUF or ONNX model from Hugging Face directly.
Check which backend Lemonade selected for your hardware:
curl http://localhost:13305/stats
The response includes the active inference engine: vulkan, fastflowlm, rocm, or cpu. If you expected fastflowlm and got vulkan, check that your XDNA driver is installed and you’re on a Ryzen AI 300/400 chip.
To test image generation:
curl http://localhost:13305/v1/images/generations \
-H "Content-Type: application/json" \
-d '{"prompt": "a terminal screen in a dark room", "n": 1}'
NPU + GPU Hybrid: Numbers From Real Hardware
On a Ryzen AI Max+ 395 (Strix Halo, 128 GB unified memory), the NPU handles prompt processing and the iGPU handles decode. Community benchmarks from May–June 2026 on this configuration:
| Model | Quantization | Tokens/sec |
|---|---|---|
| GPT-OSS 120B | Q4_K_M | ~50 tok/s |
| Qwen3.5-122B | Q4 | ~35 tok/s |
| Qwen3-Coder-Next | Q4 | ~43 tok/s |
For comparison: an RTX 4090 running llama.cpp hits 50–80 tok/s on 7B models at Q4 but stalls on 70B+ without aggressive quantization (limited by 24 GB VRAM). The Strix Halo runs 120B at full Q4_K_M in 128 GB of unified memory — a different tier of capability.
On smaller Ryzen AI 300 systems (Strix Point, 32–64 GB), expect:
- Llama 3.2-3B on NPU: ~28 tok/s at under 2 W
- Models above 8B: fall back to iGPU via Vulkan
FastFlowLM 0.9.35, the current NPU runtime bundled in Lemonade 10.6, supports context windows up to 256k tokens on XDNA2 NPUs.
Multi-Modal in One Server
Lemonade bundles three additional inference backends behind the same API port:
Image generation: SDXL-Turbo via /v1/images/generations. Any client that supports the OpenAI image endpoint works — including the ComfyUI API adapter. See our ComfyUI API tutorial for chaining this into automated pipelines.
Speech-to-text: Whisper backend via /v1/audio/transcriptions. Uses the same model weights as whisper.cpp.
Text-to-speech: Kokoro TTS via /v1/audio/speech. Known limitation as of v10.6: voices not in the pre-configured list produce muted audio. Custom voice loading is not yet supported.
Running these three modalities as separate services (Ollama + ComfyUI + a Whisper server) adds coordination overhead — three processes, three ports, three model caches. Lemonade consolidates them into one service with one model manager. For a home server running all three, that’s meaningful.
Connecting to Open WebUI
Open WebUI supports custom OpenAI-compatible endpoints. To add Lemonade:
- Open WebUI settings → Connections → Add Connection
- API URL:
http://localhost:13305/v1 - API key: leave blank (Lemonade does not validate keys)
- Save and confirm models appear in the model list
If you’re running Open WebUI in Docker and Lemonade natively on the host:
http://host.docker.internal:13305/v1
The rest of the setup is identical to the Ollama path — see our Ollama + Open WebUI on Linux guide for the full stack. Swap the backend URL and everything else carries over.
Continue (the VS Code AI extension) supports Lemonade the same way: set a custom OpenAI base URL in .continue/config.json. Dify and n8n both have validated connectors for OpenAI-compatible endpoints — Lemonade drops in without extra configuration.
When NOT to Use Lemonade
On pre-300-series AMD hardware. Ryzen AI 7000, 8000, and 200-series NPUs aren’t supported for LLM inference by any current runtime. Lemonade falls back to CPU or Vulkan GPU. At that point, Ollama’s wider model library and ecosystem make it the better choice.
When ecosystem breadth matters. Ollama has ~95,000 GitHub stars and integrations across LangChain, LlamaIndex, Open WebUI plugins, VS Code extensions, Raycast, and hundreds of community tools. Lemonade’s validated integrations cover the main ones but are narrower. Expect rough edges in any integration not explicitly listed in the docs.
For production multi-user workloads. Neither Lemonade nor Ollama are designed for high-concurrency serving with proper auth, rate limiting, and health checks. For that, vLLM is the right tool — see our vLLM production setup guide. If you want to evaluate cloud vs. self-hosted for production GPU workloads, RunPod is worth benchmarking against a local setup before committing to hardware.
When the model you need isn’t in Lemonade’s manager. Ollama maintains a curated library with version tags and tested compatibility. Lemonade pulls from Hugging Face generically — more flexible, but you will hit format or compatibility issues with obscure models that Ollama’s curation catches before publishing.
For understanding quantization trade-offs when selecting models, see our GGUF quantization guide.
FAQ
Does Lemonade work on Windows 10? No. Windows 11 is required. AMD’s Ryzen AI NPU drivers are Windows 11-only, and the desktop app requires it as well.
Can I run Lemonade and Ollama simultaneously? Yes — they use different ports (Lemonade: 13305, Ollama: 11434). Running both at once is a reasonable setup: Lemonade for NPU-accelerated models and image gen, Ollama for models with better library coverage.
Does Lemonade support LoRA adapters? Not through the model manager API as of v10.6.0. The GGUF backend inherits llama.cpp’s command-line LoRA support, but there’s no model-manager UI for LoRA loading. It’s on the roadmap per GitHub issues.
What is the difference between Lemonade’s AMD GPU path and ROCm?
Lemonade defaults to llama.cpp + Vulkan for AMD GPUs, which works on all RDNA cards without a full ROCm installation. The experimental vllm:rocm backend is available for Ryzen AI Max+ on Linux but requires the full ROCm stack. For most users, Vulkan is the practical path.
Is NPU inference output deterministic? FastFlowLM on XDNA2 produces deterministic output within the same model version and quantization. Switching quantization levels or model versions will produce different outputs, same as any other inference backend.
Sources
- Lemonade GitHub — lemonade-sdk/lemonade
- FastFlowLM GitHub — FastFlowLM/FastFlowLM
- Lemonade FAQ — hardware requirements and known limitations
- Lemonade for Local AI — AMD Developer Technical Article (2026)
- AMD Ryzen AI NPUs Are Finally Useful Under Linux For Running LLMs — Phoronix
- Lemonade by AMD — Hacker News discussion (item 47612724)
- Local Tiny Agents with AMD NPU and iGPU Acceleration — Hugging Face MCP Course
- AMD Lemonade: A Unified API for Local AI Developers (AMD, 2026)
Recommended Gear
Was this article helpful?
Thanks for the feedback — it helps improve future articles.
Need hands-on help?
I offer 1-on-1 technical consulting for local AI setup, GPU selection, and AI coding tool configuration — same topics covered on this site.
Book a session — $49 / hour →