Jun 7, 2026

AMD Lemonade Review 2026: GPU, NPU, and Multi-Modal

By AIFoss · 10 min read

amdllmselfhostednpuollama

TL;DR: Lemonade v10.6 is AMD’s open-source LLM server that adds NPU prefill acceleration, image gen, and speech to one OpenAI-compatible endpoint. NPU acceleration works only on Ryzen AI 300/400 chips — on other hardware, Ollama’s ecosystem is wider. AMD Ryzen AI users should pick Lemonade; everyone else should consider Ollama first.

	Lemonade v10.6	Ollama v0.6	LocalAI
Best for	AMD GPU + NPU hybrid, multi-modal	Cross-platform, broadest ecosystem	OpenAI API proxy, any hardware
Install	`winget` or Snap	`curl` one-liner	Docker Compose
Hardware	AMD RDNA3+, NVIDIA, Apple M, CPU	Any GPU	Any hardware
Model formats	GGUF, ONNX, FLM, SafeTensors	GGUF (Ollama manifest)	GGUF, OpenVINO, more
Multi-modal	LLM + image gen + Whisper + TTS	LLM + vision models	LLM + Whisper + SD
The catch	NPU only on Ryzen AI 300/400	No NPU acceleration	High setup complexity

Honest take: On a Ryzen AI 300-series machine, Lemonade is the better daily driver — it uses hardware that Ollama leaves idle and bundles image gen plus speech in one package. On Nvidia hardware or wherever you need maximum integration coverage, stick with Ollama.

What Lemonade Is and Why AMD Built It

Ollama solved cross-platform local LLM deployment cleanly. But it left AMD NPU owners with idle hardware — the dedicated AI accelerators in Ryzen AI chips sat unused because Ollama has no FastFlowLM backend.

Lemonade is AMD’s answer. Released under Apache 2.0 and available at github.com/lemonade-sdk/lemonade, it bundles:

An OpenAI-compatible HTTP API at http://localhost:13305/v1
llama.cpp with Vulkan backend for AMD and NVIDIA GPUs
FastFlowLM for XDNA2 NPU acceleration on Ryzen AI chips
Stable Diffusion image generation
Whisper speech-to-text
Kokoro text-to-speech
A model manager with one-command downloads from Hugging Face

The core design difference from Ollama is hardware-tier splitting. On a Ryzen AI 300-series chip, prompt processing (prefill) goes to the NPU while token generation (decode) goes to the iGPU. This is not marketing — the NPU has better compute throughput for dense matrix math during prefill, and the iGPU has better memory bandwidth for sequential token generation. The result is lower Time to First Token on long system prompts and agentic chains.

Current version: v10.6.0 (released May 21, 2026). Linux NPU support shipped with Lemonade 10.0 in March 2026 via the FastFlowLM runtime.

Hardware Compatibility

Platform	Backend	Notes
AMD Ryzen AI 300/400 (XDNA2)	FastFlowLM NPU + Vulkan iGPU	Strix Halo supports up to 128 GB unified memory
AMD Radeon discrete (RDNA2/3/4)	llama.cpp + Vulkan	Standard VRAM limits; add 2–4 GB overhead
NVIDIA (Turing–Blackwell)	llama.cpp + Vulkan or CUDA	CUDA backend available since v10+
Apple Silicon (M1–M4)	Metal via llama.cpp	Unified memory; M4 Max competitive at large models
x86_64 CPU	llama.cpp CPU	Small models only; no hardware acceleration

NPU acceleration requires Ryzen AI 300-series or 400-series specifically — the XDNA2 architecture. Earlier Ryzen AI chips (7000, 8000, 200-series) have NPUs that no current runtime supports for LLM inference. On those systems, Lemonade falls back to Vulkan on the GPU, which is functionally the same as running Ollama.

Supported Linux distros: Ubuntu 24.04+, Fedora 43+, Debian Trixie+, Arch. Docker and Snap packages are available. For hardware context on AMD GPU builds, see runaihome.com for current RDNA4 GPU benchmarks and build guides.

Installation

Windows

winget install AMD.LemonadeServer

This installs the server and a Tauri desktop app (system-tray GUI for model downloads and server management). Alternatively, grab the .msi from the GitHub releases page. After install, the server starts automatically on port 13305.

Linux (Ubuntu 24.04+)

# Snap — works across Ubuntu 24.04+, Fedora 43+, Arch
sudo snap install lemonade

# Docker
docker run -d --gpus all -p 13305:13305 lemonadesdk/lemonade:latest

For NPU support on Linux, you need the XDNA driver and FastFlowLM runtime installed separately — the Lemonade docs cover the dependency chain. It is more involved than the Windows path. For most Linux users without a Ryzen AI 300/400 chip, the Snap install with Vulkan fallback is the practical path.

Verify the server is running

curl http://localhost:13305/v1/models

Expected output on a fresh install with no models downloaded:

{"object":"list","data":[]}

Running Your First Model

lemonade run Gemma-4-E2B-it-GGUF

This pulls the model from Hugging Face (if not cached) and starts a chat session in your terminal. The model manager uses Hugging Face slug format — you can also import any custom GGUF or ONNX model from Hugging Face directly.

Check which backend Lemonade selected for your hardware:

curl http://localhost:13305/stats

The response includes the active inference engine: vulkan, fastflowlm, rocm, or cpu. If you expected fastflowlm and got vulkan, check that your XDNA driver is installed and you’re on a Ryzen AI 300/400 chip.

To test image generation:

curl http://localhost:13305/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{"prompt": "a terminal screen in a dark room", "n": 1}'

NPU + GPU Hybrid: Numbers From Real Hardware

On a Ryzen AI Max+ 395 (Strix Halo, 128 GB unified memory), the NPU handles prompt processing and the iGPU handles decode. Community benchmarks from May–June 2026 on this configuration:

Model	Quantization	Tokens/sec
GPT-OSS 120B	Q4_K_M	~50 tok/s
Qwen3.5-122B	Q4	~35 tok/s
Qwen3-Coder-Next	Q4	~43 tok/s

For comparison: an RTX 4090 running llama.cpp hits 50–80 tok/s on 7B models at Q4 but stalls on 70B+ without aggressive quantization (limited by 24 GB VRAM). The Strix Halo runs 120B at full Q4_K_M in 128 GB of unified memory — a different tier of capability.

On smaller Ryzen AI 300 systems (Strix Point, 32–64 GB), expect:

Llama 3.2-3B on NPU: ~28 tok/s at under 2 W
Models above 8B: fall back to iGPU via Vulkan

FastFlowLM 0.9.35, the current NPU runtime bundled in Lemonade 10.6, supports context windows up to 256k tokens on XDNA2 NPUs.

Lemonade bundles three additional inference backends behind the same API port:

Image generation: SDXL-Turbo via /v1/images/generations. Any client that supports the OpenAI image endpoint works — including the ComfyUI API adapter. See our ComfyUI API tutorial for chaining this into automated pipelines.

Speech-to-text: Whisper backend via /v1/audio/transcriptions. Uses the same model weights as whisper.cpp.

Text-to-speech: Kokoro TTS via /v1/audio/speech. Known limitation as of v10.6: voices not in the pre-configured list produce muted audio. Custom voice loading is not yet supported.

Running these three modalities as separate services (Ollama + ComfyUI + a Whisper server) adds coordination overhead — three processes, three ports, three model caches. Lemonade consolidates them into one service with one model manager. For a home server running all three, that’s meaningful.

Connecting to Open WebUI

Open WebUI supports custom OpenAI-compatible endpoints. To add Lemonade:

Open WebUI settings → Connections → Add Connection
API URL: http://localhost:13305/v1
API key: leave blank (Lemonade does not validate keys)
Save and confirm models appear in the model list

If you’re running Open WebUI in Docker and Lemonade natively on the host:

http://host.docker.internal:13305/v1

The rest of the setup is identical to the Ollama path — see our Ollama + Open WebUI on Linux guide for the full stack. Swap the backend URL and everything else carries over.

Continue (the VS Code AI extension) supports Lemonade the same way: set a custom OpenAI base URL in .continue/config.json. Dify and n8n both have validated connectors for OpenAI-compatible endpoints — Lemonade drops in without extra configuration.

When NOT to Use Lemonade

On pre-300-series AMD hardware. Ryzen AI 7000, 8000, and 200-series NPUs aren’t supported for LLM inference by any current runtime. Lemonade falls back to CPU or Vulkan GPU. At that point, Ollama’s wider model library and ecosystem make it the better choice.

When ecosystem breadth matters. Ollama has ~95,000 GitHub stars and integrations across LangChain, LlamaIndex, Open WebUI plugins, VS Code extensions, Raycast, and hundreds of community tools. Lemonade’s validated integrations cover the main ones but are narrower. Expect rough edges in any integration not explicitly listed in the docs.

For production multi-user workloads. Neither Lemonade nor Ollama are designed for high-concurrency serving with proper auth, rate limiting, and health checks. For that, vLLM is the right tool — see our vLLM production setup guide. If you want to evaluate cloud vs. self-hosted for production GPU workloads, RunPod is worth benchmarking against a local setup before committing to hardware.

When the model you need isn’t in Lemonade’s manager. Ollama maintains a curated library with version tags and tested compatibility. Lemonade pulls from Hugging Face generically — more flexible, but you will hit format or compatibility issues with obscure models that Ollama’s curation catches before publishing.

For understanding quantization trade-offs when selecting models, see our GGUF quantization guide.

FAQ

Does Lemonade work on Windows 10? No. Windows 11 is required. AMD’s Ryzen AI NPU drivers are Windows 11-only, and the desktop app requires it as well.

Can I run Lemonade and Ollama simultaneously? Yes — they use different ports (Lemonade: 13305, Ollama: 11434). Running both at once is a reasonable setup: Lemonade for NPU-accelerated models and image gen, Ollama for models with better library coverage.

Does Lemonade support LoRA adapters? Not through the model manager API as of v10.6.0. The GGUF backend inherits llama.cpp’s command-line LoRA support, but there’s no model-manager UI for LoRA loading. It’s on the roadmap per GitHub issues.

What is the difference between Lemonade’s AMD GPU path and ROCm? Lemonade defaults to llama.cpp + Vulkan for AMD GPUs, which works on all RDNA cards without a full ROCm installation. The experimental vllm:rocm backend is available for Ryzen AI Max+ on Linux but requires the full ROCm stack. For most users, Vulkan is the practical path.

Is NPU inference output deterministic? FastFlowLM on XDNA2 produces deterministic output within the same model version and quantization. Switching quantization levels or model versions will produce different outputs, same as any other inference backend.

Sources

Recommended Gear

Was this article helpful?