May 31, 2026

Forge WebUI Review 2026: Faster SDXL and Flux on Less VRAM

By AIFoss · 10 min read

TL;DR: Forge is a backend rewrite of Automatic1111 that runs SDXL 30–75% faster, cuts SDXL VRAM needs from 8 GB+ to 4–6 GB at 1024px, and adds native Flux.1 support, all without changing the UI you already know. Switching from A1111 takes about 10 minutes. The only reasons not to: specific extensions that break, AMD hardware, or a need for ComfyUI’s automation-first pipeline model.

	Forge WebUI	Automatic1111	ComfyUI
Best for	A1111 users wanting speed + Flux	Legacy extensions, max compat	Node pipelines, automation
VRAM — Flux.1 Dev NF4	6–8 GB	Not supported	8–12 GB
VRAM — SDXL 1024px	4–6 GB	8 GB+	4–6 GB
Extension compatibility	~80% of A1111 extensions	100% (baseline)	Own ecosystem
Setup difficulty	One-click installer	One-click installer	Manual clone + deps
Speed vs A1111 (SDXL)	30–75% faster	Baseline	Comparable or faster
License	AGPL-3.0	AGPL-3.0	GPL-3.0

Honest take: If you’re on A1111 today, switch to Forge — it’s the same interface with a meaningfully faster engine and Flux.1 support that A1111 simply doesn’t have. Stick with A1111 only if you depend on extensions that are on the broken list.

What Forge Is

Stable Diffusion WebUI Forge is a fork of Automatic1111, created by lllyasviel — the same developer who built ControlNet. The goal was specific: replace A1111’s inference backend with a more efficient memory manager while keeping the existing extension ecosystem and UI intact.

The project is hosted at github.com/lllyasviel/stable-diffusion-webui-forge under an AGPL-3.0 license. The latest one-click installer package was released February 5, 2025; development on the main branch continues beyond that date. The AGPL-3.0 license matters if you plan to deploy a modified Forge as a public-facing service: AGPL’s network-use clause then requires you to offer the source of your modifications to the service’s users.

Visually, Forge looks like A1111. The txt2img, img2img, extras, and settings panels are nearly identical. The differences are in the engine that runs underneath.

Installation

Two paths: one-click package or manual clone.

One-click package (recommended for most users): Download from the GitHub releases page. Extract, run update.bat to sync the main branch, then run.bat (Windows) or webui.sh (Linux/macOS). The primary build uses CUDA 12.1 + PyTorch 2.3.1. A CUDA 12.4 + PyTorch 2.4 build is listed as “fastest” but has reported MSVC and xformers issues on some Windows configurations.

Manual clone:

git clone https://github.com/lllyasviel/stable-diffusion-webui-forge
cd stable-diffusion-webui-forge
# Linux/macOS
bash webui.sh
# Windows: run webui-user.bat

Your existing A1111 model files work without conversion. Copy or symlink your models/Stable-diffusion/, models/Lora/, and embeddings/ directories and Forge picks them up immediately. Extension reinstallation is the main migration cost — you’ll need to reinstall from the Extensions tab, and some won’t work (more on that below).
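For readers who prefer symlinks over copying, the migration step can be sketched in a few lines of Python. The three folder names come from the paragraph above; the function name and the skip rules are illustrative assumptions, not part of Forge.

```python
import os
from pathlib import Path

# Folder names taken from the article; everything else is illustrative.
SHARED_DIRS = ["models/Stable-diffusion", "models/Lora", "embeddings"]


def link_model_dirs(a1111_root, forge_root):
    """Symlink A1111 model folders into a Forge checkout.

    Returns the relative paths that were linked. Forge folders that
    already contain files, or are already links, are left untouched.
    """
    a1111_root, forge_root = Path(a1111_root), Path(forge_root)
    linked = []
    for rel in SHARED_DIRS:
        source = a1111_root / rel
        target = forge_root / rel
        if not source.is_dir():
            continue  # nothing to share for this folder
        if target.is_symlink():
            continue  # already linked on a previous run
        if target.exists():
            if not target.is_dir() or any(target.iterdir()):
                continue  # Forge side has content; do not clobber it
            target.rmdir()  # empty folder, safe to replace with a link
        target.parent.mkdir(parents=True, exist_ok=True)
        os.symlink(source.resolve(), target, target_is_directory=True)
        linked.append(rel)
    return linked
```

Calling `link_model_dirs` with your A1111 and Forge install paths returns the folders it linked. If a Forge folder already holds anything, including a placeholder file, the sketch skips it, so clear that folder first if you want the link. On Windows, creating symlinks may require Developer Mode or an elevated prompt.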

VRAM Savings

The headline feature, and the reason most people switch, is VRAM reduction. Forge achieves this through a dynamic memory management layer that splits model layers across GPU VRAM, CPU RAM, and shared GPU memory based on what fits — rather than requiring the full model to be GPU-resident.

SDXL: A1111 requires roughly 8 GB to run SDXL at 1024×1024 without medvram flags. Forge handles it at 4–6 GB. On an RTX 3070 (8 GB), SDXL generation that was unstable or impossible in A1111 runs cleanly. On 4 GB cards, some users report successful generation at reduced batch sizes, though performance degrades significantly.

Flux.1 Dev: A1111 does not support Flux.1 models natively. Forge does, through built-in BitsandBytes NF4 and FP8 quantization. VRAM breakdown by format:

Format	Approx. GPU VRAM	Notes
FP16	24 GB+	Full precision, highest quality
FP8	11–12 GB	Good quality, CUDA 11.7+, RTX 20xx+
NF4 (BitsandBytes)	6–8 GB	Best for limited VRAM, RTX 3xxx/4xxx
GGUF Q4	~6 GB	Via GGUF extension in Forge
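The figures in the table track the raw weight size closely, which a quick back-of-the-envelope calculation shows. This sketch assumes Flux.1 Dev has roughly 12 billion parameters (the publicly stated size of its transformer) and ignores text encoders, the VAE, activations, and quantization metadata, so real usage lands somewhat higher.

```python
# Assumption: ~12 billion parameters for the Flux.1 Dev transformer.
FLUX_DEV_PARAMS = 12e9

# Bits stored per weight for each format in the table above.
BITS_PER_WEIGHT = {"FP16": 16, "FP8": 8, "NF4": 4}


def weight_size_gb(params, bits):
    """Raw weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits / 8 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    size = weight_size_gb(FLUX_DEV_PARAMS, bits)
    print(f"{name}: ~{size:.0f} GB of weights")
```

Each halving of bits per weight halves the footprint, which is why NF4 fits on cards where FP8 does not.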

For an RTX 3080 (10 GB), FP8 Flux.1 Dev is the practical choice: the card sits just under the 11–12 GB FP8 footprint, so Forge offloads the remainder to system RAM, and you avoid the quality tradeoff of NF4. For 8 GB cards, NF4 is the path: expect generation times of 60–120 seconds per image at 1024×1024.

The memory offloading works best when you have ample system RAM as the overflow target. 32 GB of system RAM is a practical floor for comfortable Flux.1 Dev usage on GPUs below 12 GB.
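The advice in this section boils down to a small lookup. The thresholds below are this article's rules of thumb, not limits that Forge enforces, and the function names are made up for illustration.

```python
# (minimum GPU VRAM in GB, suggested Flux.1 Dev format), largest first.
FORMAT_BY_MIN_VRAM_GB = [
    (24, "FP16"),
    (10, "FP8"),  # the article treats a 10 GB RTX 3080 as FP8-capable
    (6, "NF4"),
]


def pick_flux_format(vram_gb):
    """Return the highest-precision format the article suggests, or None."""
    for min_vram, fmt in FORMAT_BY_MIN_VRAM_GB:
        if vram_gb >= min_vram:
            return fmt
    return None


def system_ram_is_comfortable(vram_gb, ram_gb):
    """Apply the article's 32 GB system-RAM floor for GPUs below 12 GB."""
    return vram_gb >= 12 or ram_gb >= 32


print(pick_flux_format(8))                # NF4
print(system_ram_is_comfortable(8, 16))   # False
```

Treat the result as a starting point: actual headroom depends on resolution, batch size, and whatever else is using the GPU.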

Generation Speed

Community benchmarks put Forge’s SDXL speed 30–75% faster than A1111, depending on hardware configuration and the number of active LoRAs. One specific benchmark on an RTX 3090 at 1024×1024 SDXL with five concurrent LoRAs clocked A1111 at 1 minute 45 seconds vs Forge at 1 minute 10 seconds, a 33% reduction in generation time. With fewer LoRAs and simpler configurations, some users report 50–75% improvements.

Against ComfyUI: a separate benchmark on an A6000 measured ComfyUI at 5.35 it/s vs Forge at 4.9 it/s for SDXL — Forge runs about 8–9% slower than ComfyUI’s optimized pipeline. This is the expected tradeoff for retaining an extension-compatible frontend.
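Percentages like these depend on what is being measured, so it helps to redo the arithmetic. The sketch below uses only the benchmark numbers quoted above; the helper names are illustrative. Note that the same RTX 3090 result reads as a 33% cut in time or a 50% gain in throughput, and community reports do not always say which one they mean.

```python
def time_reduction_pct(baseline_s, new_s):
    """How much shorter the new run is, as a share of the baseline time."""
    return (baseline_s - new_s) / baseline_s * 100


def throughput_gain_pct(baseline_s, new_s):
    """How many more images per unit time the new run produces."""
    return (baseline_s / new_s - 1) * 100


def slower_pct(fast_its, slow_its):
    """How far the slower it/s figure falls short of the faster one."""
    return (fast_its - slow_its) / fast_its * 100


a1111_s = 1 * 60 + 45  # 105 seconds
forge_s = 1 * 60 + 10  # 70 seconds

print(round(time_reduction_pct(a1111_s, forge_s), 1))   # 33.3
print(round(throughput_gain_pct(a1111_s, forge_s), 1))  # 50.0
print(round(slower_pct(5.35, 4.9), 1))                  # 8.4
```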

For Flux.1 Dev on an RTX 3090 (24 GB) running FP8, expect 20–40 seconds per 1024×1024 image. On 8 GB NF4 hardware, 60–120 seconds is a realistic expectation. Neither number is impressive against cloud inference, but for local generation with no per-image cost, it’s the current state of the technology.

Extension Compatibility

Forge maintains backward compatibility with most A1111 extensions, but “most” is doing real work in that sentence.

What works reliably:

ControlNet: Built into Forge directly — no separate extension needed. The integrated version is faster than A1111’s ControlNet extension. Adding ControlNet to an SDXL generation in Forge runs 30–45% faster than A1111 + ControlNet extension, per community benchmarks.
ADetailer: Functions normally
LoRA / LyCORIS: Full support, same model format as A1111
Most prompt and aesthetic extensions: Negative prompt tools, style selectors, regional prompters

What breaks or has limitations:

Batch ControlNet operations: Forge’s integrated ControlNet is missing batch processing features from the standalone A1111 extension
Extensions that hook into A1111’s sampling pipeline at a low level: These break because Forge replaced that pipeline
Approximately 20% of A1111 extensions: Either don’t function or fail silently without error messages

The silent failure mode is the frustrating part. An extension that loads without an error but does nothing is harder to debug than a clear crash. Before switching, check the Forge Extension List and Extension Replacement List on GitHub: they document specific incompatibilities and recommend Forge-compatible replacements for common extensions.
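Because failures can be silent, it is worth auditing your installed extensions before you migrate. This is a minimal sketch, assuming extensions live as folders under `extensions/` in the A1111 root and that you compile the list of known-broken names yourself from the GitHub lists. The extension name in the usage note is a hypothetical placeholder, not a real compatibility result.

```python
from pathlib import Path


def audit_extensions(a1111_root, known_broken):
    """Split installed extension folder names into (ok, flagged).

    known_broken is a list of names you collected yourself;
    matching is case-insensitive on the folder name.
    """
    ext_dir = Path(a1111_root) / "extensions"
    if not ext_dir.is_dir():
        return [], []
    installed = sorted(p.name for p in ext_dir.iterdir() if p.is_dir())
    broken = {name.lower() for name in known_broken}
    flagged = [n for n in installed if n.lower() in broken]
    ok = [n for n in installed if n.lower() not in broken]
    return ok, flagged
```

For example, `audit_extensions("path/to/a1111", ["example-custom-sampler"])` returns the folders that look safe and the ones to check by hand. A folder that is not flagged is not proven to work; it only means it was not on your list.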

Forge Forks

The upstream Forge project is maintained by lllyasviel but has described itself as “experimental” since launch. Two community forks have emerged and are actively maintained in 2026:

reForge (Panchovix): Prioritizes stability and broader hardware support. Better support for older NVIDIA cards (GTX 10xx/20xx series) and AMD via DirectML. If you’re on hardware that Forge’s CUDA-centric optimizations don’t target well, reForge is worth testing first.

Forge Classic (Haoming02, formerly Forge Neo): Continues the Gradio 4 UI path with ongoing UI improvements and expanded model support. More actively maintained for UI-layer features than upstream Forge.

The upstream Forge repository remains the most referenced starting point and the one with the largest user base. But if you hit compatibility issues — particularly on AMD or older NVIDIA hardware — both forks are production-grade alternatives.

When Not to Use Forge

You depend on extensions that break. The ~20% incompatibility rate isn’t theoretical — specific extensions that users rely on for workflows (certain custom samplers, some preprocessors, some inpainting tools) fail in Forge. If your critical extension is on the broken list, A1111 stays the safer choice until Forge compatibility improves or a replacement extension is available.

You need ComfyUI’s pipeline automation. Forge is a UI-first tool. It doesn’t expose an API comparable to ComfyUI’s workflow JSON system. Batch generation, programmatic prompt chaining, and automated node pipelines are ComfyUI territory. For a full breakdown of where each tool wins, see the ComfyUI vs Automatic1111 vs Forge comparison.

You’re on AMD hardware. Forge’s VRAM optimizations are CUDA-centric. AMD users should evaluate reForge (DirectML) or ComfyUI with ROCm before defaulting to upstream Forge.

Commercial deployment with AGPL concerns. AGPL-3.0 requires releasing modifications if you run a modified Forge as a public-facing service. ComfyUI is GPL-3.0, which has no network-use clause: its copyleft applies when you distribute the software, not when you only host it. For commercial SaaS use, verify your legal exposure before deploying either.

The Flux.1 Context

Forge is currently the most accessible entry point for running Flux.1 Dev locally, particularly for 8–12 GB VRAM users. ComfyUI supports Flux.1 as well and handles it better for custom multi-step pipelines — but ComfyUI’s node-based setup has a steeper initial learning curve.

For a detailed look at how Flux.1 compares to SDXL and SD 3.5 in terms of quality, speed, and VRAM demands, see the Flux vs SDXL vs SD 3.5 model comparison. For users deciding which models to run on 8 GB cards specifically, the Stable Diffusion on 8GB VRAM guide covers the tradeoffs in more depth.

If you’re evaluating Flux.1 Dev at full FP16 quality but don’t own a 24 GB card, RunPod offers on-demand RTX 4090 and A100 instances with pre-configured Forge templates — practical for testing quality before committing to a hardware purchase.

Frequently Asked Questions

Is Forge WebUI better than Automatic1111 in 2026? For most users, yes. Forge runs SDXL 30–75% faster, uses less VRAM, and adds native Flux.1 support that A1111 doesn’t have. The gap has widened since Forge launched because Flux.1 became the dominant new model family and A1111 still has no native support for it. The only case for staying on A1111 is specific extension dependencies that break in Forge.

Can Forge run Flux.1 Dev on an 8GB GPU? Yes, using BitsandBytes NF4 quantization. You need an RTX 3000 series or newer GPU (CUDA 11.7+), 32 GB of system RAM as the overflow buffer, and patience — expect 60–120 seconds per image at 1024×1024. Quality is good but not equivalent to FP16. The NF4 model file is approximately half the size of FP8.

Do A1111 extensions work in Forge? Roughly 80% do. Extensions that hook into A1111’s low-level sampling pipeline are the most likely to break, often without clear error messages. Check the Forge Extension List on GitHub before migrating if you have critical workflow extensions.

How does Forge compare to ComfyUI for beginners? Forge is easier to start with if you’re coming from A1111 — the UI is nearly identical. ComfyUI has a steeper initial setup but is the better long-term choice for automated or complex multi-model workflows. For pure image generation with minimal configuration, Forge wins on accessibility.

Is Forge WebUI free to use commercially? Forge is AGPL-3.0 licensed. Personal use and internal tools have no restrictions. If you deploy Forge as a public service, you’re required to release the source code for any modifications. This applies equally to A1111. For commercial SaaS use cases, verify the licensing implications with a lawyer before shipping.

Recommended Gear

RTX 3070 8GB — minimum comfortable SDXL card; runs Flux.1 Dev NF4
RTX 3080 10GB — hits the FP8 Flux.1 Dev sweet spot
RTX 3090 24GB — runs Flux.1 Dev FP16 fully GPU-resident
