Jun 6, 2026

Flux vs SDXL vs SD 3.5: Which to Run in 2026

By AIFoss · 16 min read

Tags: stable diffusion, flux, sdxl, image generation, gpu, open source

TL;DR: Your GPU’s VRAM determines which image model you can actually run; benchmarks don’t matter if the model won’t fit in memory. SDXL is the 8GB choice with the widest LoRA ecosystem; SD 3.5 Medium earns its place at 6–10GB with better text rendering and a workable commercial license; Flux.1 Dev leads on raw quality at 24GB, but its license is non-commercial.

	SDXL 1.0	SD 3.5 Medium	Flux.1 [dev]
Best for	LoRAs, fine-tunes, widest tooling	Consumer GPU, text-in-image	Maximum quality, research
Min VRAM	8GB (comfortable)	~6GB	6GB (GGUF Q4)
Full-precision VRAM	~12GB	~10GB	24GB
Commercial license	Yes (CreativeML OpenRAIL++-M)	Yes (up to $1M revenue)	No
LoRA ecosystem	Massive	Growing	Growing
Steps (typical)	20–30	25–40	20–50

Honest take: On an 8GB GPU, use SDXL. On 10–16GB, pick SD 3.5 Medium or Flux.2 Klein 4B, depending on which license terms and step count suit your project. On 24GB, Flux.1 Dev is the default unless you specifically need text rendering, where SD 3.5 Large wins.

The Hardware Question Comes First

Most Flux-vs-SDXL articles lead with quality comparisons. That framing is backwards. The model that scores best in abstract benchmarks is irrelevant if it requires 24GB and you have 12GB. Before you compare quality, figure out what your GPU can actually run without quantization tricks destroying the advantage.

The models in scope here, as of June 2026:

Flux.1 [dev] — 12B parameters, Black Forest Labs, released August 2024. The quality reference in open-source image generation. Non-commercial license.
Flux.1 [schnell] — Same architecture, distilled to 4 inference steps. Apache 2.0. The commercial-safe Flux option before Klein arrived.
Flux.2 Klein 4B — Released January 15, 2026. 4B parameters, Apache 2.0, 4 steps. The current practical recommendation for commercial Flux work under 16GB VRAM.
SDXL 1.0 — 3.5B parameters, Stability AI, July 2023. The widest LoRA and extension ecosystem in open-source image generation.
SD 3.5 Medium — 2.5B parameters, Stability AI, October 2024. The sweet spot for consumer hardware with strong text-in-image performance.
SD 3.5 Large — 8.1B parameters, Stability AI. Higher quality than Medium, but requires 16–18GB at FP16.

VRAM and Speed Reference Table

Community benchmarks at 1024×1024, ComfyUI, no batch, no torch.compile unless noted. Numbers vary by driver version, software optimization, and step count.

Model	Parameters	Min VRAM	Comfortable	Speed (RTX 4090)	Steps
Flux.1 [dev]	12B	6GB (GGUF Q4)	16GB (FP8)	~18s/img	20–50
Flux.1 [schnell]	12B	6GB (GGUF Q4)	16GB (FP8)	~8s/img	4
Flux.2 Klein 4B	4B	~10GB (Q8)	13GB (FP16)	~6s/img	4
SDXL 1.0	3.5B	8GB	10–12GB	~3–4s/img	20–30
SD 3.5 Medium	2.5B	6GB	8–10GB	~5s/img	25–40
SD 3.5 Large	8.1B	11GB (FP8)	16–18GB	~12s/img	25–40

RTX 4090 Flux.1 Dev speed (~18 seconds per image at FP16) from ComfyUI community benchmarks. SDXL speed (~3.2–4 seconds at 20 steps on RTX 4090) from Salad benchmark data. Speed on an RTX 3090 for Flux.1 Dev runs approximately 25–35 seconds at FP16; SDXL on a 3090 generates in roughly 5–7 seconds at 20 steps.
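
The Min VRAM and Comfortable columns track a simple rule of thumb: weight memory is roughly parameter count times bytes per parameter. Here is a minimal sketch of that arithmetic. The bytes-per-parameter values are the usual ones for FP16 and FP8, and the Q4 figure ignores quantization metadata; the function name is mine, not from any library. It counts weights only, so text encoders, VAE, and activations come on top.

```python
# Rough weight-only VRAM estimate: parameters x bytes per parameter.
# Illustrative only: ignores text encoders, VAE, activations, and overhead.

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "q4": 0.5}  # q4 ignores metadata


def weight_gb(params_billions: float, precision: str) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    models = [("Flux.1 [dev]", 12.0), ("SD 3.5 Large", 8.1), ("SD 3.5 Medium", 2.5)]
    for name, params in models:
        row = ", ".join(
            f"{prec}: {weight_gb(params, prec):.1f} GB" for prec in BYTES_PER_PARAM
        )
        print(f"{name} -> {row}")
```

For Flux.1 [dev] this gives 24GB at FP16, 12GB at FP8, and 6GB at Q4, which lines up with the table. Where the table's numbers run higher than the weight-only estimate (SD 3.5 Large at FP8, for example), the gap is the text encoders and working memory.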

If You Have 8GB VRAM

An RTX 3060 Ti, RTX 4060, or AMD RX 7600 all land at 8GB. SDXL was designed for this tier and runs well without workarounds.

Set up ComfyUI with SDXL, enable xformers, and you get 1024×1024 images in 10–20 seconds at 20 steps. The LoRA catalogue on Civitai alone has tens of thousands of style-specific fine-tunes. ControlNet, AnimateDiff, IP-Adapter — all mature and well-documented at this tier. See the 8GB VRAM image generation guide for the specific flags and memory workarounds that matter on this GPU class.

SD 3.5 Medium is technically possible at 8GB with CPU offloading enabled in ComfyUI. The experience degrades: decode times slow noticeably, and the VAE and T5-XXL text encoder compete for memory with the main model. At 8GB, it’s a compromise you don’t need to make when SDXL runs cleanly.

Flux.1 Dev at 8GB requires GGUF Q4 quantization. It works, and you can find guides showing it running on a 3060, but generation takes 60+ seconds and Q4 quantization visibly degrades fine detail — faces, textures, and fine linework. The quality advantage Flux Dev has over SDXL shrinks considerably at Q4. If you need Flux-level quality output occasionally and don’t want to quantize it, a RunPod A100 instance for a few hours is cheaper than the quality tax of Q4 on consumer hardware.

8GB verdict: SDXL. No contest.

If You Have 10–16GB VRAM

This covers the RTX 3080 10GB, RTX 3080 Ti 12GB, RTX 4070, RTX 4070 Ti, and the newer RTX 5070 Ti 16GB. This is the most contested tier because multiple models run well here and the right choice depends on what you’re actually building.

SD 3.5 Medium is the general-purpose answer. At 10–12GB it runs at or near full precision without offloading. Quality beats SDXL on prompt adherence, spatial composition, and — critically — text rendered inside images. The triple text encoder (T5-XXL + CLIP-L + OpenCLIP-G) produces coherent on-image text at a reliability SDXL simply cannot match. The Stability AI Community License allows commercial use up to $1M in annual revenue, which covers most independent developers and small teams.

The friction: SD 3.5 Medium’s LoRA ecosystem is substantially smaller than SDXL’s, and its MMDiT architecture is not compatible with SDXL’s UNet-based ControlNet and AnimateDiff implementations. Some workflows built around SDXL don’t have direct SD 3.5 equivalents yet.

Flux.2 Klein 4B is the stronger option if you want Flux-lineage quality and need Apache 2.0 licensing. Released January 15, 2026, it runs in approximately 13GB at FP16, generates in 4 inference steps, and produces output quality between Schnell and Dev on complex prompts. It’s newer, so fine-tune support and ControlNet integrations are still developing — but if you need a Flux-family model that you can deploy commercially without license friction, Klein 4B is the current answer.

Flux.1 Dev in FP8 fits at 12–16GB and produces noticeably better outputs than Schnell on prompts with complex spatial relationships. The constraint is the license: non-commercial only. It’s valid for personal work, research, and building internal tooling that won’t face end users commercially.

10–16GB verdict: SD 3.5 Medium for most commercial work; Flux.2 Klein 4B if you want better quality with Apache 2.0; Flux.1 Dev FP8 for non-commercial personal or research work.

If You Have 24GB VRAM

An RTX 3090 or RTX 4090 runs everything at full or near-full precision. (The RTX 5080 is a 16GB card, so it belongs to the previous tier.) The question shifts from capability to use case.

For photorealism, multi-subject scenes, facial detail, and general-purpose best-quality generation, Flux.1 Dev at FP16 is the default. It generates in ~18–20 seconds per image on a 4090 at 50 steps. At 20 steps you get results in roughly 8–10 seconds with minimal quality loss — useful for draft iteration. The non-commercial license is the one constraint to know before you start.

SD 3.5 Large is worth reaching for in specific cases: images where text must appear inside the output, highly compositional scenes where the T5-XXL encoder matters, or when you want to fine-tune and need a model with a commercial-friendly license. At FP16 it uses ~18GB of your 24GB, leaving limited headroom for VAE and text encoders. NVIDIA’s TensorRT integration for SD 3.5 Large delivers a 2.3× speedup at FP8 precision with ~40% memory reduction — at TensorRT FP8 it uses roughly 11GB and generates in 5–6 seconds per image. AMD GPU users get no benefit from TensorRT.
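
As a sanity check, the TensorRT figures above follow from the FP16 baseline. This sketch assumes the ~12s/img RTX 4090 number from the reference table is the same baseline the 2.3× claim was measured against, which the source does not confirm.

```python
# Derive the TensorRT FP8 figures from the FP16 baseline quoted in this guide.
# Assumption: the 2.3x speedup applies to the ~12 s/img RTX 4090 baseline.

fp16_seconds = 12.0      # SD 3.5 Large, RTX 4090, from the reference table
fp16_vram_gb = 18.0      # approximate FP16 footprint
speedup = 2.3            # quoted TensorRT FP8 speedup
memory_reduction = 0.40  # quoted memory reduction

trt_seconds = fp16_seconds / speedup
trt_vram_gb = fp16_vram_gb * (1 - memory_reduction)

print(f"TensorRT FP8: ~{trt_seconds:.1f} s/img, ~{trt_vram_gb:.1f} GB")
```

That works out to about 5.2 seconds and 10.8GB, consistent with the "roughly 11GB" and "5–6 seconds" figures.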

24GB verdict: Flux.1 Dev for most use cases; SD 3.5 Large when text rendering or fine-tuning requirements push you toward Stability’s licensing model.

Licensing: The Section Most Comparisons Skip

This matters if you’re building something that will ship.

Model	License	Commercial use
Flux.1 [dev]	FLUX.1-dev Non-Commercial License v2.0	No (testing/research only)
Flux.1 [schnell]	Apache 2.0	Yes
Flux.2 Klein 4B	Apache 2.0	Yes
SDXL 1.0	CreativeML OpenRAIL++-M	Yes (with use restrictions)
SD 3.5 Medium	Stability AI Community License	Yes (up to $1M revenue)
SD 3.5 Large	Stability AI Community License	Yes (up to $1M revenue)

The CreativeML OpenRAIL++-M license on SDXL includes an explicit prohibited-use list (generating certain types of illegal content, among other restrictions) and attribution requirements. For standard commercial image applications, it’s workable. If you’re in a regulated industry or need a clean Apache/MIT-style grant, verify the full text.

The Stability AI Community License for SD 3.5 is simple for small operators: free for commercial use until $1M annual revenue, after which you need to contact Stability AI for an enterprise agreement. It is not as open as Apache 2.0, but for small teams and solo developers it is effectively free commercial use.

Flux.1 Dev’s restriction catches people off guard. The nuance: generated outputs can be used commercially even under the non-commercial license. The restriction applies to deploying the model itself in a product or service, not to the images you generate with it. If you’re generating images for your own portfolio or client work without running the model as a service, read the full license text at bfl.ai to determine whether your use case qualifies.
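
If you want the licensing table as something a build script can check, a lookup like the following is one way to encode it. This is a sketch of the table above, not legal advice. The function name and dictionary layout are hypothetical, and treating the $1M threshold as a strict "under $1M" boundary is my assumption.

```python
# Commercial-use summary transcribed from the licensing table in this guide.
# revenue_cap_usd is None where the table lists no revenue threshold.
LICENSES = {
    "Flux.1 [dev]": {"commercial": False, "revenue_cap_usd": None},
    "Flux.1 [schnell]": {"commercial": True, "revenue_cap_usd": None},
    "Flux.2 Klein 4B": {"commercial": True, "revenue_cap_usd": None},
    "SDXL 1.0": {"commercial": True, "revenue_cap_usd": None},
    "SD 3.5 Medium": {"commercial": True, "revenue_cap_usd": 1_000_000},
    "SD 3.5 Large": {"commercial": True, "revenue_cap_usd": 1_000_000},
}


def can_deploy_commercially(model: str, annual_revenue_usd: float) -> bool:
    """True if the table permits shipping this model at the given revenue."""
    entry = LICENSES[model]
    if not entry["commercial"]:
        return False
    cap = entry["revenue_cap_usd"]
    # Assumption: the cap is treated as "strictly under the threshold".
    return cap is None or annual_revenue_usd < cap
```

A check such as `can_deploy_commercially("SD 3.5 Medium", 250_000)` passes, while the same call for Flux.1 [dev] fails at any revenue. SDXL's use restrictions are not modeled here; they still apply.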

Quality by Use Case

Abstract quality scores don’t map to real-world tasks. Here’s how the models rank on specific use cases:

Photorealism (portraits, product shots): Flux.1 Dev > SD 3.5 Large > SDXL > SD 3.5 Medium

Stylized art and illustration: SDXL (LoRA availability dominates) > Flux.1 Dev > SD 3.5 Medium

Text rendered inside images: SD 3.5 Medium ≈ SD 3.5 Large >> Flux.1 Dev > SDXL

Inpainting workflows: SDXL (dedicated inpainting model) ≈ SD 3.5 Medium; Flux.1 Dev inpainting exists in ComfyUI workflows but is less mature

ControlNet integration: SDXL >> SD 3.5 Medium > Flux.1 Dev (InstantX Flux ControlNet is available but covers fewer preprocessors than SDXL’s ecosystem)

Text rendering is SD 3.5’s clearest advantage. If your use case involves generating images that contain readable text — product packaging mockups, social graphics, UI screenshots — SD 3.5 Medium or Large will outperform Flux on this dimension reliably.

Setup: Commands That Work

SDXL in ComfyUI — download the base checkpoint and drop it into models/checkpoints/:

git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI && pip install -r requirements.txt

huggingface-cli download stabilityai/stable-diffusion-xl-base-1.0 \
  sd_xl_base_1.0.safetensors \
  --local-dir models/checkpoints/

python main.py --gpu-only

Flux.1 Dev FP8 in ComfyUI — use the FP8 checkpoint (~12GB) to fit in 12–16GB. Note: load via UNETLoader node, not CheckpointLoader:

# Download FP8 flux dev checkpoint
huggingface-cli download Kijai/flux-fp8 flux1-dev-fp8.safetensors \
  --local-dir models/diffusion_models/

# In ComfyUI: use the UNETLoader node (not CheckpointLoader) to load the model.
# CheckpointLoader expects a combined checkpoint (model + text encoders + VAE);
# this file holds only the Flux diffusion model, so load the text encoders and
# VAE through their own loader nodes.
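
Before launching ComfyUI, it can save a confusing "model not found" round trip to confirm the downloads landed where the loaders look. A minimal sketch, assuming the two paths used in the download commands above; the helper name is hypothetical.

```python
from pathlib import Path

# Expected locations, taken from the download commands in this guide.
EXPECTED = {
    "SDXL base": Path("models/checkpoints/sd_xl_base_1.0.safetensors"),
    "Flux.1 Dev FP8": Path("models/diffusion_models/flux1-dev-fp8.safetensors"),
}


def missing_models(comfy_root: str) -> list[str]:
    """Return the names of expected model files not found under comfy_root."""
    root = Path(comfy_root)
    return [name for name, rel in EXPECTED.items() if not (root / rel).is_file()]


if __name__ == "__main__":
    # Run from inside the ComfyUI directory.
    for name in missing_models("."):
        print(f"missing: {name}")
```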

SD 3.5 Medium via diffusers:

from diffusers import StableDiffusion3Pipeline
import torch

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium",
    torch_dtype=torch.bfloat16
)
pipe.to("cuda")

image = pipe(
    "A photorealistic tabby cat reading a folded newspaper, studio lighting",
    num_inference_steps=28,
    guidance_scale=4.5,
).images[0]
image.save("output.png")

Expected runtime: ~20–25 seconds on an RTX 3080 10GB at 1024×1024, bfloat16. If you hit OOM on the VAE decode step, replace pipe.to("cuda") with pipe.enable_model_cpu_offload(). Offloading manages device placement itself, so don't combine the two calls.
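
One way to keep both code paths in a single script is a small builder that picks between full-GPU placement and CPU offloading. This is a sketch under a few assumptions: the function name and the low_vram flag are mine, offloading takes the place of the pipe.to("cuda") call, and enable_model_cpu_offload() needs the accelerate package installed.

```python
def build_sd35_pipeline(low_vram: bool = False):
    """Load SD 3.5 Medium, optionally with model CPU offloading.

    Imports are deferred so the module can be imported without torch installed.
    """
    import torch
    from diffusers import StableDiffusion3Pipeline

    pipe = StableDiffusion3Pipeline.from_pretrained(
        "stabilityai/stable-diffusion-3.5-medium",
        torch_dtype=torch.bfloat16,
    )
    if low_vram:
        # Moves each sub-model to the GPU only while it runs.
        # Used instead of pipe.to("cuda"), not alongside it.
        pipe.enable_model_cpu_offload()
    else:
        pipe.to("cuda")
    return pipe
```

Call build_sd35_pipeline(low_vram=True) on 8–10GB cards, then use the returned pipeline exactly as in the example above. Expect slower generation when offloading is on.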

For automating any of these models in a pipeline without a browser GUI, the ComfyUI API tutorial covers how to queue workflows via Python and WebSocket without touching the GUI.
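
For a sense of what that looks like, here is a minimal sketch of building the HTTP request that queues a workflow. It assumes ComfyUI's default local address (127.0.0.1:8188) and its /prompt endpoint, and that the workflow dict was exported from ComfyUI in API format. The helper name is hypothetical, and the WebSocket progress side is left out.

```python
import json
import urllib.request
import uuid


def build_queue_request(
    workflow: dict, server: str = "127.0.0.1:8188"
) -> urllib.request.Request:
    """Build (but do not send) a POST request that queues a workflow."""
    payload = {"prompt": workflow, "client_id": str(uuid.uuid4())}
    return urllib.request.Request(
        f"http://{server}/prompt",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# To actually queue it against a running ComfyUI instance:
#   with urllib.request.urlopen(build_queue_request(workflow)) as resp:
#       print(resp.read().decode("utf-8"))
```

Keeping request construction separate from sending makes the payload easy to inspect or log before anything hits the server.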

Common Problems and Fixes

Flux.1 Dev OOM on 12–16GB: You’re likely loading FP16. Switch to the FP8 checkpoint from Kijai/flux-fp8 on Hugging Face. Also verify you’re using the UNETLoader node in ComfyUI: the standard CheckpointLoader pathway expects combined weights (model, text encoders, and VAE in one file), which this FP8 file does not contain.

SD 3.5 Medium VAE decode crash on 8GB: SD 3.5 uses a 16-channel VAE that’s more memory-intensive than SDXL’s 4-channel VAE. At 8GB, the VAE decode competes with the residual model weight for memory. Generate at 768×768 and upscale, or load the VAE with explicit dtype: AutoencoderKL.from_pretrained(..., torch_dtype=torch.float16). In Forge WebUI, pass --opt-sdp-no-mem-attention in addition to --medvram.

SDXL refiner step ignored: The SDXL refiner (sd_xl_refiner_1.0) is a separate model. Load both base and refiner, run the base for the first ~80% of steps, then hand off latents to the refiner for the final 20%. In ComfyUI this is the standard Base+Refiner workflow. Using the refiner as a standalone checkpoint at 100% steps produces worse results than just running the base alone.
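
The 80/20 handoff is just a step split. A minimal sketch of the arithmetic follows; the helper name is hypothetical. In ComfyUI, my understanding is that the first number maps to the base sampler's end step and the refiner sampler's start step in the advanced sampler nodes, so check this against your own workflow.

```python
def split_steps(total_steps: int, base_fraction: float = 0.8) -> tuple[int, int]:
    """Split a step budget between the SDXL base and refiner.

    Returns (base_steps, refiner_steps); the refiner starts where the base stops.
    """
    if not 0.0 < base_fraction < 1.0:
        raise ValueError("base_fraction must be between 0 and 1")
    base_steps = round(total_steps * base_fraction)
    return base_steps, total_steps - base_steps


if __name__ == "__main__":
    base, refiner = split_steps(30)
    print(f"base: steps 0-{base}, refiner: steps {base}-{base + refiner}")
```

With a 30-step budget that gives 24 base steps and 6 refiner steps.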

Flux.1 Schnell outputs look flat: Schnell at 4 steps has weaker preference alignment than Dev — it generates fast but produces more average-looking outputs. Push steps to 6–8 and use a slightly higher CFG guidance value (try 3.5–4.5 instead of the default 1.0). Quality improves considerably with those extra steps at the cost of generating in ~12 seconds instead of 8.

When NOT to Use Each

Don’t use Flux.1 Dev if you’re building a product or API endpoint. The non-commercial license is a hard block for production deployments. Use Flux.1 Schnell (Apache 2.0) or Flux.2 Klein 4B (Apache 2.0) for anything that ships.

Don’t use SDXL when raw output quality on complex prompts is the priority and you have at least 12GB VRAM available. SDXL’s UNet architecture has real limitations on spatial coherence and multi-subject prompt adherence that Flux’s transformer approach doesn’t share. Use SDXL for its ecosystem — LoRAs, ControlNet depth, AnimateDiff — not as the best-quality model.

Don’t use SD 3.5 Large on a 12GB GPU without first enabling FP8 quantization. At native FP16 it needs 16–18GB and will crash or heavily page to CPU without quantization. The TensorRT speedup also only benefits NVIDIA hardware — AMD users get none of those gains and should weigh whether SD 3.5 Large is worth the setup.

Don’t use SD 3.5 Medium if ControlNet or AnimateDiff workflows are central to what you’re building. SD 3.5’s MMDiT architecture is incompatible with SDXL’s UNet-based extensions. ControlNet adapters for SD 3.5 exist, but the variety and community testing are much thinner than SDXL’s library.

The Practical Decision Tree

8GB GPU → SDXL
10–16GB GPU, need commercial license → SD 3.5 Medium (general) or Flux.2 Klein 4B (Flux quality, 4 steps)
10–16GB GPU, personal or research work → Flux.1 Dev FP8
24GB GPU, best quality → Flux.1 Dev FP16
24GB GPU, need text in images or commercial fine-tuning → SD 3.5 Large
Any GPU, want to test all models before committing → RunPod A100 instance gives you 40–80GB VRAM to run every model at full precision side-by-side
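
The same tree can be sketched as a function, which is handy if you're scripting model selection across machines. The function name and parameters are hypothetical, and three choices are mine rather than the guide's: the thresholds between tiers (anything from 10GB up to 24GB counts as the 10–16GB tier), treating any commercial requirement at 24GB like the commercial fine-tuning branch, and returning SDXL for anything under 10GB.

```python
def recommend(
    vram_gb: float, commercial: bool = False, needs_text: bool = False
) -> str:
    """Return this guide's model recommendation for a GPU and use case."""
    if vram_gb >= 24:
        # Assumption: any commercial requirement follows the SD 3.5 Large branch.
        if needs_text or commercial:
            return "SD 3.5 Large"
        return "Flux.1 Dev FP16"
    if vram_gb >= 10:
        if commercial:
            return "SD 3.5 Medium or Flux.2 Klein 4B"
        return "Flux.1 Dev FP8"
    # The tree only covers 8GB and up; smaller cards are outside its scope.
    return "SDXL"


if __name__ == "__main__":
    for vram, commercial in [(8, False), (12, True), (16, False), (24, False)]:
        print(f"{vram}GB, commercial={commercial}: {recommend(vram, commercial)}")
```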

If you’re evaluating GPU hardware for a dedicated image generation workstation, the runaihome.com guide to GPU builds for local AI covers the RTX 3090 vs 4090 vs 5090 tradeoffs in detail.

FAQ

Can I use Flux.1 Dev images in my commercial portfolio or client work?
The FLUX.1-dev Non-Commercial License restricts deploying the model in a product or service; it treats the images you generate with it more permissively. If you’re generating images on your own machine for your own clients, not running an API, read the full license text at bfl.ai. The generated output licensing is intentionally more permissive than the model deployment licensing. When in doubt, switch to Flux.1 Schnell (Apache 2.0) and remove any ambiguity.

Is SD 3.5 Medium actually better than SDXL?
At raw output quality — especially prompt adherence and text rendering — yes. At practical day-to-day usability, SDXL still wins on ecosystem depth: more LoRAs, more ControlNet preprocessors, more AnimateDiff motion modules, more community troubleshooting resources. SD 3.5 Medium produces better images out of the box; SDXL gives you more tooling to iterate around it.

What is Flux.2 Klein and should I switch?
Flux.2 Klein 4B was released on January 15, 2026 by Black Forest Labs. At 4B parameters it runs in ~13GB at FP16, generates in 4 inference steps, and is Apache 2.0 licensed. Quality sits between Schnell and Dev on complex prompts. If you’re on 12–16GB and need commercial-safe Flux, Klein 4B is the current first recommendation. On a 24GB GPU already running Dev non-commercially, Klein 4B is a sidegrade — not an upgrade.

Does SD 3.5 work with AnimateDiff?
Not with the standard AnimateDiff motion modules, which were built for UNet-based models (SD 1.5 first, then SDXL). SD 3.5’s MMDiT architecture requires different motion modules. Community projects are working on native SD 3.5 video, but as of June 2026 AnimateDiff workflows should remain on SDXL.

Which model wins for inpainting?
SDXL has a dedicated inpainting checkpoint (sd_xl_inpaint_0.1) that produces clean results out of the box. SD 3.5 Medium handles inpainting via the standard img2img pipeline with masking, which works well but lacks the dedicated inpaint training that SDXL’s checkpoint has. Flux.1 Dev inpainting workflows exist in ComfyUI, typically built on the Flux.1 Fill model (Redux is the image-variation adapter, not an inpainting model), but are less mature. For production inpainting workloads, SDXL is still the most stable option.


Recommended Gear

RTX 3060 — 12GB VRAM, practical entry point for Flux FP8 and SDXL
RTX 3080 — 10GB, runs SD 3.5 Medium and Flux FP8 cleanly
RTX 4070 — 12GB, the 2026 sweet spot for image generation budget builds
RTX 5070 Ti — 16GB, runs every model in this guide except Flux Dev at full FP16
RTX 3090 — 24GB VRAM, the used-market pick for Flux Dev at full precision
RTX 4090 — 24GB, fastest single-GPU option for Flux Dev and SD 3.5 Large
RTX 5080 — 16GB, newer architecture with strong FP8 throughput
