May 22, 2026

Stable Diffusion on 8GB VRAM 2026: SDXL vs Flux Guide

By AIFoss · 11 min read

Tags: stablediffusion, ai, imagegeneration, gpu, opensource

Eight-gigabyte GPUs make up a large share of the consumer market: RTX 3060 Ti, 3070, 4060, 4060 Ti (8GB variant), and AMD RX 7600. They’re powerful enough for real image generation work, but most tutorials assume a 24GB workstation card and skip the workarounds that actually matter at this memory tier.

This guide is specifically for 8GB VRAM. You’ll get SDXL and Flux.1 running in ComfyUI v0.22.0, with the flags, model choices, and workflow nodes that actually work, along with an honest comparison of what you’re giving up against a larger GPU.

Cards this applies to

GPU	VRAM	Notes
RTX 3060 Ti	8 GB	GDDR6, handles SDXL fine
RTX 3070 / 3070 Ti	8 GB	Faster than 3060 Ti, same VRAM ceiling
RTX 4060	8 GB	Efficient; low power draw for an 8GB card
RTX 4060 Ti (8GB)	8 GB	Not the 16GB variant
RX 7600	8 GB	AMD ROCm; same constraints, some caveats below

The RX 6700 XT has 12 GB — if you have one, you’re above the 8GB floor and some of the tighter workarounds below are optional for you.

AMD note: ComfyUI’s GGUF quantization extension (city96/ComfyUI-GGUF) has weaker ROCm support than CUDA. If you’re on AMD, the SDXL path is more reliable than Flux GGUF for now.

What fits in 8GB out of the box

This table is the starting point. “VRAM (inference)” means peak usage during generation, not just model size on disk.

Model	Precision	Peak inference VRAM	8GB viable?	What’s needed
SDXL 1.0 (base)	FP16	~7–8 GB	Tight	`--lowvram` + VAE fix
SDXL 1.0 (pruned)	FP16	~5.5–6.5 GB	Yes	Standard flags
Flux.1 Dev	FP16	~23 GB	No	Needs 24GB+
Flux.1 Dev	GGUF Q5_K_S	~7.5–8 GB	Yes	ComfyUI-GGUF extension
Flux.1 Dev	GGUF Q4_0	~6.5–7 GB	Yes	ComfyUI-GGUF; quality tradeoff
Flux.1 Schnell	GGUF Q5_K_S	~7.5–8 GB	Yes	ComfyUI-GGUF; Apache 2.0 licensed

Full FP16 Flux peaks around 23 GB, so no combination of flags makes it practical on an 8GB card. SDXL with the pruned weights and Flux GGUF quantized to Q5 are the two practical paths.
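As a quick sanity check, the table can be treated as data. The sketch below copies the upper bound of each peak-VRAM range from the table and lists the rows that fit a given budget. The name `viable_models` and the `reserve_gb` parameter are illustrative, not part of any tool, and the numbers are only as good as the table's estimates.

```python
# Upper bounds of the peak-inference VRAM ranges from the table above, in GB.
PEAK_VRAM_GB = {
    ("SDXL 1.0 (base)", "FP16"): 8.0,
    ("SDXL 1.0 (pruned)", "FP16"): 6.5,
    ("Flux.1 Dev", "FP16"): 23.0,
    ("Flux.1 Dev", "GGUF Q5_K_S"): 8.0,
    ("Flux.1 Dev", "GGUF Q4_0"): 7.0,
    ("Flux.1 Schnell", "GGUF Q5_K_S"): 8.0,
}


def viable_models(budget_gb, reserve_gb=0.0):
    """Return (model, precision) pairs whose worst-case peak fits the budget.

    reserve_gb is VRAM you expect something else (desktop, browser) to hold.
    """
    usable = budget_gb - reserve_gb
    return sorted(key for key, peak in PEAK_VRAM_GB.items() if peak <= usable)


if __name__ == "__main__":
    for model, precision in viable_models(8.0, reserve_gb=0.5):
        print(f"{model} [{precision}]")
```

With half a gigabyte reserved for the desktop, only pruned SDXL and Flux Q4_0 clear the bar, which is why the rows at the edge of 8 GB depend on launch flags and a clean GPU.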

ComfyUI setup for 8GB

The ComfyUI review covers the full installation. The relevant part here is the launch command — on 8GB cards you need the right flags before you load a single model.

Note that --medvram, familiar from AUTOMATIC1111’s web UI, is not a ComfyUI flag; ComfyUI’s argument parser rejects it as unrecognized. ComfyUI picks a memory mode from the VRAM it detects, so start with:

python main.py --force-fp16

If you still hit OOM errors, escalate to:

python main.py --lowvram --force-fp16

What each flag does:

--lowvram: forces low-VRAM mode, which keeps only part of the diffusion model on the GPU and moves the rest in from system RAM as needed. Slower because of the constant CPU-GPU transfers, but fits tighter memory budgets.
--novram: the next step down, for when --lowvram still isn’t enough. Expect a large speed penalty.
--force-fp16: loads models that default to FP32 in half precision instead. Halves their VRAM footprint with minimal quality impact.
--cpu-vae: if you’re still crashing specifically at the VAE decode step (the final image render), this runs the VAE on the CPU. Adds 10–30 seconds but prevents the crash.

On Windows using the portable ComfyUI install, add flags to run_nvidia_gpu.bat:

.\python_embeded\python.exe -s ComfyUI\main.py --force-fp16 --windows-standalone-build

Pick one attention optimization at a time — xFormers, Flash Attention, SageAttention, or attention slicing. Running multiple simultaneously causes unexpected behavior.
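The escalation described above can be scripted as a ladder of progressively more conservative launch commands. This is a sketch, not an official launcher: `LADDER` and `launch_command` are hypothetical names, and the rungs themselves are an assumption, so confirm every flag against `python main.py --help` on your install before relying on it.

```python
import shlex
import sys

# Each rung trades speed for memory. The rungs are an assumption for this
# sketch; check every flag against `python main.py --help` on your install.
LADDER = [
    ["--force-fp16"],
    ["--lowvram", "--force-fp16"],
    ["--lowvram", "--force-fp16", "--cpu-vae"],
]


def launch_command(rung, python=sys.executable, entry="main.py"):
    """Build the argv list for the given rung, clamping to the valid range."""
    index = max(0, min(rung, len(LADDER) - 1))
    return [python, entry, *LADDER[index]]


if __name__ == "__main__":
    # Print each rung so it can be pasted into a shell or a .bat file.
    for rung in range(len(LADDER)):
        print(shlex.join(launch_command(rung, python="python")))
```

Start at rung 0 and move up one rung only after an actual out-of-memory error, since every step costs generation speed.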

SDXL on 8GB VRAM

Model and license

SDXL 1.0 is released under the CreativeML OpenRAIL++-M license, which permits commercial use subject to its use-based restrictions. Community fine-tuned derivatives often carry their own licenses, so check before commercial deployment.

The base model is 6.9 GB on disk. The pruned version removes the EMA weights and comes to ~4.7 GB on disk, with substantially lower peak VRAM during inference. Use the pruned version on 8GB; the full base model will push you into --lowvram territory even with other flags.

Pruned SDXL base weights are widely available on HuggingFace and CivitAI under the original Stability AI release or as community repacks.

Fix the VAE before generating

The default SDXL VAE is unstable in FP16: it can produce NaNs, so it typically has to decode at higher precision, and on 8GB cards that decode at 1024×1024 is where out-of-memory crashes occur. The fix is the sdxl-vae-fp16-fix model (~319 MB), a fine-tuned VAE that runs in FP16 without NaNs and avoids this failure mode.

Download it from HuggingFace (madebyollin/sdxl-vae-fp16-fix) and place it in ComfyUI/models/vae/. In your workflow, add a VAE Loader node pointing to this file, then wire it into your VAE Decode node instead of using the checkpoint’s bundled VAE.

Without this fix: 768×768 generates fine; 1024×1024 runs out of memory at decode roughly half the time.

Resolution

SDXL’s training resolution is 1024×1024. Going below that degrades output quality — it wasn’t designed for 512×512 the way SD 1.5 was. On 8GB with the pruned model and VAE fix:

1024×1024: works with --force-fp16; add --lowvram if it runs out of memory
1280×768 or 768×1280 (landscape/portrait): fine, similar VRAM
Above 1536 on any edge: expect OOM or very slow generation

For final upscaled output, generate at 1024×1024 and run an upscale workflow afterward rather than generating large natively.
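To pick a non-square size without guessing, one common approach is to hold the pixel count near 1024×1024 and snap both edges to a multiple of 64. The sketch below does that and also reports the latent shape the sampler actually works on (SDXL's VAE downsamples by 8 into 4 channels). The function names and the multiple-of-64 convention are assumptions for illustration; the 1536 edge limit comes from the list above.

```python
import math

TARGET_PIXELS = 1024 * 1024  # SDXL's training resolution, per the text above
MAX_EDGE = 1536              # beyond this the text above expects OOM on 8GB


def sdxl_resolution(aspect_w, aspect_h, multiple=64):
    """Pick (width, height) near TARGET_PIXELS for the given aspect ratio."""
    ratio = aspect_w / aspect_h
    width = math.sqrt(TARGET_PIXELS * ratio)
    height = math.sqrt(TARGET_PIXELS / ratio)

    def snap(value):
        # Round to the nearest multiple, never below one multiple.
        return max(multiple, round(value / multiple) * multiple)

    return snap(width), snap(height)


def latent_shape(width, height, channels=4, downscale=8):
    """Shape (C, H, W) of the latent tensor for an image of this size."""
    return (channels, height // downscale, width // downscale)


def within_budget(width, height):
    """True if neither edge exceeds the 8GB edge limit quoted above."""
    return max(width, height) <= MAX_EDGE


if __name__ == "__main__":
    for aspect in [(1, 1), (16, 9), (3, 4)]:
        w, h = sdxl_resolution(*aspect)
        print(aspect, (w, h), latent_shape(w, h), within_budget(w, h))
```

A 16:9 request comes out as 1344×768 and 3:4 as 896×1152, both close to one megapixel and well inside the edge limit.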

A working SDXL node graph

Standard ComfyUI workflow for SDXL:

Load Checkpoint → pruned SDXL weights
CLIP Text Encode × 2 → positive and negative prompt
Empty Latent Image → 1024×1024, batch size 1
KSampler → 20–25 steps, DPM++ 2M Karras (sampler dpmpp_2m, scheduler karras), CFG 7
VAE Loader → sdxl-vae-fp16-fix
VAE Decode → latent from KSampler, VAE from VAE Loader
Save Image

The ComfyUI custom nodes guide covers node packs worth adding on top of this baseline.
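The same graph can be written in ComfyUI's API (JSON) format and queued over HTTP, which is handy for batch runs. The sketch below assumes a local instance on the default port 8188; the two model filenames are hypothetical placeholders for whatever your files are actually called, and `sdxl_workflow` and `submit` are illustrative names.

```python
import json
import urllib.request


def sdxl_workflow(prompt, negative="", seed=0,
                  ckpt="sdxl_pruned_fp16.safetensors",   # hypothetical filename
                  vae="sdxl_vae_fp16_fix.safetensors"):  # hypothetical filename
    """Return the node graph in ComfyUI's API format.

    Links are [source_node_id, output_index]. Load Checkpoint outputs are
    MODEL=0, CLIP=1, VAE=2; the bundled VAE (index 2) is deliberately unused.
    """
    return {
        "1": {"class_type": "CheckpointLoaderSimple",
              "inputs": {"ckpt_name": ckpt}},
        "2": {"class_type": "CLIPTextEncode",
              "inputs": {"text": prompt, "clip": ["1", 1]}},
        "3": {"class_type": "CLIPTextEncode",
              "inputs": {"text": negative, "clip": ["1", 1]}},
        "4": {"class_type": "EmptyLatentImage",
              "inputs": {"width": 1024, "height": 1024, "batch_size": 1}},
        "5": {"class_type": "KSampler",
              "inputs": {"model": ["1", 0], "positive": ["2", 0],
                         "negative": ["3", 0], "latent_image": ["4", 0],
                         "seed": seed, "steps": 25, "cfg": 7.0,
                         "sampler_name": "dpmpp_2m", "scheduler": "karras",
                         "denoise": 1.0}},
        "6": {"class_type": "VAELoader", "inputs": {"vae_name": vae}},
        "7": {"class_type": "VAEDecode",
              "inputs": {"samples": ["5", 0], "vae": ["6", 0]}},
        "8": {"class_type": "SaveImage",
              "inputs": {"images": ["7", 0], "filename_prefix": "sdxl_8gb"}},
    }


def submit(workflow, host="127.0.0.1:8188"):
    """POST the graph to a running ComfyUI instance's /prompt endpoint."""
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    request = urllib.request.Request(
        f"http://{host}/prompt", data=body,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

With ComfyUI running, `submit(sdxl_workflow("a lighthouse at dusk"))` queues one image. Node 7 takes its VAE from node 6 rather than from the checkpoint, which is the whole point of the VAE fix.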

Flux.1 on 8GB VRAM

Flux.1 is a large diffusion transformer from Black Forest Labs. At FP16 it’s not remotely close to fitting in 8GB. GGUF quantization — the same format used for LLM compression — brings it into range.
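The arithmetic behind that claim is easy to sketch: weight size is roughly parameters × bits per weight ÷ 8. The example assumes about 12 billion parameters for the Flux.1 transformer and the nominal bits per weight of each format. Real files deviate somewhat because some tensors stay at higher precision, and peak VRAM adds activations and text encoders on top.

```python
# Nominal bits per weight. Q4_0 stores 32 weights in 18 bytes (4.5 bits each);
# Q5_K_S stores 256 weights in 176 bytes (5.5 bits each).
BITS_PER_WEIGHT = {"FP16": 16.0, "Q5_K_S": 5.5, "Q4_0": 4.5}

FLUX_PARAMS = 12e9  # assumption: ~12B parameters for the Flux.1 transformer


def weight_size_gb(params, fmt):
    """Approximate size of the weights alone, in decimal gigabytes."""
    return params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


if __name__ == "__main__":
    for fmt in BITS_PER_WEIGHT:
        print(f"{fmt:7s} ~{weight_size_gb(FLUX_PARAMS, fmt):.2f} GB")
```

That works out to about 24 GB for FP16, 8.25 GB for Q5_K_S, and 6.75 GB for Q4_0, in the same range as the figures this guide quotes.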

Licenses

Flux.1 Dev: FLUX.1-dev Non-Commercial License. Personal use, research, education. Cannot train competing models on it. Outputs can be used commercially, with restrictions.
Flux.1 Schnell: Apache 2.0. Fully open for commercial use.

If you need commercial licensing for generation work, Schnell is the clear choice. Quality is lower than Dev but still well above SDXL for prompt adherence.

Installing ComfyUI-GGUF

Via ComfyUI Manager (recommended): search for “ComfyUI-GGUF” by city96, install, restart.

Manual:

cd ComfyUI/custom_nodes
git clone https://github.com/city96/ComfyUI-GGUF
pip install -r ComfyUI-GGUF/requirements.txt

After restart, a “bootleg” node category appears. That’s where the GGUF-specific loaders live.

What to download

Three model components are required:

1. The UNet (main model) — the quantized .gguf file. Place in ComfyUI/models/unet/

flux1-dev-Q5_K_S.gguf (~8.1 GB on disk) — recommended for 8GB; best quality at this VRAM tier
flux1-dev-Q4_0.gguf (~7.1 GB) — use if Q5 is too tight or you’re stacking LoRAs

2. Text encoders — place in ComfyUI/models/clip/

clip_l.safetensors (~235 MB)
t5xxl_fp8_e4m3fn.safetensors — T5-XXL at fp8 precision; use fp8 rather than fp16 to keep VRAM under control

3. VAE — ae.safetensors (~335 MB). Place in ComfyUI/models/vae/

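The GGUF UNet files are published on HuggingFace by city96 (city96/FLUX.1-dev-gguf), the text encoders are in comfyanonymous/flux_text_encoders, and ae.safetensors ships in the black-forest-labs repositories. Mirrors exist on CivitAI.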
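A missing or misplaced file is the most common reason a Flux loader node shows an empty dropdown, so it can help to check the layout before launching. This sketch uses only the paths and filenames listed above; `missing_flux_files` is a hypothetical helper, and the UNet entry assumes you chose the Q5_K_S file.

```python
from pathlib import Path

# Relative paths taken from the list above. The UNet entry is the Q5_K_S file;
# swap in flux1-dev-Q4_0.gguf if that is the one you downloaded.
REQUIRED_FILES = [
    "models/unet/flux1-dev-Q5_K_S.gguf",
    "models/clip/clip_l.safetensors",
    "models/clip/t5xxl_fp8_e4m3fn.safetensors",
    "models/vae/ae.safetensors",
]


def missing_flux_files(comfy_root, required=REQUIRED_FILES):
    """Return the required files that are absent under comfy_root."""
    root = Path(comfy_root)
    return [rel for rel in required if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_flux_files("ComfyUI")
    if missing:
        print("Missing:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All Flux files present.")
```

Run it from the directory that contains your ComfyUI folder, or pass the install path explicitly.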

The Flux node graph

Flux does not use the standard Load Checkpoint node — it’s loaded in pieces:

Unet Loader (GGUF) → select your .gguf UNet file
DualCLIPLoader (GGUF) → clip_l.safetensors + t5xxl_fp8_e4m3fn.safetensors, type flux
CLIP Text Encode (Flux) → your prompt, guidance 3.5 (Flux ignores negative prompts, so leave the negative empty)
Empty SD3 Latent Image → 1024×1024
ModelSamplingFlux → max_shift 1.15, base_shift 0.5
KSampler or SamplerCustomAdvanced → 25–28 steps, Euler; with KSampler, set CFG to 1.0
VAE Loader → ae.safetensors
VAE Decode
Save Image
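For scripted runs, the Flux graph can also be written in ComfyUI's API format. Treat the sketch below as an approximation: the class and input names (`UnetLoaderGGUF`, `DualCLIPLoaderGGUF`, `CLIPTextEncodeFlux`, `ModelSamplingFlux` and their fields) are what the nodes are expected to expose, so verify them against an API-format export from your own install. `flux_workflow` and `dangling_links` are illustrative names.

```python
def flux_workflow(prompt, seed=0, unet="flux1-dev-Q5_K_S.gguf"):
    """Flux graph in ComfyUI API format; links are [node_id, output_index]."""
    size = {"width": 1024, "height": 1024}
    return {
        "1": {"class_type": "UnetLoaderGGUF", "inputs": {"unet_name": unet}},
        "2": {"class_type": "DualCLIPLoaderGGUF",
              "inputs": {"clip_name1": "clip_l.safetensors",
                         "clip_name2": "t5xxl_fp8_e4m3fn.safetensors",
                         "type": "flux"}},
        # Guidance is set on the Flux text-encode node in this sketch.
        "3": {"class_type": "CLIPTextEncodeFlux",
              "inputs": {"clip": ["2", 0], "clip_l": prompt,
                         "t5xxl": prompt, "guidance": 3.5}},
        # KSampler requires a negative input; an empty prompt fills the slot.
        "4": {"class_type": "CLIPTextEncodeFlux",
              "inputs": {"clip": ["2", 0], "clip_l": "",
                         "t5xxl": "", "guidance": 3.5}},
        "5": {"class_type": "EmptySD3LatentImage",
              "inputs": {**size, "batch_size": 1}},
        "6": {"class_type": "ModelSamplingFlux",
              "inputs": {"model": ["1", 0], "max_shift": 1.15,
                         "base_shift": 0.5, **size}},
        "7": {"class_type": "KSampler",
              "inputs": {"model": ["6", 0], "positive": ["3", 0],
                         "negative": ["4", 0], "latent_image": ["5", 0],
                         "seed": seed, "steps": 25, "cfg": 1.0,
                         "sampler_name": "euler", "scheduler": "simple",
                         "denoise": 1.0}},
        "8": {"class_type": "VAELoader", "inputs": {"vae_name": "ae.safetensors"}},
        "9": {"class_type": "VAEDecode",
              "inputs": {"samples": ["7", 0], "vae": ["8", 0]}},
        "10": {"class_type": "SaveImage",
               "inputs": {"images": ["9", 0], "filename_prefix": "flux_8gb"}},
    }


def dangling_links(workflow):
    """Return (node_id, input_name) pairs that point at a missing node id."""
    bad = []
    for node_id, node in workflow.items():
        for name, value in node["inputs"].items():
            if isinstance(value, list) and len(value) == 2 and value[0] not in workflow:
                bad.append((node_id, name))
    return bad
```

`dangling_links(flux_workflow("a red bicycle"))` returns an empty list when the wiring is intact, which makes it a cheap check after hand-editing node ids.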

Flux’s text adherence is significantly better than SDXL’s — short, precise prompts work better than the long trigger-word-laden strings common in SDXL LoRA workflows. Describe what you want; the model follows.

Q4 vs Q5: what you’re trading

Community-collected benchmarks put Q5_K_S at roughly 95% quality retention compared to full FP16, with the remaining 5% visible only in fine details — intricate text rendering, very fine patterns, and high-zoom facial features. Q4_0 drops to around 75–85% retention, with the degradation showing more in faces, skin textures, and complex scene composition.

For most 8GB use cases: Q5_K_S. Use Q4_0 only when you need the VRAM headroom for LoRAs or when generating at high resolution where Q5 pushes against the limit.

SDXL vs Flux on 8GB: side by side

	SDXL 1.0 (pruned FP16)	Flux.1 Dev (GGUF Q5_K_S)
Peak inference VRAM	~5.5–6.5 GB	~7.5–8 GB
VRAM headroom for LoRAs	Comfortable	Tight
Prompt adherence	Good	Excellent
Photorealism ceiling	Good	Superior
Generation speed	Faster	Significantly slower
LoRA ecosystem	Very large, mature	Growing quickly
ControlNet support	Mature	Limited, improving
License	CreativeML OpenRAIL++-M	FLUX.1-dev Non-Commercial
Commercial option	Allowed, subject to the license’s use restrictions	Schnell (Apache 2.0)
Setup complexity	Low	Medium

Pick SDXL if: faster iteration matters, you rely on LoRAs and ControlNet from the existing SDXL ecosystem, or you’re on AMD where GGUF has rougher support.

Pick Flux if: prompt fidelity and photorealism are the priority, you’re doing personal or non-commercial work, and you’re on CUDA hardware.

The honest day-to-day difference on an RTX 3070: SDXL generates images noticeably faster with excellent results for stylized and fine-tuned subjects. Flux takes longer but handles complex multi-subject prompts and photorealistic scenes that would require significant prompt engineering to get right in SDXL. Neither is strictly better — they optimize for different things.

When 8GB stops being enough

Some workflows hit a hard ceiling regardless of flags and quantization:

Video generation: AnimateDiff, Wan, and other temporal diffusion models need substantially more VRAM for the attention across frames. 8GB produces very short clips and crashes on anything ambitious.

LoRA training: Training SDXL LoRAs in Kohya SS needs 12–16 GB minimum. Fine-tuning Flux requires 24 GB+. Running inference on a trained model is one thing; training it is another category entirely.

Stacking multiple LoRAs on Flux: Two LoRAs on Flux Dev Q5_K_S will push past 8GB on most cards. Either drop to Q4 or accept one LoRA at a time.

Upscaled generation at native large resolutions: Generating at 2048×2048 natively isn’t feasible on 8GB. Upscale from 1024 instead.

For training workflows, RunPod rents A40 GPUs (48 GB VRAM) at under $0.30/hr — cheaper than buying a 24GB card if training is occasional rather than daily. For permanent home lab hardware recommendations and a breakdown of the 16GB and 24GB consumer tiers, see the GPU guides at runaihome.com.
