Jun 15, 2026

Bonsai Image 4B Review 2026: 1-Bit Local Image Gen

By AIFoss

Tags: bonsai-image, image-generation, flux, quantization, selfhosted, apple-silicon

TL;DR: Bonsai Image 4B is PrismML’s quantized fork of FLUX.2 Klein 4B that fits the diffusion transformer in 0.93 GB at 1-bit (or 1.21 GB ternary), under Apache 2.0. It’s the first 4B-class image model that runs on an iPhone. Quality holds up better than you’d expect, but it inherits Klein’s distilled ceiling — this is a portability win, not a quality breakthrough.

	Bonsai 1-bit	Bonsai Ternary	FLUX.2 Klein 4B (FP16)
Best for	Phones, tiny VRAM	Best size/quality balance	Desktop GPUs, max quality
Transformer size	0.93 GB	1.21 GB	7.75 GB
Quality retained	~88%	~95%	100% (baseline)
License	Apache 2.0	Apache 2.0	Apache 2.0
The catch	Visible quality drop	Still 6.4× smaller	Won’t fit a phone

Honest take: If you’re choosing between the two Bonsai builds on a desktop GPU, run the ternary one: 95% of Klein’s quality at roughly a sixth of the size is the obvious pick. The 1-bit version earns its place only on phones and sub-2-GB hardware.

PrismML released Bonsai Image 4B on May 26, 2026, and the pitch is unusual for the local image-gen space: instead of a new architecture, it’s an aggressive quantization of an existing Apache 2.0 model — FLUX.2 Klein 4B from Black Forest Labs. The interesting part isn’t the model. It’s that the diffusion transformer now fits in under a gigabyte and still produces usable images.

What Bonsai Image 4B actually is

Start with the base. FLUX.2 Klein 4B is a 4-billion-parameter text-to-image model distilled from Black Forest Labs’ larger FLUX.2 base, released under Apache 2.0. In FP16, its transformer weighs 7.75 GB and needs roughly 13 GB of VRAM to run comfortably: a squeeze on a 12 GB RTX 4070, out of reach on a phone.

Bonsai takes that exact transformer and crushes the weights down to extreme low-bit formats:

1-bit (binary): weights become {−1, +1}, with one shared FP16 scale per group of 128 weights. Final transformer: 0.93 GB — an 8.3× reduction.
Ternary: weights take three values {−1, 0, +1}. Final transformer: 1.21 GB — a 6.4× reduction.
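
The size figures above can be sanity-checked with a few lines of arithmetic. The sketch below recomputes the reduction factors and the average bits per weight implied by the file sizes. The helper names are made up for illustration, and applying a 128-weight group scale to a 2-bit ternary packing is an assumption: the article only states the group size for the 1-bit format.

```python
# Transformer sizes quoted in the article, in GB.
FP16_GB = 7.75
BONSAI_GB = {"1-bit": 0.93, "ternary": 1.21}


def reduction_factor(variant: str) -> float:
    """How many times smaller the quantized transformer is than FP16."""
    return FP16_GB / BONSAI_GB[variant]


def average_bits_per_weight(variant: str) -> float:
    """Bits per weight implied by file size, given FP16 spends 16 bits each."""
    return 16 * BONSAI_GB[variant] / FP16_GB


def packed_bits_per_weight(code_bits: int, group_size: int = 128,
                           scale_bits: int = 16) -> float:
    """Cost of one code plus its share of a single per-group scale."""
    return code_bits + scale_bits / group_size


if __name__ == "__main__":
    # Assumption: ternary codes are stored as 2-bit values (the INT2 builds).
    for variant, code_bits in (("1-bit", 1), ("ternary", 2)):
        print(
            variant,
            f"{reduction_factor(variant):.1f}x smaller,",
            f"{average_bits_per_weight(variant):.2f} bits/weight on average,",
            f"{packed_bits_per_weight(code_bits):.3f} bits/weight if fully packed",
        )
```

The averages come out higher than the fully packed figures, which is consistent with some tensors or metadata staying at higher precision, though the source does not break this down.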

The text encoder and VAE stay at higher precision, which is why image structure and prompt adherence survive the compression. Only the heavy transformer stack is quantized to 1-bit or ternary.

What you don’t get is a re-trained model. Bonsai is Klein with most of its precision thrown away and a clever packing scheme to make the result run fast. That framing matters for expectations — covered in the limitations below.

The quality numbers (and who measured them)

PrismML reports that the 1-bit build retains ~88% of FLUX.2 Klein 4B’s quality and the ternary build ~95%, measured against the FP16 baseline. These are the vendor’s own figures, not an independent benchmark, so treat them as a directional claim rather than gospel — there’s no public third-party eval as of June 2026.

That said, the numbers track with how low-bit quantization behaves elsewhere. If you’ve read our GGUF quantization guide, the pattern is familiar: ternary and 2-bit-ish formats hold up surprisingly well because the per-group FP16 scales recover most of the dynamic range, while true 1-bit shows visible degradation — softer detail, occasional structural slips, weaker fine text rendering.
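
To make the binary-versus-ternary gap concrete, here is a minimal sketch of both schemes applied to one small weight group. It is not PrismML's method, which the article does not describe: it uses a generic sign-plus-mean-magnitude binarization and a threshold-based ternarization, and the 0.7 threshold ratio is a commonly used heuristic assumed here for illustration. The sample group is hand-picked to mix large and near-zero weights.

```python
from typing import List, Tuple


def binarize(group: List[float]) -> Tuple[List[int], float]:
    """Map every weight to -1 or +1 and keep one scale (mean magnitude)."""
    scale = sum(abs(w) for w in group) / len(group)
    codes = [1 if w >= 0 else -1 for w in group]
    return codes, scale


def ternarize(group: List[float],
              threshold_ratio: float = 0.7) -> Tuple[List[int], float]:
    """Send small weights to 0, the rest to -1/+1, scaled by the kept mean."""
    mean_abs = sum(abs(w) for w in group) / len(group)
    cutoff = threshold_ratio * mean_abs
    codes = [0 if abs(w) <= cutoff else (1 if w > 0 else -1) for w in group]
    kept = [abs(w) for w, c in zip(group, codes) if c != 0]
    scale = sum(kept) / len(kept) if kept else 0.0
    return codes, scale


def squared_error(group: List[float], codes: List[int], scale: float) -> float:
    """Reconstruction error after dequantizing each code as code * scale."""
    return sum((w - c * scale) ** 2 for w, c in zip(group, codes))


if __name__ == "__main__":
    group = [0.9, -0.8, 0.05, -0.1, 1.0, -0.02, 0.7, 0.03]
    for name, quantize in (("binary", binarize), ("ternary", ternarize)):
        codes, scale = quantize(group)
        print(name, codes, round(scale, 3),
              round(squared_error(group, codes, scale), 4))
```

On this group the ternary reconstruction error is lower, because near-zero weights can stay at zero instead of being forced to plus or minus the scale. Real weight distributions will differ, so treat this as intuition rather than a prediction of image quality.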

The honest read: ternary is “I can’t easily tell it apart from Klein at a glance.” 1-bit is “good enough for a phone, clearly not your desktop output.”

Hardware and the formats you’ll actually download

Bonsai ships in two runtime flavors, and picking the wrong one wastes a download:

Variant	Runtime	HF repo	Runs on
1-bit CUDA	Gemlite INT1	`prism-ml/bonsai-image-binary-4B-gemlite-1bit`	NVIDIA GPUs
1-bit Apple	MLX	`prism-ml/bonsai-image-binary-4B-mlx-1bit`	Apple Silicon
Ternary CUDA	Gemlite INT2	`prism-ml/bonsai-image-ternary-4B-gemlite-2bit`	NVIDIA GPUs
Ternary Apple	MLX	`prism-ml/bonsai-image-ternary-4B-mlx-2bit`	Apple Silicon

On CUDA, the 1-bit weights use Gemlite’s INT1 packed format (INT2 for ternary). Because of runtime packing and alignment overhead, the on-disk CUDA pack is actually 1.08 GB for the 1-bit build, slightly larger than the raw 0.93 GB transformer, which trips people up. Peak GPU memory at 1024×1024 on an RTX 3080 is roughly 6.4 GiB end-to-end (transformer + VAE + activations), so an 8 GB card has headroom. That’s a meaningful step down from Klein’s ~13 GB FP16 footprint; see our Stable Diffusion on 8GB VRAM guide for where this sits among other options.
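
Because this section mixes GiB and GB, a short conversion helps when checking headroom. The sketch below converts the quoted peak, estimates the margin on an 8 GB-class card, and computes the packing overhead of the CUDA file. It assumes a card sold as 8 GB exposes about 8 GiB, which is typical but worth confirming with your driver tools. The function names are illustrative.

```python
GIB = 1024 ** 3  # bytes in one gibibyte
GB = 1000 ** 3   # bytes in one gigabyte


def gib_to_gb(gib: float) -> float:
    """Convert binary gibibytes to decimal gigabytes."""
    return gib * GIB / GB


def headroom_gib(card_gib: float, peak_gib: float) -> float:
    """Memory left on the card after the quoted peak usage."""
    return card_gib - peak_gib


def packing_overhead(packed_gb: float, raw_gb: float) -> float:
    """Fractional size increase of the packed file over the raw weights."""
    return (packed_gb - raw_gb) / raw_gb


if __name__ == "__main__":
    print(f"6.4 GiB peak = {gib_to_gb(6.4):.2f} GB")
    # Assumption: an "8 GB" card exposes roughly 8 GiB to applications.
    print(f"headroom on an 8 GiB card: {headroom_gib(8.0, 6.4):.1f} GiB")
    print(f"CUDA pack overhead vs raw: {packing_overhead(1.08, 0.93):.0%}")
```

The margin is real but not large, so other processes holding GPU memory can eat into it.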

On Apple Silicon, the MLX builds run natively, and this is where Bonsai’s headline claim lives: it’s the first image model in its parameter class to run directly on an iPhone. PrismML also shipped Bonsai Studio, an iOS app that defaults to the ternary variant.
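
Since the Gemlite and MLX builds are separate downloads, a small lookup avoids fetching the wrong one. The sketch below maps a variant and backend to the repo names from the table above. The function names and the backend labels "gemlite" and "mlx" are illustrative choices, and the CUDA check is passed in by the caller so the snippet does not depend on any GPU library.

```python
import platform
from typing import Optional

# Repo names copied from the table above.
REPOS = {
    ("1bit", "gemlite"): "prism-ml/bonsai-image-binary-4B-gemlite-1bit",
    ("1bit", "mlx"): "prism-ml/bonsai-image-binary-4B-mlx-1bit",
    ("ternary", "gemlite"): "prism-ml/bonsai-image-ternary-4B-gemlite-2bit",
    ("ternary", "mlx"): "prism-ml/bonsai-image-ternary-4B-mlx-2bit",
}


def detect_backend(has_cuda: bool, system: Optional[str] = None,
                   machine: Optional[str] = None) -> Optional[str]:
    """Return 'gemlite', 'mlx', or None when no supported runtime applies."""
    system = system or platform.system()
    machine = machine or platform.machine()
    if has_cuda:
        return "gemlite"
    if system == "Darwin" and machine == "arm64":  # Apple Silicon Mac
        return "mlx"
    return None


def pick_repo(variant: str, backend: Optional[str]) -> str:
    """Look up the repo for a variant/backend pair, failing loudly otherwise."""
    try:
        return REPOS[(variant, backend)]
    except KeyError:
        raise ValueError(f"no Bonsai build for {variant!r} on {backend!r}")


if __name__ == "__main__":
    backend = detect_backend(has_cuda=False)
    print(backend, pick_repo("ternary", backend) if backend else "unsupported")
```

With PyTorch installed, `torch.cuda.is_available()` is a reasonable value to pass as `has_cuda`.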

Running it on a CUDA box

The CUDA path depends on Gemlite for the low-bit kernels and diffusers for the pipeline. A minimal setup looks like this:

# Python 3.11, CUDA 12.x
pip install diffusers transformers accelerate gemlite torch
huggingface-cli download prism-ml/bonsai-image-binary-4B-gemlite-1bit \
  --local-dir ./bonsai-1bit

Generation uses a flow-matching Euler discrete sampler (FlowMatchEulerDiscreteScheduler in diffusers) with settings the model was tuned for:

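import torch
from diffusers import DiffusionPipeline

# Assumption: the folder loads via the generic diffusers loader; check the model card.
pipe = DiffusionPipeline.from_pretrained("./bonsai-1bit", torch_dtype=torch.bfloat16).to("cuda")
# 4 steps, guidance 1.0, shift 3.0 are the intended values (shift assumed set in the scheduler config)
image = pipe(
    prompt="a bonsai tree on a windowsill, soft morning light, 35mm photo",
    num_inference_steps=4,
    guidance_scale=1.0,
).images[0]
image.save("out.png")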

The 4-step count isn’t a corner you’re cutting; it’s the design point inherited from Klein’s distillation. Running more steps does not improve quality and can introduce artifacts, so don’t crank num_inference_steps to 20 out of habit. This is the opposite reflex from SDXL on Automatic1111, where more steps usually help.
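
For readers wondering what the "shift 3.0" setting does, the sketch below applies the static shift formula used by SD3-style flow-matching schedulers to a 4-step schedule. Two things are assumptions here: that Bonsai's pipeline applies the shift in exactly this form, and that the base noise levels are evenly spaced from 1.0 downward, which simplifies what a real scheduler does. The function name is illustrative.

```python
from typing import List


def shifted_sigmas(num_steps: int, shift: float) -> List[float]:
    """Evenly spaced noise levels, warped by a static flow-matching shift.

    Each base sigma s becomes shift * s / (1 + (shift - 1) * s), which keeps
    the endpoints fixed and pushes intermediate steps toward higher noise.
    """
    base = [1.0 - i / num_steps for i in range(num_steps)]
    shifted = [shift * s / (1.0 + (shift - 1.0) * s) for s in base]
    return shifted + [0.0]  # final step lands on the clean image


if __name__ == "__main__":
    print("no shift :", shifted_sigmas(4, 1.0))
    print("shift 3.0:", [round(s, 3) for s in shifted_sigmas(4, 3.0)])
```

With a shift above 1, more of the four steps are spent at high noise, where overall composition is decided. That is one plausible reason a distilled model ships with a fixed step count and shift tuned together.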

One gotcha worth flagging: the Gemlite kernels are CUDA-specific. There’s no CPU fallback that’s remotely usable, and the MLX builds and Gemlite builds are not interchangeable — download the one that matches your hardware or nothing runs.

Bonsai vs. running the full model

For most desktop users the real comparison isn’t 1-bit vs. ternary — it’s “should I bother with Bonsai at all, or just run FLUX.2 Klein 4B at full precision?”

	Bonsai Ternary	FLUX.2 Klein 4B FP16
Min VRAM (1024²)	~6.4 GiB	~13 GB
Quality	~95% of FP16	100%
Speed	4 steps, sub-second class	4 steps, sub-second class
Where it runs	Phone → 8 GB GPU	12 GB+ GPU

If you own a 12 GB or larger card, the full Klein model is right there, also Apache 2.0, also 4-step fast, with no quality compromise. Bonsai’s value proposition collapses to: it fits where Klein can’t. On a phone, an 8 GB GPU, or a low-RAM laptop, that’s the whole ballgame. On an RTX 4070 Ti or better, it’s a solution to a problem you don’t have.

For cloud bursts where you need throughput rather than portability, neither is the answer — a rented GPU on RunPod running full FLUX.2 will beat any quantized local build on images-per-dollar at scale.

When NOT to use Bonsai Image 4B

You have a 12 GB+ GPU. Run FLUX.2 Klein 4B at FP16 instead. Same license, same speed, full quality.
You need maximum fidelity or fine text rendering. The 1-bit build visibly degrades detail; even ternary is a distilled 4B model, not a flagship. For poster-quality output, look at larger Flux or SD 3.5 models.
You want a deep node-based pipeline. Bonsai is a single tuned generation path. If you need ControlNet, regional prompting, or complex graphs, ComfyUI with a full model is the tool.
You’re on AMD or CPU-only. The CUDA build needs Gemlite (NVIDIA), and there’s no practical CPU path. AMD users are out of luck for now.
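
The rules of thumb in this article can be condensed into a small decision helper. This is a sketch of the article's own guidance, not an official compatibility check: the 12 GB and 2 GB thresholds are the ones quoted in the text, the platform labels are made up for illustration, and it does not model resolution or activation memory.

```python
from typing import Optional


def recommend_build(platform_kind: str, memory_gb: float) -> Optional[str]:
    """Suggest a model per the article's guidance, or None if unsupported.

    platform_kind: 'nvidia', 'apple', 'amd', or 'cpu' (illustrative labels).
    memory_gb: VRAM on NVIDIA, or available unified memory on Apple devices.
    """
    if platform_kind not in ("nvidia", "apple"):
        return None  # no Gemlite on AMD, no practical CPU path
    if platform_kind == "nvidia" and memory_gb >= 12:
        return "FLUX.2 Klein 4B FP16"  # full quality fits, so skip Bonsai
    if memory_gb < 2:
        return "Bonsai 1-bit"  # sub-2-GB targets
    return "Bonsai ternary"  # best size/quality balance otherwise


if __name__ == "__main__":
    for kind, mem in (("nvidia", 16), ("nvidia", 8), ("apple", 1.5), ("amd", 16)):
        print(kind, mem, "->", recommend_build(kind, mem))
```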

Licensing — the genuinely good news

Both Bonsai builds ship under Apache 2.0, weights and code. Critically, the base model is too: FLUX.2 Klein 4B is Apache 2.0 and does not require a commercial license. That makes Bonsai one of the few genuinely clean commercial-use options in local image gen — a sharp contrast to FLUX.1 [dev], whose non-commercial license has tripped up countless projects. If you’re shipping a product that generates images on-device, this license stack is about as friction-free as it gets. Verify the current terms on the model card before you ship, but as of June 2026 there’s no asterisk.

FAQ

Is Bonsai Image 4B better than SDXL? Different trade-off. SDXL has a vastly larger LoRA and tooling ecosystem; Bonsai wins on size and on-device deployment. For a phone or an 8 GB card where SDXL struggles, Bonsai is the more practical choice. For a desktop with extensions and fine-tuning, SDXL’s ecosystem still wins.

Can I run it on Windows without an NVIDIA GPU? Not usefully. The CUDA build requires Gemlite kernels (NVIDIA only), and there’s no viable CPU fallback. Apple Silicon users get the MLX builds; AMD and CPU-only users are unsupported as of June 2026.

Why only 4 steps? Bonsai inherits FLUX.2 Klein’s distillation, which is tuned for 4-step generation. More steps don’t improve quality and can add artifacts. Leave num_inference_steps at 4.

1-bit or ternary — which should I download? Ternary unless you’re memory-constrained. It’s only ~0.3 GB larger but retains ~95% quality vs. the 1-bit build’s ~88%. Reserve 1-bit for phones and sub-2-GB targets.

Are the 88% / 95% quality numbers independently verified? No. They’re PrismML’s own measurements against the FP16 baseline. There’s no public third-party benchmark yet, so treat them as directional.
