May 25, 2026

Text Generation WebUI Review 2026: oobabooga Updated

By AIFoss · 11 min read

Tags: ai, opensource, selfhosted, llm, review

TextGen — the project most people still call “oobabooga” or “text-generation-webui” — has been the power user’s local LLM frontend since 2023. It’s where you end up when Ollama’s simplicity becomes a constraint: when you need to swap backends without restarting, run LoRA fine-tuning from the same interface, or wire in custom tool functions to a chat session. The flexibility is real, and so is the complexity cost.

This review covers v4.9, released May 20, 2026. License: AGPL-3.0. The project lives at github.com/oobabooga/textgen (47k+ stars); the repository was recently renamed from text-generation-webui to textgen.

What the v4.x series changed

The v4 release cycle was a significant overhaul. Three changes matter the most:

A native desktop app. v4.7.3 introduced Electron bundling: run textgen.bat on Windows or textgen on Linux/macOS, and a desktop window opens instead of a browser tab. You can still use --nowebui to run the server headlessly, or --listen to expose it on your network. The desktop window is optional, but it means non-developers can install TextGen like any other app.

A rebuilt UI. The same release overhauled the visual layer: Inter font replacing the old defaults, Lucide SVG icons replacing emoji buttons, a segmented control for chat mode selection, and a redesigned chat input. It now looks like a real product rather than a hackathon project.

A custom Gradio fork. This is the less visible but more important change. The v4.0 release replaced standard Gradio with a patched fork where “the UI now does far less redundant work on every update, startup is faster, SSE message delivery is instant instead of polling every 50 ms.” The visible effect: the chat interface feels noticeably more responsive than it did in v3.x.

v4.9 (the current build as of May 2026) adds MTP speculative decoding support — auto-enabled when loading MTP GGUF builds such as Qwen 3.6 MoE — along with live tokens/s and context size display during generation, and CORS and path traversal security fixes.

Installation

Three paths:

Portable builds are the simplest. Download the Windows .zip, Linux tarball, or macOS package from the releases page. Extract and run. The portable build includes Python, all dependencies, and Electron, so there is nothing to install separately. Total size is roughly 10GB after model download.

One-click installer (the start_windows.bat / start_linux.sh approach) uses Conda to set up a fresh Python environment. More flexible for development, more surface area for things to break.

Docker is the right choice if you’re running this on a server or NAS where you don’t want the GUI. The Docker image handles CUDA and ROCm environments cleanly.

# Portable launch — opens Electron window
./textgen

# Headless server mode — no browser window, just the API
./textgen --nowebui

# Listen on network (e.g. for other devices or Open WebUI)
./textgen --listen --listen-port 7860

The server starts at http://127.0.0.1:7860 by default. If you’re pairing it with Open WebUI as a front-end and want TextGen purely as an inference backend, --nowebui is your flag.
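When scripting against a headless instance, it helps to wait until the server actually answers before sending requests. A minimal sketch, assuming the default address quoted above; `http_probe` and `wait_until_ready` are hypothetical helper names, not part of TextGen:

```python
import time
import urllib.error
import urllib.request


def http_probe(url: str) -> bool:
    """Return True if the URL answers with any HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=2):
            return True
    except urllib.error.HTTPError:
        # The server replied (even with an error status), so it is up.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout: not ready yet.
        return False


def wait_until_ready(url="http://127.0.0.1:7860", timeout=60.0,
                     interval=1.0, probe=http_probe) -> bool:
    """Poll `url` until `probe` succeeds or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

The `probe` argument is injectable so the loop can be exercised without a running server; in practice you would call `wait_until_ready()` once after launching and only then start sending requests.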

The five backends

This is what sets TextGen apart from simpler runners. Under the Model tab, you choose which inference engine loads your model:

Backend	Best for	Format support
llama.cpp	General use, GGUF, cross-platform	GGUF (Q4–Q8, fp16)
ik_llama.cpp	Alternative llama.cpp with different architecture handling	GGUF
ExLlamaV3	Maximum GPU speed with EXL3 quantization	EXL3, GPTQ
Transformers	Hugging Face models, research use	fp16, bf16, AWQ
TensorRT-LLM	NVIDIA production inference	Engine files

For most users: llama.cpp for everyday GGUF models, ExLlamaV3 if you’re on NVIDIA and want significantly better throughput. The Transformers backend is the most flexible but also the slowest — useful for newly released models that haven’t been converted to GGUF yet.

The ability to switch backends without restarting the application, just by reloading the model under a different loader, is a genuine productivity advantage when you’re evaluating multiple models or formats.
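The table above boils down to a format-to-loader lookup. The sketch below transcribes it into code as a decision aid; the name-based heuristic in `guess_format` and the example file names are assumptions for illustration, not how TextGen itself selects a loader:

```python
from pathlib import Path

# Format -> loader, transcribed from the backend table. Illustrative only.
LOADER_BY_FORMAT = {
    "gguf": "llama.cpp",
    "exl3": "ExLlamaV3",
    "gptq": "ExLlamaV3",
    "awq": "Transformers",
    "fp16": "Transformers",
    "bf16": "Transformers",
    "engine": "TensorRT-LLM",
}


def guess_format(model_path: str) -> str:
    """Guess a model's format from its file or folder name (hypothetical heuristic)."""
    name = Path(model_path).name.lower()
    if name.endswith(".gguf"):
        return "gguf"
    if name.endswith(".engine"):
        return "engine"
    for tag in ("exl3", "gptq", "awq", "bf16"):
        if tag in name:
            return tag
    return "fp16"  # assumption: anything else is an unquantized checkpoint


def pick_loader(model_path: str) -> str:
    """Map a model path to the loader the table recommends for its format."""
    return LOADER_BY_FORMAT[guess_format(model_path)]


print(pick_loader("models/example-7b-Q4_K_M.gguf"))
print(pick_loader("models/example-8b-exl3-4.0bpw"))
```

ik_llama.cpp is left out of the lookup because it shares the GGUF format with llama.cpp; choosing between the two is a manual decision.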

Chat modes, personas, and multimodal

TextGen’s Chat tab covers more ground than Ollama’s chat interface or even Open WebUI in some areas:

Chat modes. Three options: instruct (standard assistant format), chat (freeform without system prompt), and chat-instruct (applies the model’s instruction template to chat history). Each handles the conversation format differently — if you’re getting weird output, this is often why.

Character personas. The tool ships with character card support and a persona system. You can define the AI’s name, description, personality, and greeting, save it as a card, and load it per conversation. There’s also a user profile system added in recent releases — save your name and bio to switch between personas consistently across sessions.

Multimodal. Vision models (LLaVA variants, Qwen-VL, etc.) work in TextGen with image attachment support. The app auto-detects sibling mmproj files when loading a multimodal GGUF — you don’t need to specify it manually as of v4.9.

File attachments. Text, PDF, and DOCX files can be attached to a conversation. This is basic RAG compared to a dedicated tool like AnythingLLM, but it’s useful for one-off document queries without setting up a full vector pipeline.

Tool calling. As of v4.x, models can call custom Python functions during chat. Tools live in user_data/tools/ as individual .py files. Five built-in examples: web_search, fetch_webpage, calculate, get_datetime, and roll_dice. Adding your own tool means writing a single Python file — no framework, no decorator hell. Tool calling currently works reliably with Qwen 3.5, DeepSeek V3.2, Llama 4, and GLM 5; test other models before depending on it in production.
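To give a sense of what a single-file tool might look like, here is a sketch of a dice roller. The module-level names `tool` (an OpenAI-style function schema) and `execute` are assumptions about the file layout; compare against the examples shipped in user_data/tools/ before relying on this shape:

```python
import random

# OpenAI-style function schema describing the tool to the model.
# Assumption: the loader reads a module-level dict like this one.
tool = {
    "type": "function",
    "function": {
        "name": "roll_dice",
        "description": "Roll one or more dice and return each result.",
        "parameters": {
            "type": "object",
            "properties": {
                "count": {"type": "integer", "description": "Number of dice (1-100)."},
                "sides": {"type": "integer", "description": "Sides per die (2-1000)."},
            },
            "required": [],
        },
    },
}


def execute(arguments: dict) -> dict:
    """Run the tool with model-supplied arguments, validating them first."""
    count = int(arguments.get("count", 1))
    sides = int(arguments.get("sides", 6))
    if not 1 <= count <= 100 or not 2 <= sides <= 1000:
        # Return an error the model can read instead of raising.
        return {"error": "count must be 1-100 and sides 2-1000"}
    rolls = [random.randint(1, sides) for _ in range(count)]
    return {"rolls": rolls, "total": sum(rolls)}
```

Whatever the exact layout, the point to carry over is the validation step: arguments come from the model, so treat them as untrusted input.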

LoRA training

The Training tab covers fine-tuning via LoRA — and this is where TextGen has no peer among local UI tools. The training system was overhauled in v4.0 to align with axolotl conventions: it now accepts OpenAI message format and ShareGPT conversation datasets, handles multi-turn chat with proper token masking, and supports resuming interrupted runs.
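Token masking in multi-turn training usually means the loss is computed only on assistant tokens, while system and user tokens are present as context but ignored. A toy sketch of the idea, not TextGen’s implementation: `toy_tokenize` is a stand-in for a real tokenizer, and real pipelines also handle chat-template tokens, which are omitted here.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def build_labels(messages, tokenize):
    """Return (input_ids, labels) where only assistant tokens keep their label."""
    input_ids, labels = [], []
    for msg in messages:
        ids = tokenize(msg["content"])
        input_ids.extend(ids)
        if msg["role"] == "assistant":
            labels.extend(ids)  # train on these tokens
        else:
            labels.extend([IGNORE_INDEX] * len(ids))  # context only, no loss
    return input_ids, labels


def toy_tokenize(text):
    """Stand-in tokenizer: one fake non-negative id per whitespace-separated word."""
    return [sum(map(ord, word)) % 1000 for word in text.split()]


conversation = [
    {"role": "user", "content": "hi there"},
    {"role": "assistant", "content": "hello"},
    {"role": "user", "content": "bye"},
    {"role": "assistant", "content": "see you"},
]
ids, labels = build_labels(conversation, toy_tokenize)
print(list(zip(ids, labels)))
```

Each position pairs an input id with either the same id (assistant turns) or -100 (everything else), so earlier assistant replies in a long conversation still contribute to the loss.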

You’re not going to fine-tune a 70B model with 8GB VRAM, but for 7B models on a 16GB+ card, this is a real option:

Training Tab → Dataset → Load (OpenAI JSONL format)
Training Tab → LoRA settings → Rank (8 or 16 for most tasks)
Training Tab → Start training

The output is a LoRA adapter you can load alongside the base model. If you want to go deeper — full fine-tunes, larger datasets, distributed training — Unsloth or axolotl are the right tools (see our Unsloth vs axolotl comparison). TextGen’s training tab is for targeted, on-device fine-tuning with minimal configuration.
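For the dataset step, the OpenAI message format is one JSON object per line, each with a `messages` list. A small sketch for building and sanity-checking such a file before loading it; the validation rules here are my own conservative assumptions, not checks TextGen is known to perform:

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_example(example: dict) -> None:
    """Raise ValueError if a record is not a usable chat example."""
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        raise ValueError("each record needs a non-empty 'messages' list")
    for msg in messages:
        if msg.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"unexpected role: {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str):
            raise ValueError("message content must be a string")
    if messages[-1]["role"] != "assistant":
        raise ValueError("last message should be an assistant reply to learn from")


def to_jsonl(examples) -> str:
    """Validate each record and serialize one JSON object per line."""
    lines = []
    for example in examples:
        validate_example(example)
        lines.append(json.dumps(example, ensure_ascii=False))
    return "\n".join(lines) + "\n"


sample = {
    "messages": [
        {"role": "user", "content": "What is GGUF?"},
        {"role": "assistant", "content": "A model file format used by llama.cpp."},
    ]
}
print(to_jsonl([sample]), end="")
```

Write the returned string to a .jsonl file with UTF-8 encoding and point the Training tab at it.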

API server

TextGen exposes an OpenAI-compatible REST API that covers /v1/chat/completions, /v1/completions, and /v1/models. An Anthropic-compatible layer exists for tools that expect that format. The API supports parallel requests across llama.cpp, ExLlamaV3, and TensorRT-LLM backends — added in v4.0 to handle multiple concurrent callers without serializing everything through a single queue.

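from openai import OpenAI

# Point the official OpenAI client at the local server.
# The client requires an api_key value, so a placeholder is passed here.
# Adjust host and port to wherever your API is actually listening.
client = OpenAI(base_url="http://localhost:7860/v1", api_key="none")
response = client.chat.completions.create(
    model="model",  # placeholder name; the server uses whatever model is loaded
    messages=[{"role": "user", "content": "What is GGUF?"}],
)
print(response.choices[0].message.content)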

If you’re building an application that needs a local OpenAI-compatible endpoint, this works. For production multi-user serving, vLLM is a better fit — TextGen’s API wasn’t designed for high-concurrency workloads.
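For light concurrent use, a thread pool is enough to exercise the parallel request handling mentioned above. A sketch using only the standard library; `chat_once` and `chat_many` are hypothetical helper names, and the base URL simply mirrors the earlier example, so adjust it to your setup:

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost:7860/v1"  # assumption: same address as the client example


def chat_once(prompt: str, base_url: str = BASE_URL) -> str:
    """POST one chat completion request and return the reply text."""
    payload = {"model": "model", "messages": [{"role": "user", "content": prompt}]}
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


def chat_many(prompts, send=chat_once, max_workers=4):
    """Send prompts concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, prompts))
```

Calling `chat_many(["What is GGUF?", "What is a LoRA?"])` sends both prompts at once. Keep `max_workers` small: whether requests truly run in parallel depends on the backend and on how much VRAM is left for additional sequences.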

Hardware requirements

TextGen runs on NVIDIA (CUDA), AMD (ROCm and Vulkan), Apple Silicon (Metal via llama.cpp), and CPU-only setups. Hardware requirements are driven by the model you load, not by the app itself.

Model size	Quantization	Minimum VRAM	Practical card
7B	Q4_K_M (GGUF)	6GB	RTX 3060 12GB, RX 6700 XT
13B	Q4_K_M (GGUF)	10GB	RTX 3080 10GB, RTX 4070
34B	Q4_K_M (GGUF)	20GB	RTX 3090/4090 24GB
70B	Q4_K_M (GGUF)	~40GB	2× RTX 3090, or partial CPU offload

Partial offloading lets you run models larger than your VRAM: llama.cpp’s --n-gpu-layers sets how many layers are placed on the GPU, and the remaining layers stay in system RAM and run on the CPU, at a significant speed penalty. A 70B model with 8GB VRAM and 64GB system RAM is technically runnable, just slow. For GPU rental while evaluating larger models, RunPod offers 80GB A100 and H100 instances with pre-configured environments.

For a deeper look at GPU options that make sense for local LLM work, see runaihome.com — they cover the RTX 4070/4090 vs 3090 tradeoff in detail.

System RAM: 16GB minimum for 7B models when GPU is taking most of the load; 32GB+ if you’re CPU-offloading anything. Storage: ~10GB for the app, plus model files (7B Q4_K_M ≈ 4.7GB, 70B Q4_K_M ≈ 40GB).
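The sizing figures above follow from simple arithmetic: weight size is roughly parameter count times bits per weight, divided by eight. A back-of-envelope sketch; the 4.85 bits-per-weight average for Q4_K_M, the 80-layer count for a 70B model, and the 1.5GB reserve for KV cache and buffers are all assumptions, and real files vary by architecture:

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def gpu_layers_that_fit(weights_gb, n_layers, vram_gb, reserve_gb=1.5):
    """Rough count of layers that fit in VRAM, assuming equal-sized layers.

    reserve_gb is a guessed allowance for KV cache and runtime buffers.
    """
    per_layer_gb = weights_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb / per_layer_gb))


Q4_K_M_BPW = 4.85  # assumed average bits per weight; varies by model

size_70b = estimate_weights_gb(70, Q4_K_M_BPW)
print(f"70B at Q4_K_M: ~{size_70b:.1f} GB of weights")
print("layers that fit on an 8GB card:", gpu_layers_that_fit(size_70b, 80, 8))
```

The second number is a starting point for --n-gpu-layers rather than a guarantee; longer contexts need a larger reserve, so lower the value if loading fails.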

TextGen vs Ollama vs LM Studio

	TextGen v4.9	Ollama	LM Studio
Setup time	10–20 min	2 min	3 min
UI	Web/Electron	CLI + third-party	Desktop GUI
Backends	5 (llama.cpp, ExLlamaV3, Transformers, TensorRT-LLM, ik_llama.cpp)	1 (llama.cpp-based)	1 (llama.cpp-based)
LoRA training	Yes (built-in)	No	No
Tool calling	Yes (custom Python)	No (via extensions)	No
Multimodal	Yes	Yes	Yes
OpenAI API	Yes	Yes	Yes
Model format	GGUF, EXL3, GPTQ, fp16, AWQ	GGUF	GGUF
License	AGPL-3.0	MIT	Proprietary
Ideal for	Power users, researchers, devs who need flexibility	CLI users, API server, simple local inference	Non-devs wanting a polished local GUI

The AGPL-3.0 license is worth noting. Unlike Ollama (MIT) or LM Studio (proprietary), TextGen uses a copyleft license with a network clause: if you modify it and distribute it, or let users interact with your modified version over a network, you must make the corresponding source available under the same license. For personal use this doesn’t matter. For commercial SaaS products, it’s a legal consideration.

When NOT to use TextGen

You want zero setup friction. If you just want to chat with a local model in five minutes, Ollama or GPT4All will get you there without choosing backends or managing install paths. TextGen rewards investment.

You need production multi-user serving. TextGen’s API handles development and light personal use. For real concurrency — dozens of simultaneous requests, SLA requirements, GPU utilization optimization — vLLM is the right tool.

You’re building an application, not a personal setup. The AGPL-3.0 license complicates commercial use. Ollama (MIT) or running inference via a commercial API is cleaner if legal encumbrances matter for your product.

You’re on Windows and want a polished consumer experience. LM Studio’s model browser, clean UI, and simple configuration are genuinely better for non-developer users who aren’t interested in what “backend” means.

The verdict

TextGen is the right tool if you want a local LLM UI, care about which inference backend you use, and do more than just chat: training adapters, running vision models, writing tools, or serving an API alongside a front-end. The v4.x series closed the gap on polish considerably; it’s no longer the hobbyist tool it was in 2023.

The complexity is still there. This is not the app you hand to someone who wants to try AI. It’s the app you reach for when you’ve outgrown the apps that are.

The AGPL license means you should check your use case before depending on it in a product. For personal and research use, the license is irrelevant.
