Text Generation WebUI Review 2026: oobabooga Updated


TextGen — the project most people still call “oobabooga” or “text-generation-webui” — has been the power user’s local LLM frontend since 2023. It’s where you end up when Ollama’s simplicity becomes a constraint: when you need to swap backends without restarting, run LoRA fine-tuning from the same interface, or wire in custom tool functions to a chat session. The flexibility is real, and so is the complexity cost.

This review covers v4.9, released May 20, 2026. License: AGPL-3.0. The project lives at github.com/oobabooga/textgen (47k+ stars) and was recently renamed from text-generation-webui to textgen.

What the v4.x series changed

The v4 release cycle was a significant overhaul. Three changes matter the most:

A native desktop app. v4.7.3 introduced Electron bundling: run textgen.bat on Windows or textgen on Linux/macOS, and a desktop window opens instead of a browser tab. The desktop window is optional. You can still use --nowebui to run the server headlessly, or --listen to expose it on your network. The practical upshot is that non-developers can install it like any other app.

A rebuilt UI. The same release overhauled the visual layer: the Inter font replaces the old defaults, Lucide SVG icons replace emoji buttons, chat mode selection uses a segmented control, and the chat input was redesigned. It now looks like a real product rather than a hackathon project.

A custom Gradio fork. This is the less visible but more important change. The v4.0 release replaced standard Gradio with a patched fork where “the UI now does far less redundant work on every update, startup is faster, SSE message delivery is instant instead of polling every 50 ms.” The visible effect: the chat interface feels noticeably more responsive than it did in v3.x.

v4.9 (the current build as of May 2026) adds MTP speculative decoding support — auto-enabled when loading MTP GGUF builds such as Qwen 3.6 MoE — along with live tokens/s and context size display during generation, and CORS and path traversal security fixes.

Installation

Three paths:

Portable builds are the simplest. Download the Windows .zip, Linux tarball, or macOS package from the releases page. Extract and run. The portable build includes Python, all dependencies, and Electron, so there is nothing to install separately. Budget roughly 10GB of disk for the app itself, plus whatever your model files need.

One-click installer (the start_windows.bat / start_linux.sh approach) uses Conda to set up a fresh Python environment. More flexible for development, more surface area for things to break.

Docker is the right choice if you’re running this on a server or NAS where you don’t want the GUI. The Docker image handles CUDA and ROCm environments cleanly.

# Portable launch — opens Electron window
./textgen

# Headless server mode — no browser window, just the API
./textgen --nowebui

# Listen on network (e.g. for other devices or Open WebUI)
./textgen --listen --listen-port 7860

The server starts at http://127.0.0.1:7860 by default. If you’re pairing it with Open WebUI as a front-end and want TextGen purely as an inference backend, --nowebui is your flag.
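If you launch TextGen from a script (a systemd unit, a NAS startup hook), it can help to assemble the flags in one place. The sketch below only combines the flags shown above; the helper name build_launch_command and the idea of wrapping the launcher in Python are assumptions for illustration, and whether every combination is meaningful depends on your TextGen version.

```python
from typing import List, Optional


def build_launch_command(
    executable: str = "./textgen",
    headless: bool = False,
    listen: bool = False,
    port: Optional[int] = None,
) -> List[str]:
    """Assemble an argument list from the launch flags shown above."""
    command = [executable]
    if headless:
        command.append("--nowebui")  # API only, no window
    if listen:
        command.append("--listen")  # accept connections from other machines
    if port is not None:
        command += ["--listen-port", str(port)]
    return command


if __name__ == "__main__":
    # Print the command for a network-visible instance on port 7860.
    print(" ".join(build_launch_command(listen=True, port=7860)))
```

The returned list can be passed straight to subprocess.run, which avoids shell quoting issues.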

The five backends

This is what sets TextGen apart from simpler runners. Under the Model tab, you choose which inference engine loads your model:

| Backend | Best for | Format support |
| --- | --- | --- |
| llama.cpp | General use, GGUF, cross-platform | GGUF (Q4–Q8, fp16) |
| ik_llama.cpp | Alternative llama.cpp with different architecture handling | GGUF |
| ExLlamaV3 | Maximum GPU speed with EXL3 quantization | EXL3, GPTQ |
| Transformers | Hugging Face models, research use | fp16, bf16, AWQ |
| TensorRT-LLM | NVIDIA production inference | Engine files |

For most users: llama.cpp for everyday GGUF models, ExLlamaV3 if you’re on NVIDIA and want significantly better throughput. The Transformers backend is the most flexible but also the slowest — useful for newly released models that haven’t been converted to GGUF yet.

The ability to switch backends without restarting the application, just by reloading the model under a different loader, is a genuine productivity advantage when you’re evaluating multiple models or formats.
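As a rough aid for choosing a loader, the table above can be turned into a lookup. This is a sketch, not anything TextGen ships: the function names are hypothetical, and guessing the format from a file or folder name relies on the common community habit of putting "exl3", "gptq", or "awq" in the name, which is an assumption.

```python
from pathlib import Path
from typing import List

# Loader names and supported formats, copied from the backend table.
FORMAT_TO_BACKENDS = {
    "gguf": ["llama.cpp", "ik_llama.cpp"],
    "exl3": ["ExLlamaV3"],
    "gptq": ["ExLlamaV3"],
    "awq": ["Transformers"],
    "fp16": ["Transformers"],
}


def guess_format(model_path: str) -> str:
    """Guess the model format from its file or folder name (heuristic)."""
    name = Path(model_path).name.lower()
    if name.endswith(".gguf"):
        return "gguf"
    for tag in ("exl3", "gptq", "awq"):
        if tag in name:
            return tag
    # Assumption: anything else is an unquantized Hugging Face checkpoint.
    return "fp16"


def candidate_backends(model_path: str) -> List[str]:
    """Return the loaders from the table that accept this format."""
    return FORMAT_TO_BACKENDS[guess_format(model_path)]


if __name__ == "__main__":
    for path in ("models/my-model-Q4_K_M.gguf", "models/my-model-exl3-4.0bpw"):
        print(path, "->", candidate_backends(path))
```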

Chat modes, personas, and multimodal

TextGen’s Chat tab covers more ground than Ollama’s chat interface or even Open WebUI in some areas:

Chat modes. Three options: instruct (standard assistant format), chat (freeform without system prompt), and chat-instruct (applies the model’s instruction template to chat history). Each handles the conversation format differently — if you’re getting weird output, this is often why.

Character personas. The tool ships with character card support and a persona system. You can define the AI’s name, description, personality, and greeting, save it as a card, and load it per conversation. There’s also a user profile system added in recent releases — save your name and bio to switch between personas consistently across sessions.

Multimodal. Vision models (LLaVA variants, Qwen-VL, etc.) work in TextGen with image attachment support. The app auto-detects sibling mmproj files when loading a multimodal GGUF — you don’t need to specify it manually as of v4.9.

File attachments. Text, PDF, and DOCX files can be attached to a conversation. This is basic RAG compared to a dedicated tool like AnythingLLM, but it’s useful for one-off document queries without setting up a full vector pipeline.

Tool calling. As of v4.x, models can call custom Python functions during chat. Tools live in user_data/tools/ as individual .py files. Five built-in examples: web_search, fetch_webpage, calculate, get_datetime, and roll_dice. Adding your own tool means writing a single Python file — no framework, no decorator hell. Tool calling currently works reliably with Qwen 3.5, DeepSeek V3.2, Llama 4, and GLM 5; test other models before depending on it in production.
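To give a feel for what a single-file tool might contain, here is a hedged sketch. The tool itself (convert_temperature) is hypothetical, and the exact names TextGen expects a file in user_data/tools/ to export are not covered in this review, so treat the layout as an assumption and compare it with the built-in examples before using it. The schema follows the OpenAI function-calling format, which tool-capable models are commonly trained against.

```python
import json

# OpenAI-style function schema describing the tool to the model.
# Assumption: how TextGen discovers this schema may differ from this layout.
TOOL_SCHEMA = {
    "type": "function",
    "function": {
        "name": "convert_temperature",
        "description": "Convert a temperature to Celsius or Fahrenheit.",
        "parameters": {
            "type": "object",
            "properties": {
                "value": {"type": "number"},
                "to_unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["value", "to_unit"],
        },
    },
}


def convert_temperature(value: float, to_unit: str) -> str:
    """Convert `value` into `to_unit` and return a JSON string for the model."""
    if to_unit == "celsius":
        result = (value - 32) * 5 / 9
    elif to_unit == "fahrenheit":
        result = value * 9 / 5 + 32
    else:
        raise ValueError(f"unsupported unit: {to_unit}")
    return json.dumps({"value": round(result, 2), "unit": to_unit})


if __name__ == "__main__":
    print(convert_temperature(212, "celsius"))
```

Returning a JSON string rather than a bare number keeps the result self-describing when it is fed back into the conversation.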

LoRA training

The Training tab covers fine-tuning via LoRA — and this is where TextGen has no peer among local UI tools. The training system was overhauled in v4.0 to align with axolotl conventions: it now accepts OpenAI message format and ShareGPT conversation datasets, handles multi-turn chat with proper token masking, and supports resuming interrupted runs.
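The two dataset formats named above differ mainly in key names, so converting between them is mechanical. The sketch below turns ShareGPT-style records into OpenAI message-format JSONL. The function names are hypothetical, and it assumes the usual ShareGPT speaker labels ("system", "human", "gpt"); records using other labels would need extra entries in the map.

```python
import json
from typing import Dict, Iterable, List

# ShareGPT speaker labels mapped to OpenAI chat roles.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_messages(record: Dict) -> Dict:
    """Convert one ShareGPT conversation into OpenAI message format."""
    messages: List[Dict[str, str]] = []
    for turn in record["conversations"]:
        messages.append({"role": ROLE_MAP[turn["from"]], "content": turn["value"]})
    return {"messages": messages}


def to_jsonl(records: Iterable[Dict]) -> str:
    """Serialize converted records as JSONL, one conversation per line."""
    lines = (json.dumps(sharegpt_to_messages(r), ensure_ascii=False) for r in records)
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [
        {
            "conversations": [
                {"from": "human", "value": "What is GGUF?"},
                {"from": "gpt", "value": "A file format for quantized models."},
            ]
        }
    ]
    print(to_jsonl(sample))
```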

You’re not going to fine-tune a 70B model with 8GB VRAM, but for 7B models on a 16GB+ card, this is a real option:

Training Tab → Dataset → Load (OpenAI JSONL format)
Training Tab → LoRA settings → Rank (8 or 16 for most tasks)
Training Tab → Start training
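To see what the rank setting costs, recall that LoRA adds two small matrices per adapted weight, one of shape rank × d_in and one of shape d_out × rank, so each adapted weight contributes rank × (d_in + d_out) trainable parameters. The numbers below are illustrative assumptions, not TextGen defaults: a 7B Llama-style model with hidden size 4096 and 32 layers, adapting two projection matrices per layer.

```python
def lora_trainable_params(
    rank: int,
    d_in: int,
    d_out: int,
    matrices_per_layer: int,
    n_layers: int,
) -> int:
    """Count the trainable parameters LoRA adds for the given configuration."""
    per_matrix = rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank
    return per_matrix * matrices_per_layer * n_layers


if __name__ == "__main__":
    for rank in (8, 16):
        total = lora_trainable_params(rank, 4096, 4096, 2, 32)
        print(f"rank {rank}: {total:,} trainable parameters")
```

Under these assumptions rank 8 trains about 4.2 million parameters and rank 16 about 8.4 million, a small fraction of the 7 billion in the base model, which is why a 16GB card is enough.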

The output is a LoRA adapter you can load alongside the base model. If you want to go deeper — full fine-tunes, larger datasets, distributed training — Unsloth or axolotl are the right tools (see our Unsloth vs axolotl comparison). TextGen’s training tab is for targeted, on-device fine-tuning with minimal configuration.

API server

TextGen exposes an OpenAI-compatible REST API that covers /v1/chat/completions, /v1/completions, and /v1/models. An Anthropic-compatible layer exists for tools that expect that format. The API supports parallel requests across llama.cpp, ExLlamaV3, and TensorRT-LLM backends — added in v4.0 to handle multiple concurrent callers without serializing everything through a single queue.

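from openai import OpenAI

# api_key is a placeholder; the OpenAI client just requires a non-empty value.
# Older text-generation-webui releases exposed the API on port 5000,
# so adjust the port to match your setup.
client = OpenAI(base_url="http://localhost:7860/v1", api_key="none")
response = client.chat.completions.create(
    model="model",  # placeholder; the server uses whatever model is loaded
    messages=[{"role": "user", "content": "What is GGUF?"}]
)
print(response.choices[0].message.content)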

If you’re building an application that needs a local OpenAI-compatible endpoint, this works. For production multi-user serving, vLLM is a better fit — TextGen’s API wasn’t designed for high-concurrency workloads.
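For light concurrent use, a small thread pool is enough to exercise the parallel request support mentioned above. This is a minimal sketch using only the standard library: the helper names are hypothetical, the base URL is the one used in the example above and may differ on your machine, and the response parsing assumes the standard OpenAI chat completion shape.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def build_payload(prompt: str, model: str = "model") -> dict:
    """Build a /v1/chat/completions request body for a single user prompt."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


def post_chat(payload: dict, base_url: str = "http://localhost:7860/v1") -> str:
    """POST one chat request and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


def ask_many(
    prompts: List[str],
    send: Callable[[dict], str] = post_chat,
    max_workers: int = 4,
) -> List[str]:
    """Send prompts concurrently; replies come back in prompt order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda p: send(build_payload(p)), prompts))


if __name__ == "__main__":
    # Requires a running server with a model loaded.
    for reply in ask_many(["What is GGUF?", "What is a LoRA adapter?"]):
        print(reply)
```

Passing the sender in as a parameter lets you swap in a stub for testing, or the OpenAI client from the earlier example, without touching the fan-out logic.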

Hardware requirements

TextGen runs on NVIDIA (CUDA), AMD (ROCm and Vulkan), Apple Silicon (Metal via llama.cpp), and CPU-only setups. The requirement is whatever the model needs, not what the app itself needs.

| Model size | Quantization | Minimum VRAM | Practical card |
| --- | --- | --- | --- |
| 7B | Q4_K_M (GGUF) | 6GB | RTX 3060 12GB, RX 6700 XT |
| 13B | Q4_K_M (GGUF) | 10GB | RTX 3080 10GB, RTX 4070 |
| 34B | Q4_K_M (GGUF) | 20GB | RTX 3090/4090 24GB |
| 70B | Q4_K_M (GGUF) | ~35GB | 2× RTX 3090, or a smaller card with CPU offload |

CPU offloading lets you run larger models by keeping some layers in system RAM, at a significant speed penalty. In llama.cpp, --n-gpu-layers sets how many layers are placed on the GPU; the rest run on the CPU. A 70B model with 8GB VRAM and 64GB system RAM is technically runnable, just slow. For GPU rental while evaluating larger models, RunPod offers 80GB A100 and H100 instances with pre-configured environments.
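A back-of-the-envelope estimate shows how little of a 70B model fits on a small card. The sketch below assumes all layers are the same size and reserves a fixed amount of VRAM for the KV cache and runtime buffers; both are simplifications, the 1.5GB reserve is a guess, and the 80-layer count is an assumption that matches Llama-family 70B models. Use the result as a starting value for --n-gpu-layers and adjust from there.

```python
import math


def gpu_layers_that_fit(
    model_size_gb: float,
    n_layers: int,
    vram_gb: float,
    reserved_gb: float = 1.5,
) -> int:
    """Estimate how many layers fit in VRAM, assuming equal-sized layers."""
    per_layer_gb = model_size_gb / n_layers
    usable_gb = max(vram_gb - reserved_gb, 0.0)  # leave room for cache and buffers
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


if __name__ == "__main__":
    # 70B Q4_K_M at roughly 40GB, 80 layers assumed, on an 8GB card.
    print(gpu_layers_that_fit(40.0, 80, 8.0))
```

With these inputs only 13 of 80 layers land on the GPU, which is why the 8GB scenario runs but runs slowly.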

For a deeper look at GPU options that make sense for local LLM work, see runaihome.com — they cover the RTX 4070/4090 vs 3090 tradeoff in detail.

System RAM: 16GB minimum for 7B models when GPU is taking most of the load; 32GB+ if you’re CPU-offloading anything. Storage: ~10GB for the app, plus model files (7B Q4_K_M ≈ 4.7GB, 70B Q4_K_M ≈ 40GB).

TextGen vs Ollama vs LM Studio

| | TextGen v4.9 | Ollama | LM Studio |
| --- | --- | --- | --- |
| Setup time | 10–20 min | 2 min | 3 min |
| UI | Web/Electron | CLI + third-party | Desktop GUI |
| Backends | 5 (llama.cpp, ExLlamaV3, Transformers, TensorRT-LLM, ik_llama.cpp) | 1 (llama.cpp-based) | 1 (llama.cpp-based) |
| LoRA training | Yes (built-in) | No | No |
| Tool calling | Yes (custom Python) | No (via extensions) | No |
| Multimodal | Yes | Yes | Yes |
| OpenAI API | Yes | Yes | Yes |
| Model format | GGUF, EXL3, GPTQ, fp16, AWQ | GGUF | GGUF |
| License | AGPL-3.0 | MIT | Proprietary |
| Ideal for | Power users, researchers, devs who need flexibility | CLI users, API server, simple local inference | Non-devs wanting a polished local GUI |

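The AGPL-3.0 license is worth noting. Unlike Ollama (MIT) or LM Studio (proprietary), AGPL-3.0 is a copyleft license with a network clause: if you modify TextGen and let users interact with it over a network, you must offer those users the source of your modified version, and any distributed derivative work must also be AGPL-licensed. For personal use this doesn’t matter. For commercial SaaS products, it’s a legal consideration.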

When NOT to use TextGen

You want zero setup friction. If you just want to chat with a local model in five minutes, Ollama or GPT4All will get you there without choosing backends or managing install paths. TextGen rewards investment.

You need production multi-user serving. TextGen’s API handles development and light personal use. For real concurrency — dozens of simultaneous requests, SLA requirements, GPU utilization optimization — vLLM is the right tool.

You’re building an application, not a personal setup. The AGPL-3.0 license complicates commercial use. Ollama (MIT) or running inference via a commercial API is cleaner if legal encumbrances matter for your product.

You’re on Windows and want a polished consumer experience. LM Studio’s model browser, clean UI, and simple configuration are genuinely better for non-developer users who aren’t interested in what “backend” means.

The verdict

TextGen is the correct tool if you’re at the intersection of: wanting a local LLM UI, caring about which inference backend you use, and doing more than just chat — training adapters, running vision models, writing tools, or serving an API alongside a front-end. The v4.x series closed the gap on polish considerably; it’s no longer the hobbyist tool it was in 2023.

The complexity is still there. This is not the app you hand to someone who wants to try AI. It’s the app you reach for when you’ve outgrown the apps that are.

The AGPL license means you should check your use case before depending on it in a product. For personal and research use, the license is irrelevant.
