Jun 1, 2026

KoboldCpp vs Ollama vs llama.cpp for Creative Writing 2026

By AIFoss · 12 min read

TL;DR: KoboldCpp wins for fiction and roleplay outright — its sampler stack, story mode, and native SillyTavern API make it the only one purpose-built for creative text. Ollama is better for developers who need a clean API for chat-first apps. llama.cpp direct is for power users who want raw token control and are comfortable scripting their own pipelines.

	KoboldCpp v1.114.1	Ollama v0.24.0	llama.cpp (b9145)
Best for	Roleplay, fiction, long-form story	Developer chatbots, API consumers	Custom scripts, research, maximum flexibility
Sampler control	Full (DRY, mirostat, CFG, XTC, temperature, top-k, top-p, min-p)	Partial (temperature, top-p, top-k via API params)	Full via server params or completion endpoint
Prompt formatting	Raw or template — your choice	Forces model chat template by default	Raw via /completion endpoint
Story features	Story mode, World Info, lorebooks, branching	None	None
License	AGPLv3	MIT	MIT
Setup effort	Download single binary	Install script + `ollama pull`	Build from source or download binary

Honest take: Use KoboldCpp if you’re writing fiction or doing roleplay. Use Ollama if you’re building a developer project. Use llama.cpp raw only if you want to pipe output through a script and know what you’re doing.

Why the Generic “LLM Runner” Reviews Miss Creative Writers

Every comparison of local LLM runners treats them as interchangeable chat backends. For code generation or Q&A, they largely are. For fiction writing and roleplay, the differences are sharp and practical.

Creative writers and game developers need three things that general-purpose LLM runners treat as afterthoughts:

Sampler control. The quality of generated prose is highly sensitive to sampling parameters, and above all to repetition handling, the single biggest failure mode in long-form generation. A model that keeps cycling back to the same phrases or sentence structures destroys narrative immersion. DRY (Don’t Repeat Yourself), mirostat, and XTC samplers exist to fix exactly this, and not all runners expose them.
Prompt formatting freedom. Chat-optimized frontends force all input through a structured Human/Assistant message format. That’s fine for Q&A. Fiction generation often needs raw continuation — you give the model partial prose and it keeps writing, not a chatbot response. Runners that lock you into a chat template break this workflow.
Story memory. A 32k token context fills up. You need tools for deciding what stays in context (scene summaries, character descriptions, active plot) versus what gets trimmed. Generic runners don’t address this at all.

The Three Tools at a Glance

KoboldCpp v1.114.1 is built directly on llama.cpp’s inference core but wraps it in a GUI and API layer specifically designed for creative text generation. It’s a single executable — download, point at a GGUF file, run. Maintained by LostRuins on GitHub under AGPLv3. Active development, with releases roughly every two to four weeks.

Ollama v0.24.0 is the most popular local LLM runner for developers. It offers a clean CLI, a model library reachable through ollama pull, and an OpenAI-compatible REST API. The template system handles dozens of model formats automatically. MIT license. Excellent for building chat applications; not designed for fiction workflows.

llama.cpp (build b9145) is the underlying inference engine that KoboldCpp (and Ollama, for many operations) is built on. When you run it directly via llama-server or llama-cli, you get the full sampler chain without any UI layer. Maximum control, minimum hand-holding. MIT license.

Sampler Control: The Clearest Win for KoboldCpp

This is where creative writers should pay the most attention.

KoboldCpp exposes the full sampling stack from its GUI and API: temperature, top-k, top-p, min-p, tail-free sampling, typical sampling, repetition penalty, presence penalty, frequency penalty, DRY, mirostat (v1 and v2), CFG (classifier-free guidance), and XTC.

For creative writing specifically, three samplers matter most:

DRY (Don’t Repeat Yourself) is an anti-repetition sampler designed as a more nuanced alternative to the blunt repetition-penalty approach. Instead of penalizing every repeated token equally, DRY looks for token sequences that already appeared in the context and penalizes the tokens that would extend them, with the penalty growing as the repeated sequence gets longer. This matters enormously in long-form prose: the model is discouraged from reproducing the same descriptive phrase four paragraphs later, even where repetition penalty alone wouldn’t catch it.
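To make the idea concrete, here is a minimal sketch of a DRY-style penalty in Python. It is a simplification and not KoboldCpp’s or llama.cpp’s actual implementation: it ignores sequence breakers and the penalty range. It assumes the commonly described formula, multiplier × base^(match length − allowed length). The function name `dry_penalties` is made up for this illustration.

```python
def dry_penalties(context, multiplier=0.8, base=1.75, allowed_length=2):
    """Return {token: penalty} for tokens that would extend a repeated sequence.

    Simplified illustration: no sequence breakers, no penalty range.
    """
    penalties = {}
    n = len(context)
    for i in range(n - 1):
        # Count how many tokens ending at position i match the tokens
        # ending at the very end of the context.
        match_len = 0
        while match_len <= i and context[i - match_len] == context[n - 1 - match_len]:
            match_len += 1
        if match_len >= allowed_length:
            # The token that followed the earlier occurrence would
            # continue the repetition, so it gets penalized.
            next_token = context[i + 1]
            penalty = multiplier * base ** (match_len - allowed_length)
            penalties[next_token] = max(penalties.get(next_token, 0.0), penalty)
    return penalties


context = ["the", "old", "keeper", "said", "the", "old"]
print(dry_penalties(context))
```

A sampler would subtract each penalty from the matching token’s logit before the rest of the chain runs. Because the penalty grows exponentially with match length, short echoes are tolerated while long verbatim repeats become very unlikely.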

Mirostat controls the perplexity of output directly rather than adjusting raw token probabilities. It adapts sampling parameters in real time to target a specific “surprisingness” level. For fiction generation, Mirostat v2 with a tau of around 5.0 tends to produce more coherent long-form prose than pure top-p sampling.
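The feedback loop behind Mirostat v2 can be sketched in a few lines of Python. This is an illustrative simplification that works on a plain probability dictionary, not the code any of these runners ship. The function name `mirostat_v2_step` and the eta value are assumptions for the example.

```python
import math
import random


def mirostat_v2_step(probs, mu, tau=5.0, eta=0.1, rng=random):
    """One simplified Mirostat v2 step. Returns (token, new_mu)."""
    # A token's "surprise" is -log2(p). Drop tokens more surprising than mu.
    candidates = {t: p for t, p in probs.items() if p > 0 and -math.log2(p) <= mu}
    if not candidates:
        # Always keep at least the most likely token.
        best = max(probs, key=probs.get)
        candidates = {best: probs[best]}
    total = sum(candidates.values())
    tokens = list(candidates)
    weights = [candidates[t] / total for t in tokens]
    token = rng.choices(tokens, weights=weights, k=1)[0]
    # Feedback: nudge mu so the observed surprise drifts toward tau.
    observed = -math.log2(candidates[token] / total)
    new_mu = mu - eta * (observed - tau)
    return token, new_mu


token, mu = mirostat_v2_step({"a": 0.5, "b": 0.25, "c": 0.25}, mu=1.5)
print(token, mu)
```

When the sampled token is less surprising than tau, mu rises and the next step admits riskier tokens. When it is more surprising, mu falls and the candidate set tightens.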

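XTC (Exclude Top Choices) works in the opposite direction from most truncation samplers: when several tokens clear a probability threshold, it removes all of them except the least likely one, forcing the model toward less predictable but still viable choices. It fires only with a configurable probability, so not every token is affected. XTC comes from the same author as DRY, and the two work well together.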
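A rough Python sketch of the XTC rule, assuming the usual two knobs (a probability threshold and a chance of firing). The function name `xtc_filter` is hypothetical, and the real implementations operate on logits inside the sampler chain rather than on a dictionary.

```python
import random


def xtc_filter(probs, threshold=0.1, probability=0.5, rng=random):
    """Drop every above-threshold token except the least likely of them."""
    # XTC only fires some of the time.
    if rng.random() >= probability:
        return dict(probs)
    above = [t for t, p in probs.items() if p >= threshold]
    # With fewer than two viable choices there is nothing to exclude.
    if len(above) < 2:
        return dict(probs)
    keep = min(above, key=probs.get)
    return {t: p for t, p in probs.items() if t == keep or p < threshold}


probs = {"the": 0.5, "a": 0.3, "storm": 0.15, "zebra": 0.05}
print(xtc_filter(probs, threshold=0.1, probability=1.0))
```

The caller is expected to renormalize the surviving probabilities before sampling. Keeping the least likely above-threshold token is what stops the filter from pushing the model into nonsense.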

Ollama’s API exposes temperature, top_k, top_p, repeat_penalty, and a handful of others, but DRY, mirostat, XTC, and CFG are not available through the standard Ollama API. You can’t set them in a Modelfile either.

llama.cpp’s llama-server exposes the full sampler chain, with the same defaults as KoboldCpp in many cases, via its /completion endpoint. The default sampler order is: penalties, DRY, top_n_sigma, top_k, typ_p, top_p, min_p, XTC, temperature. You can override any of these per request in JSON. So llama.cpp direct is a functional alternative for sampler control, just with no UI.

Example llama-server request with DRY enabled:

curl http://localhost:8080/completion \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "The old lighthouse keeper looked out at the storm and",
    "n_predict": 200,
    "temperature": 0.75,
    "dry_multiplier": 0.8,
    "dry_base": 1.75,
    "dry_allowed_length": 2,
    "mirostat": 2,
    "mirostat_tau": 5.0,
    "repeat_penalty": 1.1
  }'

KoboldCpp wraps this same kind of control behind a GUI slider set, making it accessible to writers who don’t want to craft JSON payloads.

Prompt Formatting: Chat Templates vs Raw Continuation

This is the second critical difference for fiction writing.

Ollama uses a template system that converts chat messages into the input format each model expects. When you use ollama run mistral, it wraps your prompt in Mistral’s [INST] tokens. When you use the API, you send structured messages and Ollama handles formatting. For chat, this is exactly right. For raw continuation, it’s a problem.

Fiction generation often means: given this partial paragraph, continue writing. You don’t want the model to interpret that as a user message and generate a “helpful assistant” response. You want it to keep writing prose in the same voice and style. Ollama’s template system makes this harder than it should be. You can set raw: true on an /api/generate request to bypass the template, or write a custom TEMPLATE in a Modelfile, but neither is the path of least resistance.
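For completeness, Ollama’s /api/generate endpoint accepts a raw flag that skips the chat template for a single request. The sketch below assumes a default Ollama install on port 11434 and a model tagged mistral. The helper names `build_raw_request` and `continue_story` are made up for this example.

```python
import json
import urllib.request


def build_raw_request(prompt, model="mistral", num_predict=200):
    """Build an Ollama /api/generate payload that bypasses the chat template."""
    return {
        "model": model,
        "prompt": prompt,
        "raw": True,      # send the prompt to the model unformatted
        "stream": False,  # return one JSON object instead of a stream
        "options": {"num_predict": num_predict, "temperature": 0.75},
    }


def continue_story(prompt, url="http://localhost:11434/api/generate"):
    """POST the payload and return the generated continuation."""
    body = json.dumps(build_raw_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


# Usage (requires a running Ollama server):
# print(continue_story("The old lighthouse keeper looked out at the storm and"))
```

This gets you raw continuation, but the sampler limits described above still apply.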

KoboldCpp and llama.cpp raw both support true continuation by default. KoboldCpp’s “Story Mode” in the built-in UI is designed around exactly this: you write a passage, press continue, and the model picks up the thread without any chat wrapper. The KoboldAI-style /api/v1/generate endpoint works the same way.
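A minimal sketch of calling KoboldCpp’s KoboldAI-compatible generate endpoint from Python, assuming the default port 5001 and the /api/v1/generate path. The helper names are hypothetical and the parameter values are placeholders, not tuned recommendations.

```python
import json
import urllib.request


def build_kobold_request(prompt, max_length=200, max_context_length=8192):
    """Build a KoboldAI-style generate payload for raw continuation."""
    return {
        "prompt": prompt,
        "max_length": max_length,                  # tokens to generate
        "max_context_length": max_context_length,  # context budget in tokens
        "temperature": 0.75,
        "rep_pen": 1.1,
    }


def kobold_generate(prompt, url="http://localhost:5001/api/v1/generate"):
    """POST the payload and return the generated text."""
    body = json.dumps(build_kobold_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["results"][0]["text"]


# Usage (requires a running KoboldCpp instance):
# print(kobold_generate("The old lighthouse keeper looked out at the storm and"))
```

No template is applied here: the prompt string is what the model sees, so the output continues the prose instead of answering it.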

llama.cpp’s /completion endpoint also accepts raw prompts with no template enforcement. If you send a raw string, it generates a raw continuation. This makes it a viable alternative for developers scripting story generation pipelines.
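A story pipeline on top of /completion can be as small as a loop that feeds the growing text back in. The sketch below assumes llama-server on its default port 8080. The names `llama_completion` and `write_story` are invented for this example, and a real pipeline would also trim the story to fit the context window.

```python
import json
import urllib.request


def llama_completion(prompt, n_predict=200, url="http://localhost:8080/completion"):
    """Request a raw continuation from llama-server and return its text."""
    payload = {"prompt": prompt, "n_predict": n_predict, "temperature": 0.75}
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]


def write_story(opening, generate=llama_completion, rounds=3):
    """Extend the story by repeatedly continuing everything written so far."""
    story = opening
    for _ in range(rounds):
        story += generate(story)
    return story


# Usage (requires a running llama-server):
# print(write_story("The old lighthouse keeper looked out at the storm and"))
```

Passing the generator in as a parameter keeps the loop testable without a server and makes it easy to swap in a different backend.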

Context Length and Memory Management

All three runners support long context in the underlying model. Setting context in each:

# KoboldCpp GUI: Context Size slider in launcher
# KoboldCpp CLI:
koboldcpp --model model.gguf --contextsize 32768

# Ollama via Modelfile:
# PARAMETER num_ctx 32768

# llama-server:
llama-server --model model.gguf -c 32768

Where KoboldCpp meaningfully differentiates is in what happens when context fills up. KoboldCpp’s built-in UI includes context management tools: you can pin specific entries (character descriptions, scene-setting passages) so they’re never trimmed, and configure how older content rolls off. For a 10,000-word roleplay session, this matters.
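The pin-and-roll-off behavior is easy to approximate when you script against the other runners yourself. This is a sketch of the general idea, not KoboldCpp’s actual algorithm. The function name `build_context` is hypothetical, and the word-count tokenizer is a crude stand-in for a real token counter.

```python
def build_context(pinned, history, budget, count_tokens=lambda s: len(s.split())):
    """Keep all pinned entries, then as much recent history as the budget allows."""
    used = sum(count_tokens(p) for p in pinned)
    kept = []
    # Walk backwards from the newest passage so the oldest text rolls off first.
    for passage in reversed(history):
        cost = count_tokens(passage)
        if used + cost > budget:
            break
        kept.append(passage)
        used += cost
    # Restore chronological order for the kept passages.
    return list(pinned) + kept[::-1]


pinned = ["Mara is a lighthouse keeper."]
history = ["one two three", "four five", "six seven eight nine"]
print(build_context(pinned, history, budget=12))
```

Pinned entries are charged against the budget first, so a large character sheet directly reduces how much recent story the model can see.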

World Info (lorebooks) work alongside this: you define keyword-triggered entries — character names, place names, faction names — that KoboldCpp automatically injects into the prompt when those terms appear, without consuming context permanently. It’s a compact way to maintain story consistency across sessions that exceed the model’s context window.
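Keyword-triggered injection can be sketched in a few lines. The example assumes a lorebook stored as a list of (keywords, entry) pairs and plain case-insensitive substring matching. The function name `active_entries` is made up, and KoboldCpp’s real World Info has more options than this.

```python
def active_entries(lorebook, recent_text):
    """Return lorebook entries whose keywords appear in the recent text."""
    text = recent_text.lower()
    return [
        entry
        for keywords, entry in lorebook
        if any(key.lower() in text for key in keywords)
    ]


lorebook = [
    (("Mara",), "Mara is the keeper of the Greywater lighthouse."),
    (("Greywater", "the lighthouse"), "Greywater stands on a tidal island."),
    (("Iron Guild",), "The Iron Guild controls shipping in the bay."),
]
recent = "Mara climbed the stairs of the lighthouse."
print(active_entries(lorebook, recent))
```

Only the matching entries are added to the prompt for that turn, which is why a large lorebook costs nothing until a scene actually mentions its keywords.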

Ollama and llama.cpp raw have no equivalent feature. You’re managing context yourself.

The playbook article on context windows covers the underlying tradeoffs between 8k, 32k, and 128k contexts — worth reading before you commit to a context size.

Model Download and Loading UX

For writers who want to start quickly:

Ollama has the best model management UX by far. ollama pull llama3.1:8b downloads a pre-quantized model from the built-in library and makes it available immediately. Cross-platform (Mac, Windows, Linux).

KoboldCpp requires you to download GGUF files manually from Hugging Face or another source, then point the launcher at the file path. More friction upfront, but once you have the model you have full control over every quantization detail. The GUI launcher on Windows handles GPU layer configuration, context size, and backend selection (CUDA, ROCm, Vulkan, CPU) with dropdowns rather than flags.

llama.cpp is the lowest-level option: download or build the binary, download the GGUF file separately, pass both paths on the command line. No model management, no library, no discovery UI. If you’re not comfortable with a terminal this is a barrier.

The quantization guide explains what Q4_K_M vs Q8_0 means for quality and memory, which matters when you’re manually choosing GGUF variants from Hugging Face.

SillyTavern and Frontend Integration

SillyTavern is the most commonly used frontend for roleplay and creative writing with local LLMs. It’s worth noting how each backend integrates.

KoboldCpp: native KoboldAI API at http://localhost:5001. SillyTavern has first-class support for this endpoint, including support for KoboldCpp-specific features like sampler settings passed directly from SillyTavern’s UI. This is the recommended stack for local creative writing.

Ollama: supported via OpenAI-compatible endpoint. Works in SillyTavern, but sampler options are limited to what Ollama’s API exposes. You lose DRY, XTC, and mirostat control from the SillyTavern side.

llama.cpp: exposes an OpenAI-compatible endpoint at http://localhost:8080. Same situation as Ollama — works as a backend but the extra sampler capabilities require direct API calls, not SillyTavern configuration.

When NOT to Use Each Tool

Don’t use KoboldCpp if you’re building a developer application that needs a clean OpenAI-compatible REST API. KoboldCpp exposes one, but it’s not the primary design goal, and running a KoboldCpp server in a production environment is more awkward than Ollama. The AGPLv3 license also has copyleft implications if you’re distributing software that links against it — check whether that affects your use case.

Don’t use Ollama if you care about sampler control, raw prompt continuation, or story-specific features. Ollama is excellent at what it does; creative writing just isn’t what it does. Forcing fiction workflows through Ollama’s chat-template architecture is friction you don’t need when KoboldCpp or llama.cpp solve it directly.

Don’t use llama.cpp raw if you’re not comfortable writing your own request formatting, prompt management, and context trimming code. The power is real; the tooling for non-developers is not there. For technical writers or game developers comfortable with a Python script, it’s excellent. For everyone else, KoboldCpp gives you most of the same control with a working UI.

Full Feature Comparison

	KoboldCpp v1.114.1	Ollama v0.24.0	llama.cpp (b9145)
Sampler: DRY	Yes	No	Yes (via /completion)
Sampler: Mirostat	Yes (v1 + v2)	No	Yes
Sampler: XTC	Yes	No	Yes
Sampler: CFG	Yes	No	No (removed in recent builds)
Raw continuation	Yes	Via `raw: true` or a custom Modelfile template	Yes (/completion endpoint)
Story mode UI	Yes	No	No
World Info / lorebooks	Yes	No	No
Branching stories	Yes	No	No
Speech-to-text (Whisper)	Built-in	No	No
TTS output	Built-in	No	No
SillyTavern (native)	KoboldAI endpoint	OpenAI endpoint	OpenAI endpoint
Model management	Manual GGUF	`ollama pull`	Manual GGUF
GPU backends	CUDA, ROCm, Vulkan, Metal, CPU	CUDA, ROCm, Metal, CPU	CUDA, ROCm, Vulkan, Metal, CPU
Multi-modal	Limited	Yes (LLaVA etc.)	Yes
License	AGPLv3	MIT	MIT

Frequently Asked Questions

Can I use KoboldCpp with SillyTavern for free? Yes. Both are open-source and free to use. Run KoboldCpp at localhost:5001, connect SillyTavern to the KoboldAI endpoint, and you have a fully local, private setup. No API keys or subscriptions required.

Does Ollama support DRY or mirostat for creative writing? Not through its standard API. Ollama exposes temperature, top-k, top-p, and repeat_penalty, but DRY and mirostat are not available as parameters. If those samplers matter to your workflow, use KoboldCpp or llama-server directly instead.

Is KoboldCpp’s AGPLv3 license a problem for personal use? For personal use — running it on your own machine for writing or roleplay — AGPLv3 is not a problem at all. The copyleft provisions only matter if you’re distributing software that incorporates KoboldCpp’s code. The full license text is on GitHub if you need the specifics.

Which runner handles non-English text best? All three use the same underlying GGUF tokenizer, so tokenizer quality depends on the model, not the runner. KoboldCpp includes a tokenizer-correct mode that helps with non-English languages by counting tokens accurately for context management. For Japanese, Chinese, or multilingual fiction, model selection matters more than runner choice.

Can I switch from Ollama to KoboldCpp without re-downloading models? Not directly. Ollama stores model weights as blobs with hashed filenames inside its own model directory, so there is no obvious model.gguf file to point KoboldCpp at. The weights themselves are the same as in a GGUF download; Ollama just wraps and names them differently. Tools exist to extract Ollama’s internal GGUF files if needed, but re-downloading from Hugging Face is usually simpler.
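If you do want to locate the weights Ollama already downloaded, the sketch below shows the general approach. It rests on assumptions about Ollama’s on-disk layout that may change between versions: a JSON manifest whose layers list includes one entry with media type application/vnd.ollama.image.model, and a blobs directory where files are named after the digest with the colon replaced by a hyphen. The function names are hypothetical. The four-byte GGUF magic check is standard for the file format.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"


def is_gguf(path):
    """Check whether a file starts with the GGUF magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == GGUF_MAGIC


def model_blob_path(manifest, models_dir):
    """Find the weights blob referenced by a parsed manifest (assumed layout)."""
    for layer in manifest["layers"]:
        if layer["mediaType"] == "application/vnd.ollama.image.model":
            # Assumed naming: digest "sha256:abc..." is stored as "sha256-abc...".
            return Path(models_dir) / "blobs" / layer["digest"].replace(":", "-")
    return None


manifest = {
    "layers": [
        {"mediaType": "application/vnd.ollama.image.model", "digest": "sha256:abc123"}
    ]
}
print(model_blob_path(manifest, "/path/to/ollama/models"))
```

Run is_gguf on the resulting path before pointing KoboldCpp at it. If the check fails, fall back to downloading the GGUF from Hugging Face.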
