May 29, 2026

Local LLM Context Window Guide 2026: 8k, 32k, 128k

By AIFoss · 15 min read

TL;DR: Context window size has a direct, linear cost in VRAM — and most tasks don’t need as much as you think. An 8k window handles the vast majority of chat, coding, and single-document use cases. 32k is the practical ceiling for mid-range GPUs. 128k is technically possible on high-VRAM cards but comes with real quality degradation at scale. Choosing the right window size is a better optimization than buying a bigger GPU.

	8k context	32k context	128k context
Best for	Chat, code completion, single files	Multi-turn research, long code review, medium docs	Full codebase analysis, book-length docs
VRAM overhead (Llama 3.1 8B Q4_K_M)	+~1 GB	+~4 GB	+~16 GB
Generation speed (RTX 3090)	~45 tok/s	~35–40 tok/s	~10–20 tok/s (partial CPU offload likely)
The catch	Truncates long inputs silently	Needs 12–16 GB GPU for 13B+ models	Quality degrades mid-context; expensive

Honest take: For most developers running local LLMs, 16k–32k is the right default: enough for real work, manageable on a 12–16 GB GPU. Reserve 128k for specific use cases where you’ve confirmed the model actually uses that context well.

What the Context Window Actually Is

A context window is the maximum number of tokens a model can attend to at once: system prompt, chat history, injected documents, the new input, and the tokens it generates in response, all combined. Everything outside this window doesn’t exist to the model.

Token count matters more than word count. In English, 1 token is roughly 0.75 words; code tokenizes less efficiently, often around 0.5 words per token, so the same amount of text costs more tokens. Some practical reference points:

8k tokens ≈ 6,000 words ≈ a typical short story or roughly 500–800 lines of code
32k tokens ≈ 24,000 words ≈ a medium research paper or roughly 2,000–3,000 lines of code
128k tokens ≈ 96,000 words ≈ a short novel or a large multi-file codebase
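
As a rough illustration of these ratios, the sketch below estimates a token count from a word count. The function names are hypothetical and the two ratios are simply the rules of thumb quoted above, not tokenizer output; a real tokenizer will give different numbers for any specific text.

```python
import math

# Rule-of-thumb ratios from the text above (assumptions, not tokenizer output).
WORDS_PER_TOKEN = {"english": 0.75, "code": 0.5}

def estimate_tokens(text: str, kind: str = "english") -> int:
    """Estimate tokens by counting whitespace-separated words."""
    words = len(text.split())
    return math.ceil(words / WORDS_PER_TOKEN[kind])

def fits_in_window(text: str, num_ctx: int, kind: str = "english") -> bool:
    """True if the estimated token count is within the context window."""
    return estimate_tokens(text, kind) <= num_ctx
```

An estimate like this is only good enough for deciding which window size to try; for hard limits, count tokens with the model's own tokenizer.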

The thing most people don’t realize: the model doesn’t remember anything beyond the current context window. A 5-hour chat session that overflows 8k doesn’t cause a graceful summary — it silently drops the oldest messages. Knowing your token budget prevents that from biting you mid-conversation.
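
To make the silent drop concrete, here is a minimal sketch of oldest-first truncation. It is an illustration of the general behavior, not the actual logic of any particular runtime; the function names, the plain-string message format, and the word-based token estimate are all assumptions.

```python
import math

def rough_tokens(text):
    """Hypothetical token estimate: about 0.75 words per token."""
    return math.ceil(len(text.split()) / 0.75)

def truncate_history(messages, num_ctx, count_tokens=rough_tokens):
    """Keep the newest messages that fit in num_ctx, dropping the oldest first.

    The first message is assumed to be the system prompt and is always kept.
    Returns the surviving messages and the number of messages dropped.
    """
    system, rest = messages[0], messages[1:]
    budget = num_ctx - count_tokens(system)
    kept = []
    for msg in reversed(rest):      # walk from newest to oldest
        cost = count_tokens(msg)
        if cost > budget:
            break                   # everything older than this is dropped
        kept.append(msg)
        budget -= cost
    kept.reverse()                  # restore chronological order
    return [system] + kept, len(rest) - len(kept)
```

Returning the dropped count is the point of the design: a caller can warn the user instead of letting early messages vanish unnoticed.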

The KV Cache: Why Context Eats VRAM

Every token you add to the context window consumes GPU memory, not through the model weights, which are fixed, but through the KV cache (key-value cache). Attention at each layer needs the key and value vectors of every previous token; the KV cache stores those vectors so the model doesn’t recompute them on every generation step.

The memory cost scales linearly with context length:

KV cache size = 2 × num_layers × num_kv_heads × head_dim × seq_len × dtype_bytes

For Llama 3.1 8B (32 layers, 8 KV heads, 128 head dim, float16):

Context	KV cache	Model weights (Q4_K_M)	Total VRAM
2k (Ollama default)	~0.25 GB	~4.7 GB	~5.0 GB
8k	~1.0 GB	~4.7 GB	~5.7 GB
16k	~2.0 GB	~4.7 GB	~6.7 GB
32k	~4.0 GB	~4.7 GB	~8.7 GB
64k	~8.0 GB	~4.7 GB	~12.7 GB
128k	~16.0 GB	~4.7 GB	~20.7 GB

This is why an 8 GB GPU can run Llama 3.1 8B at 8k context fine (5.7 GB total) but runs out of memory trying to push 32k. The model weights didn’t change — the KV cache ate your VRAM.
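
The formula and table above can be reproduced with a few lines of Python. This is a sketch under the stated Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128, float16); the function name is hypothetical, and the 4.7 GB weight figure is taken from the table rather than measured.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    # Factor of 2: one key vector and one value vector per token, head, and layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

GIB = 1024 ** 3
WEIGHTS_GB = 4.7  # Q4_K_M weight size quoted in the table (assumed constant)

for ctx in (2048, 8192, 16384, 32768, 65536, 131072):
    kv_gib = kv_cache_bytes(32, 8, 128, ctx) / GIB
    print(f"{ctx:>7} tokens: KV cache {kv_gib:5.2f} GiB, total ~{WEIGHTS_GB + kv_gib:.1f} GB")
```

Each token costs 128 KiB of cache for this model, which is why the KV column doubles every time the context doubles while the weights column never moves.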

For larger models, the KV cache grows proportionally. A 32B model has more layers and wider attention, so its KV cache at 32k context can exceed 12 GB by itself. An RTX 4090 with 24 GB handles a 32B model at 8k context, but hits the wall around 32k. For 128k with a 32B model, you’re looking at NVIDIA A100-class hardware or multi-GPU setups.

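Flash attention and KV cache quantization cut this cost substantially. Setting OLLAMA_FLASH_ATTENTION=1 avoids materializing the full attention matrix, which shrinks the working memory attention needs at long context, and Ollama requires it before it will quantize the KV cache (OLLAMA_KV_CACHE_TYPE=q8_0 stores the cache in 8 bits instead of 16). Expect context-related VRAM savings in the 30–50% range on Ampere and newer GPUs (RTX 30 series and later). With both enabled on a recent llama.cpp or Ollama build, you can roughly double the effective context window before running out of memory, making a 128k run feasible on hardware that would normally top out at 64k.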
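
To see why a smaller cache element roughly doubles the reachable context, the sketch below solves the KV cache formula for seq_len given a VRAM budget. The function name and the 0.5 GiB overhead allowance are assumptions for illustration, and an 8-bit cache is treated as exactly 1 byte per element, which ignores the small per-block overhead of real quantized formats.

```python
def max_context_tokens(vram_gib, weights_gib, bytes_per_element,
                       num_layers=32, num_kv_heads=8, head_dim=128,
                       overhead_gib=0.5):
    """Largest seq_len whose KV cache fits in the VRAM left after the weights.

    Defaults describe Llama 3.1 8B; overhead_gib is a hypothetical allowance
    for compute buffers and the CUDA context.
    """
    free_bytes = (vram_gib - weights_gib - overhead_gib) * 1024 ** 3
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return max(0, int(free_bytes // per_token))

for label, size in (("float16", 2), ("8-bit", 1)):
    tokens = max_context_tokens(vram_gib=8, weights_gib=4.7, bytes_per_element=size)
    print(f"8 GB card, {label} cache: about {tokens} tokens")
```

Halving the bytes per element halves the per-token cost, so the same free VRAM holds about twice as many tokens.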

The 8k Sweet Spot

For the majority of local LLM use cases, 8k context is genuinely enough:

Chat conversations: even long sessions rarely exceed 4k tokens of actual meaningful exchange before the early context stops being relevant anyway
Code completion and review: most individual files are under 5k tokens; reviewing a single function or class is typically 1k–3k tokens
Single document Q&A: a 5-page PDF, a README, a blog post — all comfortably within 8k
RAG pipelines: if you’re using retrieval, you’re injecting only the top 3–10 chunks into context, not the full document set. 8k is enough for the retrieved context plus the system prompt

The hidden advantage: at 8k, you stay comfortably on GPU with smaller cards. A mid-range RTX 4070 Ti Super (16 GB) runs a 13B Q4_K_M model at 8k context with headroom to spare, and generation stays above 40 tokens/second.

Ollama’s default context is set low (2048 tokens in most versions) to avoid unexpected out-of-memory errors. Bumping to 8192 is the first configuration change that actually improves usability without meaningful VRAM cost.

# Permanently in a Modelfile
FROM llama3.1:8b
PARAMETER num_ctx 8192

# One-off inside an interactive ollama run session
ollama run llama3.1:8b
>>> /set parameter num_ctx 8192

# Via the REST API
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "options": {"num_ctx": 8192},
  "prompt": "Summarize this code..."
}'

When 32k Makes Sense

There are specific workflows where 32k pays off:

Multi-turn research conversations: you’re asking follow-up questions about a paper and want the model to hold the full prior exchange in context — not a 3-message summary of it. At 8k, a long research session overwrites its own early context. At 32k, you can go 20–40 exchanges deep before anything drops.

Long code file review: a 600-line Python file with docstrings and comments is ~6k–10k tokens. You want the full file in context while you ask questions about it, not chunks. 32k gives you room for the file plus a few rounds of Q&A.

Document analysis across multiple pages: a 30-page technical specification or contract runs 15k–25k tokens. 8k forces chunking; 32k lets you ask about relationships between different sections in a single pass.

Agentic coding loops (Aider, Cline): these tools send the full file list, relevant files, and conversation history in each request. Context grows fast. A 32k window allows multi-file edits without hitting the ceiling mid-session.
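
The "how deep can a session go" question is simple arithmetic, sketched below. The function name and the session figures (a 500-token system prompt, an 8,000-token file, about 800 tokens per question-and-answer round) are hypothetical numbers chosen for illustration, not measurements.

```python
def exchanges_before_overflow(num_ctx, fixed_tokens, tokens_per_exchange):
    """How many question/answer rounds fit after the fixed content is loaded."""
    remaining = num_ctx - fixed_tokens
    if remaining <= 0:
        return 0  # the fixed content alone already overflows the window
    return remaining // tokens_per_exchange

# Hypothetical session: 500-token system prompt plus an 8,000-token source file,
# with about 800 tokens per question-plus-answer round.
for window in (8192, 32768):
    rounds = exchanges_before_overflow(window, 500 + 8000, 800)
    print(f"{window} tokens: {rounds} rounds before older content must be dropped")
```

With these assumed numbers the file does not even fit at 8k, while 32k leaves room for about 30 rounds, in line with the 20–40 exchange range mentioned above.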

The hardware minimum for comfortable 32k use with a 13B model: 16 GB VRAM. With flash attention enabled, you can push 32k on a 12 GB card, but generation speed drops as the KV cache grows.

128k: Worth It or Hype?

128k context support is now advertised by most frontier local models — Llama 3.1 8B and 70B, Qwen2.5 7B through 72B, Gemma 3 12B, Mistral models. The capability is real. But three problems limit its practical utility.

Problem 1: VRAM requirements are prohibitive. As the table above shows, running Llama 3.1 8B at 128k context needs ~21 GB of VRAM for that model alone. The 8B model is the small option. Running a 70B model at 128k context requires 80+ GB of VRAM — which means multi-GPU or cloud. For cloud GPU rental, RunPod has H100 80GB instances that handle this, but you’re now spending money.

Problem 2: Generation speed collapses. Even if you have the VRAM, processing 100k+ tokens in the KV cache during generation creates severe latency. At 128k context, even an RTX 3090 drops to 10–20 tokens/second for an 8B model — compared to 45+ tokens/second at 8k context. For interactive use, this is painful.

Problem 3: Quality degrades at scale (see next section).

That said, 128k is genuinely useful for specific narrow cases: analyzing an entire novel or research paper as a single unit, processing a large codebase in one pass for architecture analysis, or loading the full context of a long-running agent session. The quality is better than 8k for these tasks even if it’s not perfect.

The “Lost in the Middle” Problem

The most important thing most context window guides don’t mention: a model’s ability to use all available context is not uniform. Research from Stanford, UC Berkeley, and Samaya AI documented the “lost in the middle” phenomenon: accuracy follows a U-shaped curve over the position of the relevant information, with models reliably using material at the beginning and end of context and often missing material in the middle.

The effect is material. In multi-document Q&A benchmarks, accuracy dropped by more than 20% when the relevant information was placed in the middle of the context versus at the start or end. For a 128k window, that leaves a very large middle region that many models use far less reliably.

A commonly cited architectural explanation: transformer attention uses causal masking (tokens only attend to prior tokens) combined with Rotary Position Embedding (RoPE), which introduces a long-term decay effect that de-emphasizes middle-range tokens. It’s not a bug in one model; it appears to be a structural property of the attention mechanism.

The practical implication: a 32k context window you use well often outperforms a 128k context window where your relevant content sits in the middle. When stuffing documents into context, put the most critical information at the start or end of the context window, not buried in the middle.
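
One way to apply this when stuffing several documents into a prompt is to reorder them so the strongest material sits at the edges. The sketch below is one simple strategy, not a standard algorithm from any library; the function name is hypothetical, and it assumes the input is already sorted with the most relevant document first.

```python
def reorder_for_long_context(docs_by_relevance):
    """Place the most relevant documents at the edges of the context.

    Input: documents sorted most relevant first.
    Output: same documents, with the least relevant ones in the middle.
    """
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        if i % 2 == 0:
            front.append(doc)   # ranks 1, 3, 5, ... fill in from the start
        else:
            back.append(doc)    # ranks 2, 4, 6, ... fill in from the end
    return front + back[::-1]

print(reorder_for_long_context(["rank1", "rank2", "rank3", "rank4", "rank5"]))
# ['rank1', 'rank3', 'rank5', 'rank4', 'rank2']
```

The top-ranked document lands first, the second-ranked lands last, and the weakest material ends up in the middle where it is least likely to be used anyway.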

Newer models like Qwen2.5 are specifically trained to mitigate this, with architectural changes and training data weighted toward long-context tasks. They perform better at 128k than older models — but the fundamental U-shape doesn’t disappear entirely.

RAG vs Long Context: Choosing the Right Tool

The common mistake: treating “big context window” as a replacement for proper retrieval. They solve different problems.

Use long context when:

You have a single, static document you need to reason across holistically
The task requires understanding relationships between sections of the same document
You’re doing code review where the entire file needs to be present
The document is small enough to fit in 32k or less

Use RAG when:

You have a large document set (10+ documents, or documents too large for the context window)
The documents change over time (RAG indexes update; context windows are request-scoped)
You need to cite specific passages accurately (retrieval surfaces the exact chunk)
You’re building multi-user or production systems where per-request memory cost matters

Academic comparisons such as Long Context vs. RAG for LLMs report mixed results on long-context benchmarks: which approach wins depends on the task and on how retrieval is done, and a full-context model that “had access” to all the information does not automatically give the better answer. A retrieval step that surfaces the relevant passages can outweigh the theoretical advantage of seeing everything at once.

For local setups specifically: RAG is cheaper to run at scale. Injecting 3 retrieved chunks at 8k context is dramatically less expensive than loading an entire document set into 128k context on every query. The RAG architecture deep dive covers the chunking and retrieval design decisions in detail.

The hybrid that works well in practice: use 16k–32k context for the actual LLM, but use RAG to populate that window efficiently with only the most relevant content.
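
The hybrid approach comes down to filling a fixed token budget with the best retrieved chunks. Below is a minimal greedy sketch; the function name, the (score, text) tuple format, and the token counter passed in by the caller are assumptions for illustration, and a real pipeline would take scores from its own retriever.

```python
def pack_chunks(scored_chunks, num_ctx, reserved_tokens, count_tokens):
    """Greedily select the highest-scoring chunks that fit in the window.

    scored_chunks: iterable of (score, text) pairs from a retriever.
    reserved_tokens: room set aside for system prompt, question, and answer.
    """
    budget = num_ctx - reserved_tokens
    selected = []
    for score, chunk in sorted(scored_chunks, key=lambda sc: sc[0], reverse=True):
        cost = count_tokens(chunk)
        if cost <= budget:          # skip chunks that no longer fit
            selected.append(chunk)
            budget -= cost
    return selected
```

Reserving space for the answer up front matters: a window packed to the last token with retrieved text leaves the model nowhere to write its response.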

Configuring Context in Ollama and llama.cpp

Ollama

Depending on the version, Ollama uses either a fixed default context or one chosen from available VRAM; in both cases the default is conservative. Configure explicitly:

# Global default via environment variable (set before starting ollama serve)
export OLLAMA_CONTEXT_LENGTH=16384
export OLLAMA_FLASH_ATTENTION=1
ollama serve

# Per-model Modelfile
FROM qwen2.5:14b
PARAMETER num_ctx 32768

# Save the Modelfile as a custom model
ollama create qwen14b-32k -f Modelfile

# API request with per-call override
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "options": {
    "num_ctx": 16384,
    "num_predict": 1024
  },
  "messages": [{"role": "user", "content": "Analyze this codebase..."}]
}'
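
The same per-call override can be sent from Python using only the standard library. This is a sketch that mirrors the curl example above, assuming the default local endpoint; the helper names are hypothetical, and "stream" is set to false so the server returns a single JSON object.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # default local endpoint, as in the curl example

def build_chat_request(model, prompt, num_ctx=16384, num_predict=1024):
    """Build a POST request that overrides num_ctx for this call only."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "options": {"num_ctx": num_ctx, "num_predict": num_predict},
        "stream": False,  # one JSON response instead of a token stream
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(model, prompt, **options):
    """Send the request; needs a running Ollama server with the model pulled."""
    with urllib.request.urlopen(build_chat_request(model, prompt, **options)) as resp:
        return json.loads(resp.read())["message"]["content"]
```

Calling chat("llama3.1:8b", "Analyze this codebase...", num_ctx=32768) would then request a 32k window for that one call, leaving the global default untouched.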

llama.cpp Server

# --ctx-size sets the context window
# --n-gpu-layers sets GPU offload (adjust to VRAM)
# --flash-attn takes on/off/auto in recent builds; older builds use the bare flag
./llama-server \
  --model models/llama-3.1-8b-q4_k_m.gguf \
  --ctx-size 32768 \
  --n-gpu-layers 99 \
  --flash-attn on \
  --port 8080

The --flash-attn flag is the single most impactful change for context-heavy workloads: enabling it in llama.cpp reduces context-related memory by roughly 40% on supported hardware and is required for quantizing the V cache (--cache-type-v), which directly translates to larger achievable context on the same GPU.

If the KV cache overflows VRAM, llama.cpp will offload it to CPU RAM. Generation speed drops from ~40 tok/s to 2–5 tok/s instantly. That cliff is unambiguous in practice — watch for the sudden slowdown as your signal that you’ve exceeded GPU capacity.

When NOT to Use Large Contexts

Long context windows are oversold as a general-purpose solution. Specific cases where you should use a smaller window instead:

Most chat interactions. A back-and-forth conversation about a coding problem rarely needs more than 4k–8k tokens. Setting num_ctx to 128k “just in case” wastes VRAM that could be used for a faster model.

When you need fast iteration. At 128k context on a single GPU, generation speed on a 13B model can drop below 10 tok/s. If you’re using the LLM for rapid back-and-forth editing, that latency destroys the workflow. Use a smaller context window and manage context manually.

Production systems with multiple concurrent users. KV cache per-session memory scales with both context length and concurrent requests. A 32k context window per session at 10 concurrent users needs 10× the per-session KV cache on your GPU. Context size management matters more than it does in single-user desktop setups.
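
To put numbers on the multi-user case, the sketch below multiplies the per-session KV cache by the number of sessions. It assumes the Llama 3.1 8B shape used earlier in this guide, a float16 cache, and the worst case where every session fills its window; the function names are hypothetical.

```python
def kv_cache_gib_for_sessions(num_ctx, sessions, num_layers=32, num_kv_heads=8,
                              head_dim=128, dtype_bytes=2):
    """Total KV cache in GiB if every session fills its whole window."""
    per_session = 2 * num_layers * num_kv_heads * head_dim * num_ctx * dtype_bytes
    return sessions * per_session / 1024 ** 3

def max_sessions(vram_gib, weights_gib, num_ctx, **model):
    """How many full-window sessions fit beside the weights (no other overhead)."""
    per_session = kv_cache_gib_for_sessions(num_ctx, 1, **model)
    return max(0, int((vram_gib - weights_gib) // per_session))

print(kv_cache_gib_for_sessions(32768, 10))   # 40.0 GiB for ten 32k sessions
print(max_sessions(24, 4.7, 32768))           # 4 sessions on a 24 GB card
print(max_sessions(24, 4.7, 8192))            # 19 sessions at 8k
```

Dropping the per-session window from 32k to 8k takes a 24 GB card from about 4 concurrent sessions to about 19 under these assumptions, which is why context size is the first knob to turn in multi-user deployments.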

When RAG is already solving your retrieval problem. If your AnythingLLM or similar RAG system is already finding the right passages for queries, there’s no benefit to also giving the model a 128k window. The retrieval layer already identified what’s relevant.

Frequently Asked Questions

What’s the actual default context window in Ollama? Ollama’s default is 2,048 tokens in most versions, though some model definitions override this. This is deliberately low to prevent unexpected out-of-memory errors on first run. Bumping it to 8,192 or 16,384 via OLLAMA_CONTEXT_LENGTH is one of the first configuration changes worth making on any system with 8+ GB VRAM.

Does a larger context window make the model smarter? No. Context window size only affects how much input the model can see at once — it has no bearing on the model’s reasoning ability or knowledge. A 7B model with a 128k context window is still a 7B model. For most tasks, spending VRAM on a larger model (13B vs 7B) produces better results than spending it on a larger context window with a smaller model.

Can I run 128k context on an 8 GB GPU? Not reliably. With flash attention and aggressive KV cache quantization, you might push 32k–64k on an 8 GB card with a small model (7B Q4_K_M), but 128k at float16 needs ~20 GB of VRAM for a model like Llama 3.1 8B: the KV cache alone is about 16 GB, on top of ~4.7 GB of weights.

Why does my model sometimes ignore information I gave it in context? This is the “lost in the middle” problem: information placed far from the start or end of a long context window is often under-utilized. If you’re injecting critical documents, put the most important sections at the beginning or end of the context, not buried in the middle. This affects transformer models to varying degrees, regardless of vendor claims about context length.

Which models handle long context best locally? As of mid-2026, Qwen2.5 (7B through 32B) and Llama 3.1/3.3 are the most reliable local models for long-context tasks. Qwen2.5 was specifically trained with long-context benchmarks and shows less degradation past 32k than older models. For practical use, the Qwen2.5 14B at 32k context on a 24 GB GPU is the best price/performance point for long-context work without going to multi-GPU setups.

Recommended Gear

RTX 4090 24GB: handles 32B models at 8k context and smaller models at 32k and beyond; the consumer ceiling for local LLM work
RTX 4070 Ti Super 16GB: 16 GB VRAM; the practical sweet spot for 13B models at 16k–32k context
RTX 3090 24GB: older gen but 24 GB VRAM; strong value for long-context workloads on the used market
NVIDIA A100 80GB: data center card, not for home use; the class of hardware needed for 128k context on 70B models
