May 21, 2026

AnythingLLM RAG Setup: Chat With Your Documents Offline

By AIFoss · 12 min read


Most document chat tools either leak your data to a cloud API or require so much infrastructure wiring that you’d rather just grep the PDF yourself. AnythingLLM is the exception — it ships as a single app with a built-in vector database, handles chunking and embedding automatically, and gives you a ChatGPT-style interface over your own documents in under 20 minutes.

The setup below uses AnythingLLM v1.12.1 (released April 22, 2026) with Ollama as the local LLM and embedding backend. If your hardware is modest or you want to pair it with a cloud provider instead, there’s a section for that too.

What AnythingLLM Actually Does

RAG — retrieval-augmented generation — means your LLM doesn’t have to memorize your documents. Instead, when you ask a question, the app searches a vector index of your documents for the most relevant chunks, then sends those chunks plus your question to the model. The model reads the context and answers with citations.
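To make that last step concrete, here is a minimal sketch of how retrieved chunks and a question might be combined into a single prompt. The function name build_rag_prompt and the prompt wording are hypothetical; this is not AnythingLLM’s actual template, just the general shape of the idea.

```shell
# Hypothetical helper: combine retrieved chunks and a question into one prompt.
# Usage: build_rag_prompt "question" "chunk 1" "chunk 2" ...
build_rag_prompt() (
  question=$1
  shift
  echo "Answer using only the context below. Cite sources as [n]."
  n=1
  for chunk in "$@"; do
    printf '[%d] %s\n' "$n" "$chunk"   # number each chunk so it can be cited
    n=$((n + 1))
  done
  printf 'Question: %s\n' "$question"
)

build_rag_prompt "What is the notice period?" \
  "Either party may cancel with 30 days written notice." \
  "Fees are billed annually in advance."
```

The numbered chunks are what make citations possible: the model can point back to [1] or [2] instead of answering from memory.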

AnythingLLM wraps that entire pipeline — ingest, chunk, embed, store, retrieve, generate — into a single self-hosted web app. You upload files, it does the rest. No Python scripts, no FAISS index management, no custom LangChain chains.

License: MIT. Source on GitHub (60.4k stars as of May 2026).

What You Need

The app itself

AnythingLLM’s own footprint is modest:

Requirement	Minimum
RAM	2 GB (app alone)
CPU	2-core (any modern x86 or ARM)
Storage	5 GB (app + vector index)
OS	Windows 10+, macOS 12+, Ubuntu 20.04+

The catch is the LLM. If you’re connecting to OpenAI or Claude, your hardware doesn’t matter — inference happens in the cloud. If you want fully offline operation with a local model via Ollama, you need:

Model size	VRAM needed	Minimum RAM
3B (e.g., Phi-3 Mini)	4 GB GPU	8 GB system RAM
7–8B (e.g., Llama 3.1 8B, Mistral 7B)	8 GB GPU	16 GB system RAM
13B (e.g., Llama 2 13B)	12–16 GB GPU	24 GB system RAM
70B (e.g., Llama 3.3 70B Q4)	40–48 GB GPU	64 GB system RAM

CPU-only inference is possible, but expect it to be slow: a quantized 7B model typically manages only a few tokens per second on a desktop CPU, which is fine for document search and painful for long-form conversations. If you don’t have a capable GPU, RunPod lets you rent a GPU instance by the hour for heavier workloads; note that a rented instance is self-hosted, not offline.
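If you want to sanity-check a model that isn’t in the table, a rough rule of thumb is weight size (parameters times bits per weight, divided by 8) plus some headroom. The sketch below assumes about 20% overhead for the KV cache and buffers; that figure is an assumption, not a measurement, and real usage varies with context length and runtime.

```shell
# Rule-of-thumb estimate (an assumption, not a measurement):
#   weights_GB = params_in_billions * bits_per_weight / 8
#   total_GB   = weights_GB * 1.2   (about 20% extra for KV cache and buffers)
# Usage: estimate_vram_gb <params_in_billions> <bits_per_weight>
estimate_vram_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 * 1.2 }'
}

estimate_vram_gb 8 4    # 8B model, 4-bit quantization  -> 4.8
estimate_vram_gb 13 8   # 13B model, 8-bit quantization -> 15.6
estimate_vram_gb 70 4   # 70B model, 4-bit quantization -> 42.0
```

Those three results land inside the 8 GB, 12–16 GB, and 40–48 GB rows above, which suggests the table assumes quantized weights rather than full precision.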

Two Ways to Install

Option A: Desktop App

The easiest path. Download the installer for your platform from anythingllm.com — Windows (.exe), macOS (.dmg), or Linux (.AppImage). Open it, run through the 5-step setup wizard, done. The desktop app is single-user only but has no other limitations.

Option B: Docker (Recommended for Servers and Multi-User)

Docker unlocks multi-user workspaces, API access, and permission controls. It’s the better choice if you’re sharing the instance across a team or running it headless on a home server.

# Create a persistent storage directory
export STORAGE_LOCATION="$HOME/.anythingllm"
mkdir -p "$STORAGE_LOCATION"

# Create the .env file first; otherwise Docker creates a directory at that path
touch "$STORAGE_LOCATION/.env"

# Pull and run
docker pull mintplexlabs/anythingllm:master

docker run -d \
  -p 3001:3001 \
  --cap-add SYS_ADMIN \
  -v "$STORAGE_LOCATION":/app/server/storage \
  -v "$STORAGE_LOCATION/.env":/app/server/.env \
  -e STORAGE_DIR="/app/server/storage" \
  --name anythingllm \
  mintplexlabs/anythingllm:master

Open http://localhost:3001 in your browser. The first run shows an onboarding wizard.
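The container can take a little while to answer after docker run returns. A small retry helper like the sketch below saves you from refreshing the browser. The retry function is generic; the /api/ping path in the commented example is an assumption about the health endpoint, and any URL the app serves would work in its place.

```shell
# Generic retry helper: run a command up to <attempts> times,
# sleeping <delay_seconds> between failed attempts.
# Usage: retry <attempts> <delay_seconds> <command> [args...]
retry() (
  attempts=$1
  delay=$2
  shift 2
  i=1
  while :; do
    "$@" && exit 0                      # success: stop retrying
    [ "$i" -ge "$attempts" ] && exit 1  # out of attempts: give up
    i=$((i + 1))
    sleep "$delay"
  done
)

retry 3 0 true && echo "succeeded"
retry 2 0 false || echo "gave up after 2 attempts"

# Example (not run here): wait up to about a minute for the web app.
# retry 30 2 curl -fsS http://localhost:3001/api/ping
```

If the retry gives up, docker logs anythingllm is the next place to look.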

A few notes on the Docker flags:

--cap-add SYS_ADMIN is required for the web scraper feature. Drop it if you don’t need scraping.
The .env volume mount persists your API keys and configuration across container restarts.
There’s no separate database container — AnythingLLM uses SQLite and LanceDB embedded in the storage directory.

Step 1: Pick Your LLM Provider

The setup wizard asks you to select a provider. Your options:

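Local (Ollama), best for privacy: Select “Ollama” and enter http://host.docker.internal:11434 (Docker) or http://localhost:11434 (desktop). On Linux with plain Docker Engine, host.docker.internal only resolves if the container was started with --add-host=host.docker.internal:host-gateway. AnythingLLM will pull the model list from Ollama automatically.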

If Ollama isn’t installed yet, follow the Ollama setup guide first — it’s a 2-minute install.

Cloud providers (OpenAI, Anthropic Claude, Groq, etc.): Enter your API key. Your questions and the retrieved document chunks are sent to the provider’s servers, so this path is not offline. It’s the right call if you’re on a laptop with integrated graphics and still want capable responses.

Hybrid: You can change providers per workspace. Use Claude for complex research queries, Ollama for quick lookups on sensitive docs.
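For the Ollama option, the only thing that changes between the desktop app and Docker is the base URL. A small sketch, using a hypothetical helper named ollama_base_url, keeps the two cases straight; the commented curl call uses Ollama’s /api/tags endpoint to list the models it is serving.

```shell
# Pick the Ollama base URL for where AnythingLLM runs.
# Usage: ollama_base_url docker|desktop
ollama_base_url() {
  case "$1" in
    docker)  echo "http://host.docker.internal:11434" ;;
    desktop) echo "http://localhost:11434" ;;
    *)       echo "usage: ollama_base_url docker|desktop" >&2; return 1 ;;
  esac
}

ollama_base_url docker
ollama_base_url desktop

# Example (not run here): confirm Ollama is reachable from the host.
# curl -fsS "$(ollama_base_url desktop)/api/tags"
```

If that curl call fails on the host, AnythingLLM won’t be able to reach Ollama either, so it’s worth checking before you touch the wizard.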

Step 2: Set Up Your Embedding Model

This is the step most tutorials gloss over. The embedding model converts your document text into vectors — it’s entirely separate from the chat model and runs on every document upload. Getting this right matters.

For fully local setups, the recommended pairing is:

Provider: Ollama
Model: nomic-embed-text

Pull it first in your terminal:

ollama pull nomic-embed-text

In AnythingLLM’s Settings → Embedding Preference, set provider to “Ollama” and model to nomic-embed-text. This model was trained specifically for document retrieval and handles general English text well at 768 dimensions.

Critical rule: Don’t switch embedding models after you’ve uploaded documents. The stored vectors become incompatible and your searches will return garbage. Pick a model and stick with it, or plan to re-ingest everything if you change.
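One way to avoid switching by accident is to record the embedding model next to your documents and check it before each ingest. The sketch below is a hypothetical guard script, not an AnythingLLM feature; the function name and the record file are made up for illustration.

```shell
# Hypothetical guard: remember which embedding model a document set was
# ingested with, and fail if the configured model differs.
# Usage: check_embedding_model <model_name> <record_file>
check_embedding_model() (
  model=$1
  record=$2
  if [ ! -f "$record" ]; then
    printf '%s\n' "$model" > "$record"   # first ingest: record the model
    exit 0
  fi
  recorded=$(cat "$record")
  if [ "$recorded" != "$model" ]; then
    echo "embedding model changed: $recorded -> $model (re-ingest required)" >&2
    exit 1
  fi
)

workdir=$(mktemp -d)
record_file="$workdir/embedding-model"
check_embedding_model nomic-embed-text "$record_file" && echo "ok: recorded"
check_embedding_model nomic-embed-text "$record_file" && echo "ok: unchanged"
check_embedding_model text-embedding-3-small "$record_file" || echo "blocked"
rm -rf "$workdir"
```

The third call is the case the rule warns about: vectors from one model can’t be meaningfully compared with query vectors from another, so the safe move is to stop and re-ingest.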

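If you’re using OpenAI as your LLM, you can also use OpenAI’s text-embedding-3-small as the embedding model. It’s fast and accurate, and embedding requests are priced far lower per token than chat completions. Note that this sends your document text to OpenAI at ingest time, so it is not an option for offline setups.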

Step 3: Create a Workspace

A workspace is an isolated RAG context. Documents uploaded to Workspace A are not visible to Workspace B, and each workspace gets its own system prompt, model settings, and context window behavior.

Create one via New Workspace (top left). Give it a descriptive name: “Legal Contracts 2025” or “API Docs — Python SDK” beats “Workspace 1”.

Each workspace has its own settings (gear icon):

LLM Model: override the default for this workspace
Chat Mode: “Query” only retrieves from your documents; “Chat” uses the documents as context but also draws on model knowledge
Context Window: how many document chunks to include per query (default: 4–6; raise to 10–12 for models with large context windows like Claude or GPT-4o)

Use “Query” mode for strict document lookup (e.g., legal or medical docs where answers must come from the documents rather than the model’s general knowledge). Use “Chat” mode for research workflows where you want the model to synthesize across both the documents and its own knowledge.

Step 4: Upload Your Documents

Drag files into the document upload area, or use the embedded connectors for:

Local files: PDF, DOCX, TXT, Markdown, CSV, XLSX, PPTX, HTML, 50+ code file types
Web scraper: paste a URL, it fetches and parses the page
GitHub repo: pull all source files from a repository
YouTube: imports a video’s transcript as a document
Confluence: scrape a Confluence space directly

After upload, you see the files listed in the document panel. Click the toggle next to a file to “embed” it — this is when the chunking and embedding actually runs. You can upload files without embedding them, which lets you batch-select which ones go into a workspace.

Chunking defaults: Text is split into ~1000-character chunks with an overlap to preserve context across chunk boundaries. For most documents this works fine. Dense technical documentation or legal contracts with complex cross-references may benefit from reducing chunk size to 500–700 characters via Settings → Vector Database → Chunk Settings.
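To see what size and overlap actually do, here is a simplified fixed-width chunker. It is an illustration only: AnythingLLM’s own splitter may pick boundaries differently, and the chunk_text name and the tiny sizes in the example are made up so the output is easy to read.

```shell
# Simplified fixed-width chunker with overlap (illustration only).
# Reads text from stdin, prints one chunk per line.
# Usage: chunk_text <chunk_size> <overlap>
chunk_text() {
  awk -v size="$1" -v overlap="$2" '
    { text = text (NR > 1 ? " " : "") $0 }   # join input lines with spaces
    END {
      step = size - overlap
      if (step < 1) { print "overlap must be smaller than size" > "/dev/stderr"; exit 1 }
      for (start = 1; start <= length(text); start += step) {
        print substr(text, start, size)
        if (start + size > length(text)) break   # this chunk reached the end
      }
    }'
}

printf 'abcdefghij' | chunk_text 4 1
```

This prints abcd, defg, and ghij: each chunk repeats the last character of the previous one. Scale the same arithmetic up and a 2,500-character document at size 1000 with an assumed overlap of 200 becomes three chunks, while dropping to size 500 roughly doubles the chunk count and the embedding work.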

Watch the status bar — v1.12.1 added streamed embedding progress so you can see each chunk being processed in real time instead of staring at a spinner.

Step 5: Start Chatting

Once documents are embedded, open the workspace chat. Ask natural language questions:

What are the cancellation terms in the 2024 SaaS agreement?

AnythingLLM retrieves the relevant chunks and passes them to the LLM. Responses include citations showing which document and chunk the answer came from — click the citation to see the exact source text. This citation trail is what separates a trustworthy document query from a hallucination-prone general-purpose chat.

A few usage patterns that work well:

Code repos: embed the full source tree, then ask “where is the authentication middleware defined?” or “show me all uses of the Config class.”
Research papers: embed PDFs, ask “what did [Author 2023] say about transformer attention complexity?”
Runbooks: embed your team’s ops docs, ask “what’s the rollback procedure for the payments service?”
Contracts: use Query mode, ask “does this agreement include an auto-renewal clause?”
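If you’re running the Docker version, the same questions can be sent over the HTTP API instead of the chat window. The sketch below only builds the JSON body; the endpoint path, the Bearer token header, and the message and mode fields in the commented curl call are my reading of the developer API and should be checked against your instance’s API documentation. The workspace slug and the ANYTHINGLLM_API_KEY variable are placeholders.

```shell
# Build the JSON body for a chat request (single-line messages only).
# Usage: build_chat_payload <message> <query|chat>
build_chat_payload() (
  # Escape backslashes first, then double quotes, so the result is valid JSON.
  escaped=$(printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g')
  printf '{"message":"%s","mode":"%s"}\n' "$escaped" "$2"
)

build_chat_payload 'Does this agreement include an "auto-renewal" clause?' query

# Example (not run here); path, slug, and key variable are assumptions.
# curl -fsS http://localhost:3001/api/v1/workspace/legal-contracts/chat \
#   -H "Authorization: Bearer $ANYTHINGLLM_API_KEY" \
#   -H "Content-Type: application/json" \
#   -d "$(build_chat_payload 'What are the cancellation terms?' query)"
```

Passing query as the mode mirrors the Query setting in the workspace, which is the safer default for contract-style lookups.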

Tune Your RAG for Better Accuracy

The default settings work for general use, but if answers feel incomplete or off-topic, adjust these:

Top-K context chunks (Workspace Settings → Context Window):

Default: 4–6 chunks
For long-form summaries or detailed research: raise to 10–12
For precise factual lookup: lower to 2–3 to reduce noise

Chunk overlap (Settings → Vector Database):

Increase overlap if your documents have continuous numbered lists or tables where context bleeds across chunks
Decrease if retrieval speed matters more than completeness

Similarity threshold: If unrelated content keeps appearing in responses, raise the similarity threshold to require closer matches before including a chunk.
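The threshold and Top-K settings work together: the threshold drops weak matches, then Top-K caps how many of the survivors reach the model. Here is a minimal sketch of that selection step, using made-up scores and chunk names; it illustrates the idea and is not AnythingLLM’s retrieval code.

```shell
# Keep chunks whose similarity score meets a threshold, then take the top K.
# Input on stdin: one "<score> <chunk_id>" pair per line.
# Usage: select_chunks <threshold> <top_k>
select_chunks() {
  awk -v t="$1" '$1 + 0 >= t + 0' | sort -k1,1nr | head -n "$2"
}

printf '%s\n' \
  '0.82 contract.pdf#12' \
  '0.31 handbook.pdf#4' \
  '0.77 contract.pdf#13' \
  '0.64 faq.md#2' \
  '0.58 contract.pdf#40' |
  select_chunks 0.5 3
```

With a threshold of 0.5 and K of 3, the 0.31 match is filtered out and the 0.58 match is cut by the cap, leaving the 0.82, 0.77, and 0.64 chunks. Raising the threshold to 0.8 would leave a single chunk no matter how high K is set.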

What’s New in v1.12

The v1.12 release (April 2026) added features that push AnythingLLM past pure document chat:

Automatic Agent Mode: models that support native tool calling can now activate agents without requiring @agent prefixes. Llama 3.1 and above work well.
App Integrations: Gmail, Outlook, and Google Calendar can be connected as document sources — useful for turning your email threads into a searchable knowledge base.
Filesystem Agent: the agent can browse your host machine’s file system (scoped to an allowed directory) to pull in context from local files without explicit upload.
Telegram Bot: connect your AnythingLLM instance to Telegram for text, voice, and image queries from your phone.

The agent features are worth exploring once your RAG setup is stable, but they’re not required for basic document chat.

When NOT to Use AnythingLLM

Real-time or live data: AnythingLLM works from static snapshots. If you need to query a live database or current web search results, you need a different tool — either a custom agent pipeline or something like Open WebUI with tool use.

Very large document collections (500k+ pages): The embedded LanceDB is solid for typical use, but large-scale production ingestion pipelines want purpose-built vector infrastructure. At that scale, wire AnythingLLM to an external Qdrant or Weaviate instance instead.

Structured data queries: AnythingLLM isn’t a substitute for SQL. If you want to query a CSV for aggregate statistics (“total sales by region in Q3”), a database is faster and more accurate. AnythingLLM is for semantic search and synthesis, not structured lookups.
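For comparison, an aggregate like that is a few lines with a structured tool and gives an exact answer every time. The sketch below uses awk on a hypothetical CSV with region, quarter, and sales columns; the column layout and the sample rows are invented for illustration.

```shell
# Sum sales per region for one quarter from CSV on stdin.
# Expected columns (hypothetical): region,quarter,sales with a header row.
# Usage: sales_by_region <quarter>
sales_by_region() {
  awk -F, -v q="$1" '
    NR > 1 && $2 == q { total[$1] += $3 }
    END { for (region in total) printf "%s,%d\n", region, total[region] }
  ' | sort
}

printf '%s\n' \
  'region,quarter,sales' \
  'EMEA,Q3,1200' \
  'APAC,Q3,800' \
  'EMEA,Q3,300' \
  'EMEA,Q2,999' |
  sales_by_region Q3
```

This prints APAC,800 and EMEA,1500. A RAG pipeline would have to retrieve every relevant row and then do the arithmetic inside the model, which is exactly where it tends to go wrong.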

Minimal hardware without cloud fallback: CPU-only inference on a 7B model is slow enough to be frustrating for iterative research sessions. Either pair it with a cloud LLM provider or plan for GPU hardware. See the hardware guide on runaihome.com for GPU recommendations for local AI.

AnythingLLM vs. The Alternatives

Feature	AnythingLLM v1.12	Open WebUI (RAG)	PrivateGPT	Flowise
Setup difficulty	Low (wizard)	Medium	Medium	Medium-High
Built-in vector DB	Yes (LanceDB)	Yes (ChromaDB)	Yes	No (external)
Multi-user support	Yes (Docker)	Yes	No	Yes
Document formats	15+ types + connectors	PDF, TXT, MD	PDF, TXT	Plugin-based
Agent mode	Yes (v1.12)	Limited	No	Yes (visual)
API access	Yes	Yes	Yes	Yes
Local LLM support	Yes (Ollama, LM Studio, etc.)	Yes (Ollama)	Yes (llama.cpp)	Yes
License	MIT	Open WebUI License (BSD-3-based)	Apache 2.0	Apache 2.0

If you’ve already read the AnythingLLM review and want more context on how it stacks up against Open WebUI and PrivateGPT, the full three-way comparison has a deeper breakdown.

The short version: AnythingLLM is the best default choice if you want document chat without infrastructure work. Open WebUI is better if you primarily want a local ChatGPT interface and happen to also need light RAG. Flowise is better if you need to build custom pipelines with branching logic rather than simple Q&A.

Quickstart Checklist

Install Ollama and pull llama3.1:8b + nomic-embed-text
Start AnythingLLM (desktop or Docker on port 3001)
Set LLM provider → Ollama → llama3.1:8b
Set embedding provider → Ollama → nomic-embed-text
Create a workspace
Upload documents and click the embed toggle
Ask your first question in Query mode

The whole sequence takes 15–20 minutes on a machine with Ollama already running. If you hit the Ollama connection error in Docker, the fix is almost always replacing localhost with host.docker.internal in the Ollama URL field.
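Before starting the checklist, a quick preflight can tell you which tools are missing. This is a small sketch with a hypothetical helper name; it only checks that the commands are on your PATH, not that the services are running.

```shell
# Preflight check: print the names of required commands that are missing.
# Usage: missing_commands <cmd> [cmd...]
missing_commands() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
  done
}

missing=$(missing_commands ollama docker curl)
if [ -n "$missing" ]; then
  echo "missing:" $missing   # unquoted on purpose: joins the names on one line
else
  echo "all tools found"
fi
```

Once everything is found, ollama pull llama3.1:8b and ollama pull nomic-embed-text cover the first checklist item. Desktop users can drop docker from the list.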
