Jun 4, 2026

LocalGPT Setup 2026: Private Document Chat in 10 Minutes

By AIFoss · 12 min read

Tags: localgpt, rag, privacy, llama, selfhosted

TL;DR: LocalGPT runs a fully offline RAG pipeline — your documents stay on your machine, nothing touches a cloud server, no API key required. Setup takes under 10 minutes if you have Python and a CUDA GPU. The main trade-offs against tools like AnythingLLM are that LocalGPT is single-user and doesn’t persist chat history between sessions.

What you’ll have running after this guide:

LocalGPT ingesting PDF, DOCX, TXT, CSV, and Markdown files from a folder on your machine
Llama 3 8B (or a model of your choice) answering questions about those documents, running entirely offline
A working CUDA or CPU setup with the configuration to swap in a larger model when you need it

Honest take: LocalGPT is the leanest path to private document Q&A for a single user. If you need team access or session history, AnythingLLM does more.

What LocalGPT actually does

LocalGPT is a RAG (retrieval-augmented generation) tool built on LangChain. You drop documents into a source folder, run an ingestion script that chunks and embeds them into a local Chroma vector database, then query those embeddings through a local LLM. Every step runs on your hardware.
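To make the chunking step concrete, here is a minimal sketch of overlapping, character-based chunking in plain Python. It is not LocalGPT’s actual splitter (the project relies on LangChain for that), and the function name chunk_text and the sizes 1000 and 200 are illustrative assumptions, not values taken from the repo.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    chunk_size and overlap are illustrative defaults, not LocalGPT's settings.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
```

With overlap, a sentence that straddles a chunk boundary is more likely to appear intact in one of the two neighbouring chunks, which tends to help retrieval.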

The main branch uses HuggingFace model downloads and LlamaCpp for GGUF models — no cloud calls, no telemetry, no internet connection once you’ve downloaded the model. That’s what this guide covers.

There’s also a localgpt-v2 branch that replaces the stack with Ollama as the LLM backend. It’s architecturally cleaner, but as of mid-2026 it’s only tested on macOS and supports only PDF ingestion. The main branch is the stable choice for production use today.

Prerequisites

Before cloning anything:

Requirement	Minimum	Recommended
Python	3.10	3.11
RAM	8 GB	16 GB
VRAM (NVIDIA)	None (CPU fallback)	8 GB+
Disk space	20 GB	40 GB (multiple models)
OS	Linux / macOS / Windows	Ubuntu 22.04

CUDA is optional but matters a lot for practical use. On a CPU, expect 3–8 tokens/sec with a 7B model. On a GPU with 8 GB VRAM you get 25–35 tokens/sec. On a GPU with 24 GB VRAM — an RTX 3090 being the common choice — a 13B GGUF model runs at around 40 tokens/sec.

Check your CUDA version before installing:

nvcc --version
# Also check nvidia-smi: the "CUDA Version" in the top-right corner is the newest version your driver supports, which can be higher than the installed toolkit

You need CUDA 11.8 or 12.x. PyTorch 2.x supports both — you’ll select the right wheel during install.
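If you’d rather not eyeball the version, the lookup can be sketched in Python. This is a hedged helper, not part of LocalGPT: parse_nvcc_release and pip_command are hypothetical names, the two index URLs are the ones used in Step 1 of this guide, and treating every 12.x toolkit as a cu121 install is a simplifying assumption. The PyTorch install configurator remains the authority.

```python
import re
from typing import Optional

# The two index URLs are the ones used in Step 1 of this guide.
CU121 = "https://download.pytorch.org/whl/cu121"
CU118 = "https://download.pytorch.org/whl/cu118"
BASE_CMD = "pip install torch torchvision torchaudio"


def parse_nvcc_release(nvcc_output: str) -> Optional[str]:
    """Pull 'major.minor' out of the 'release X.Y' text that nvcc --version prints."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return f"{match.group(1)}.{match.group(2)}" if match else None


def pip_command(cuda_version: Optional[str]) -> str:
    """Return the pip command for a CUDA version, or the CPU-only command."""
    if cuda_version:
        major, _, minor = cuda_version.partition(".")
        if major == "12":  # assumption: any 12.x toolkit uses the cu121 wheel
            return f"{BASE_CMD} --index-url {CU121}"
        if (major, minor) == ("11", "8"):
            return f"{BASE_CMD} --index-url {CU118}"
    return BASE_CMD  # no supported CUDA version detected: CPU-only install


sample = "Cuda compilation tools, release 12.1, V12.1.105"
print(pip_command(parse_nvcc_release(sample)))
```

To run it against your own machine, pass in the text printed by nvcc --version instead of the sample string.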

Step 1: Clone and install

git clone https://github.com/PromtEngineer/localGPT.git
cd localGPT
python -m venv venv
source venv/bin/activate       # Windows: venv\Scripts\activate

Install PyTorch first, matching your CUDA version:

# CUDA 12.1
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# CUDA 11.8
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# CPU-only
pip install torch torchvision torchaudio

Then install the project dependencies:

pip install -r requirements.txt

This pulls in LangChain, ChromaDB, sentence-transformers, pdfminer.six, docx2txt, and unstructured — expect 5–10 minutes and roughly 3 GB of packages.

Step 2: Choose and configure a model

LocalGPT runs GGUF models via LlamaCpp. Open constants.py and update these two fields:

MODEL_ID = "TheBloke/Llama-3-8B-Instruct-GGUF"
MODEL_BASENAME = "llama-3-8b-instruct.Q4_K_M.gguf"

The first run downloads the model from HuggingFace (~4.7 GB for Q4_K_M 8B). If your GPU has 24 GB VRAM, the 13B model is a meaningful upgrade:

MODEL_ID = "TheBloke/Llama-3-13B-Instruct-GGUF"
MODEL_BASENAME = "llama-3-13b-instruct.Q4_K_M.gguf"

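The 13B Q4_K_M file weighs ~7.3 GB and handles instruction following and long-context reasoning noticeably better. Verify that the repository and the exact filename exist on the HuggingFace model card before downloading: Meta’s Llama 3 release shipped in 8B and 70B sizes, so any 13B repo is a community variant, and names vary by uploader.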

Llama 3 is released under the Meta Llama 3 Community License, which permits commercial use below 700 million monthly active users. For an explanation of how GGUF quantization levels compare, see the quantization guide.

You can also use Mistral-7B or other GGUF models. Any model hosted on HuggingFace in GGUF format works — just update MODEL_ID and MODEL_BASENAME.
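A quick way to catch a wrong MODEL_BASENAME before a multi-gigabyte download is to compare it against the repo’s file list. The sketch below is an assumption-laden helper, not LocalGPT code: find_gguf and fetch_repo_files are hypothetical names, and fetch_repo_files needs network access plus the huggingface_hub package (it calls that library’s list_repo_files function).

```python
def fetch_repo_files(repo_id: str) -> list[str]:
    """List the files in a HuggingFace repo (needs network and huggingface_hub)."""
    # Imported lazily so find_gguf below stays usable offline.
    from huggingface_hub import list_repo_files
    return list_repo_files(repo_id)


def find_gguf(repo_files: list[str], basename: str) -> str:
    """Return basename if the repo contains it, else raise with the GGUF files found."""
    ggufs = sorted(f for f in repo_files if f.lower().endswith(".gguf"))
    if basename in ggufs:
        return basename
    raise FileNotFoundError(
        f"{basename!r} is not in the repo; available GGUF files: {ggufs}"
    )


# Offline demonstration with a made-up file list.
files = ["README.md", "example.Q4_K_M.gguf", "example.Q5_K_M.gguf"]
print(find_gguf(files, "example.Q4_K_M.gguf"))
```

In practice you would call find_gguf(fetch_repo_files(MODEL_ID), MODEL_BASENAME) with the two values from constants.py; the error message then lists the filenames you can actually choose from.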

Step 3: Add your documents

Drop files into the SOURCE_DOCUMENTS/ folder in the repo root. LocalGPT reads the following formats, dispatching a different loader per file extension:

Format	LangChain loader used
`.pdf`	PDFMinerLoader
`.txt`	TextLoader
`.md`	TextLoader
`.py`	TextLoader
`.csv`	CSVLoader
`.xls`, `.xlsx`	UnstructuredExcelLoader
`.docx`, `.doc`	Docx2txtLoader

Mixed formats work fine — you can have PDFs, DOCX files, and CSVs in the same folder and they all get ingested in one pass.
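The per-extension dispatch in the table above can be mirrored as a small preflight check that tells you which files have no matching loader before you start a long ingestion run. This is a sketch only: the mapping copies the table, the loader names are kept as plain strings so nothing from LangChain is imported, and lowercasing the extension is an assumption about matching that LocalGPT itself may not make.

```python
from pathlib import PurePath

# Mirrors the format table above; loader names are plain strings here.
LOADER_BY_EXT = {
    ".pdf": "PDFMinerLoader",
    ".txt": "TextLoader",
    ".md": "TextLoader",
    ".py": "TextLoader",
    ".csv": "CSVLoader",
    ".xls": "UnstructuredExcelLoader",
    ".xlsx": "UnstructuredExcelLoader",
    ".docx": "Docx2txtLoader",
    ".doc": "Docx2txtLoader",
}


def classify(filenames):
    """Split filenames into {name: loader} and a list of names with no loader."""
    supported, unsupported = {}, []
    for name in filenames:
        # Assumption: extensions are compared case-insensitively.
        loader = LOADER_BY_EXT.get(PurePath(name).suffix.lower())
        if loader:
            supported[name] = loader
        else:
            unsupported.append(name)
    return supported, unsupported


ok, skipped = classify(["report.pdf", "notes.md", "scan.png"])
print(ok)
print(skipped)
```

To check a real folder, feed it the names from pathlib.Path("SOURCE_DOCUMENTS").rglob("*") and look at the second return value.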

Then run ingestion:

python ingest.py --device_type cuda

Expected output:

Loading documents from SOURCE_DOCUMENTS
Loading new documents: 100%|██████████| 12/12 [00:08<00:00,  1.47it/s]
Loaded 12 new documents from SOURCE_DOCUMENTS
Split into 1247 chunks of text
Creating embeddings. May take some minutes...
Using embedded DuckDB with persistence: storing vectors in DB

CPU-only:

python ingest.py --device_type cpu

Ingestion time scales with document count and chunk size. For 50 MB of PDFs, expect 2–5 minutes on GPU and 10–20 minutes on CPU. The vector database is stored in DB/. Re-running ingest.py only processes files not already in the DB, so you can add documents incrementally.

Step 4: Run a query session

python run_localGPT.py --device_type cuda

The model loads in 30–60 seconds, then you get a prompt:

> Enter a query:

Type any question about your documents. LocalGPT retrieves the top-k most relevant chunks and passes them to the model along with your question. Add --show_sources to see which files and pages each answer draws from:

python run_localGPT.py --device_type cuda --show_sources

Example session:

> What are the main findings in the Q3 risk report?

Answer: The Q3 report identifies three primary risks: supply chain disruption in APAC
markets, rising material costs impacting gross margin by approximately 2–3%, and
upcoming regulatory changes in the EU affecting product certification timelines.

Source Documents:
../SOURCE_DOCUMENTS/q3_risk_report_2025.pdf (page 4): ...material cost pressures have
intensified in Q2–Q3 2025, with polysilicate pricing up 18%...
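The retrieval step behind that session can be sketched in a few lines. This toy version ranks chunks by cosine similarity between hand-written vectors and then assembles a prompt. LocalGPT itself gets its vectors from the embedding model and its search from Chroma; the names top_k and build_prompt, the default k=4, and the prompt wording are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, chunks, k=4):
    """chunks is a list of (text, vector) pairs; return the k most similar texts."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question, context_chunks):
    """Combine the retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(context_chunks)
    return (
        "Use only the context below to answer. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Toy 2-dimensional "embeddings" purely for demonstration.
store = [("costs rose", [1.0, 0.0]), ("new office", [0.0, 1.0]), ("margin fell", [0.9, 0.1])]
best = top_k([1.0, 0.0], store, k=2)
print(build_prompt("What happened to costs?", best))
```

The instruction to answer only from the context is what lets a well-behaved model decline when nothing relevant was retrieved.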

GPU vs CPU: actual performance expectations

These numbers are based on community benchmarks for LlamaCpp-based inference with Q4_K_M quantization. Your results will vary with RAM speed, chunk count, and context length.

Hardware	Model	Tokens/sec	Typical response time
CPU (8-core, 32 GB RAM)	Llama 3 8B Q4_K_M	3–8	20–50 sec
RTX 3060 12 GB	Llama 3 8B Q4_K_M	25–35	4–8 sec
RTX 3090 24 GB	Llama 3 13B Q4_K_M	35–45	5–10 sec
RTX 4090 24 GB	Llama 3 13B Q4_K_M	55–70	3–6 sec

CPU inference is workable for occasional queries. For ongoing use with a 13B+ model, GPU memory bandwidth is the decisive factor. If you want to run 70B models that exceed consumer hardware, RunPod rents A100 instances by the hour: you can spin one up, run a batch of queries, and pay a few dollars rather than buying a $2,500 GPU.
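The response-time column follows from simple arithmetic: answer length divided by generation speed. The sketch below assumes a 150-token answer, which is an illustrative figure rather than a measurement, and it ignores model load and prompt processing time.

```python
def response_seconds(answer_tokens: int, tokens_per_sec: float) -> float:
    """Rough generation time: tokens to produce divided by tokens per second."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return answer_tokens / tokens_per_sec


ANSWER_TOKENS = 150  # assumed answer length, for illustration only

for label, rate in [("CPU, slow end", 3), ("CPU, fast end", 8), ("GPU, 8B model", 30)]:
    print(f"{label}: {response_seconds(ANSWER_TOKENS, rate):.1f} s")
```

At 3 tokens/sec a 150-token answer takes 50 seconds, while at 30 tokens/sec it takes 5, which is why the same question feels very different on CPU and GPU.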

Swapping to a different model

Model configuration lives entirely in constants.py. To switch:

# Smaller, faster — good for CPU
MODEL_ID = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
MODEL_BASENAME = "mistral-7b-instruct-v0.2.Q4_K_M.gguf"

# Embedding model (also swappable)
EMBEDDING_MODEL_NAME = "all-MiniLM-L6-v2"

Restart after any change to constants.py. LocalGPT downloads the new model from HuggingFace on the next run. Downloaded models are cached in ~/.cache/huggingface/ so switching back is instant.
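To see whether a model is already cached before you switch, you can compute the folder it would occupy. This sketch assumes the default HuggingFace hub cache layout, where each repo lives under hub/ in a folder named models--ORG--NAME. The function names are hypothetical, and a custom HF_HOME or an explicit cache directory in your setup would change the location.

```python
from pathlib import Path


def cache_dir_for(model_id: str, cache_root=None) -> Path:
    """Return the folder a model repo would occupy in the HuggingFace hub cache."""
    # Assumption: default cache location, unless a root is passed in explicitly.
    root = Path(cache_root) if cache_root else Path.home() / ".cache" / "huggingface" / "hub"
    return root / ("models--" + model_id.replace("/", "--"))


def is_cached(model_id: str, cache_root=None) -> bool:
    """True if the cache folder for this repo already exists."""
    return cache_dir_for(model_id, cache_root).is_dir()


model = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
print(cache_dir_for(model))
print("cached" if is_cached(model) else "will be downloaded on next run")
```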

For the embedding model, all-MiniLM-L6-v2 is a solid default — fast, ~80 MB. If you’re working with very technical documents, BAAI/bge-large-en-v1.5 often retrieves more relevant chunks.

Troubleshooting

“no kernel image is available for execution on the device”

This error means the installed PyTorch build contains no compiled kernels that your GPU can run, which usually traces back to a wheel that doesn’t match your CUDA setup. Fix it by reinstalling PyTorch with the wheel that matches nvcc --version:

pip uninstall torch torchvision torchaudio -y
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

Use the PyTorch install configurator to pick the exact command for your setup. Don’t guess the CUDA version — verify it with nvcc --version first.
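Before reinstalling, it can help to see what PyTorch itself reports. The sketch below collects the CUDA version the wheel was built for, the GPU’s compute capability, and the architectures the build was compiled for. The function name cuda_report is hypothetical; the torch calls it uses are standard, and it returns a short report instead of failing if PyTorch is not importable.

```python
def cuda_report() -> dict:
    """Collect the PyTorch/CUDA facts that matter for a 'no kernel image' error."""
    try:
        import torch
    except (ImportError, OSError):
        return {"torch_installed": False}

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "built_for_cuda": torch.version.cuda,  # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        report["gpu_name"] = torch.cuda.get_device_name(0)
        report["compute_capability"] = torch.cuda.get_device_capability(0)
        report["compiled_arches"] = torch.cuda.get_arch_list()
    return report


for key, value in cuda_report().items():
    print(f"{key}: {value}")
```

If built_for_cuda is None you have a CPU-only wheel, and if your GPU’s compute capability has no counterpart in compiled_arches, the wheel was not built for your card.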

PDF ingestion fails: “PDFInfoNotInstalledError” or “pdfinfo not found”

Some langchain PDF loaders depend on Poppler. Install it for your OS:

# Ubuntu / Debian
sudo apt install poppler-utils

# macOS
brew install poppler

# Windows: download Poppler for Windows, add bin/ to PATH

If you’re using PDFMinerLoader specifically (the default in LocalGPT), you likely don’t need Poppler; this error usually appears when a different loader gets invoked. Check that requirements.txt lists pdfminer.six rather than pypdf or pymupdf, which have different system dependencies.
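To confirm whether Poppler is actually visible to Python, check for the pdfinfo binary on your PATH. This is a minimal sketch using only the standard library; poppler_path is a hypothetical helper name.

```python
import shutil
from typing import Optional


def poppler_path() -> Optional[str]:
    """Return the full path to Poppler's pdfinfo binary, or None if it is not on PATH."""
    return shutil.which("pdfinfo")


found = poppler_path()
print(f"pdfinfo found at {found}" if found else "pdfinfo not on PATH")
```

Run it inside the same virtual environment and shell you use for ingest.py, since PATH can differ between terminals, especially on Windows.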

CUDA extension warnings from AutoGPTQ

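You may see warnings like CUDA extension not installed from AutoGPTQ during model load. AutoGPTQ only handles GPTQ-quantized models, so the warning is harmless if you run a GGUF model through LlamaCpp as in this guide. With a GPTQ model, LocalGPT typically falls back to a slower CPU-based implementation and still works. To get full CUDA acceleration: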

pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu121/

Match the cu121 suffix to your actual CUDA version.

Model download stalls or fails

HuggingFace downloads can drop mid-stream on slow connections. Kill the process and re-run — the HuggingFace library resumes from its cache in ~/.cache/huggingface/. If a partial download keeps failing, delete the incomplete file from the cache and restart.
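Finding the leftover partial file by hand is tedious, so here is a sketch that lists candidates. It assumes the HuggingFace library marks unfinished downloads with an .incomplete suffix inside the cache, which matches current huggingface_hub behavior but is worth confirming on your install. The function only lists files; deleting them is left to you.

```python
from pathlib import Path


def find_incomplete(cache_root) -> list:
    """Return all *.incomplete files under cache_root, sorted by path."""
    root = Path(cache_root).expanduser()
    if not root.is_dir():
        return []  # nothing cached here yet
    return sorted(p for p in root.rglob("*.incomplete") if p.is_file())


for path in find_incomplete("~/.cache/huggingface"):
    size_mb = path.stat().st_size / 1_000_000
    print(f"{size_mb:8.1f} MB  {path}")
```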

When NOT to use LocalGPT

Multi-user access: LocalGPT has no authentication layer, no user accounts, and no document separation between users. If two people need separate document collections, use AnythingLLM — it handles multi-user workspaces with role-based access.

Persistent conversation history: Every LocalGPT session starts fresh. There’s no stored conversation that carries over. If you need to reference earlier answers or maintain a running thread across days, AnythingLLM and Open WebUI both handle this better.

Non-document workflows: LocalGPT is purpose-built for document Q&A. It doesn’t do web search, tool use, or multi-step agent tasks. For those, look at Open Interpreter or a full agent framework.

Windows production use: LocalGPT works on Windows, but most troubleshooting in the community assumes Linux. Poppler path issues and CUDA DLL problems are more common. Docker deployment sidesteps most of this.

LocalGPT vs AnythingLLM vs Open WebUI

	LocalGPT	AnythingLLM	Open WebUI
Primary use	Solo document Q&A	Team RAG + LLM chat	LLM chat + document attachment
Multi-user	No	Yes	Yes
Persistent history	No	Yes	Yes
Setup complexity	Medium (Python venv + deps)	Medium (Docker)	Low (Docker one-liner)
Document formats	PDF, TXT, DOCX, CSV, MD, XLS, PY	PDF, TXT, DOCX, and more	PDF, TXT
Internet required	No (after model download)	No	No
License	MIT	MIT	MIT

LocalGPT wins on simplicity and complete data isolation. Everything runs in a single Python environment with no containers to maintain. AnythingLLM wins when you need multiple users, persistent sessions, or broader document type support with a richer UI.

FAQ

Does LocalGPT require an internet connection after setup?
Only for the initial model download from HuggingFace. After that, everything is local. The model, vector database, and all embeddings live on your disk.

Can I use LocalGPT with Ollama models?
Not with the stable main branch. The localgpt-v2 branch uses Ollama as its LLM backend, but it’s still in active development and currently only tested on macOS. The main branch uses HuggingFace + LlamaCpp directly.

How many documents can it handle?
ChromaDB scales to thousands of documents without issue. The practical limit is retrieval quality — the more chunks exist, the harder it is to surface the most relevant ones. For very large document sets (10,000+ pages), a dedicated vector database like Qdrant with more retrieval tuning options handles it better.

What if I ask about something not in my documents?
LocalGPT passes your question and the retrieved chunks to the LLM. If no relevant chunks exist, a well-tuned model like Llama 3 Instruct will say it can’t answer based on the provided context. Smaller or less instruction-tuned models sometimes hallucinate — this is a model behavior issue, not specific to LocalGPT.

Does it work without a GPU?
Yes. Use --device_type cpu on both the ingest and run commands. Responses take roughly 20–50 seconds for an 8B model on a modern CPU, in line with the performance table above. Usable for occasional queries; frustrating for iterative research sessions.

Recommended Gear

RTX 3090 24GB — runs Llama 3 13B Q4_K_M at ~40 tok/sec, best value for local RAG at this tier
RTX 4090 24GB — fastest consumer option, ~55–70 tok/sec on 13B models

Sources

PromtEngineer/localGPT GitHub Repository — official source code, README, constants.py DOCUMENT_MAP
LocalGPT Issue #171: Handling ingestion file types — DOCUMENT_MAP and supported file format loaders
LocalGPT Issue #150: CUDA mismatch with AutoGPTQ — CUDA version conflict details and fix
LocalGPT Issue #616: Current build won’t ingest PDFs — PDF loader troubleshooting
PyTorch Get Started Locally — CUDA wheel selection guide
GPU Benchmarks on LLM Inference — XiongjieDai — GPU inference speed reference data
llama.cpp CPU vs GPU inference speed — CPU vs GPU performance comparison benchmarks
