RAG Deep Dive 2026: Chunking, Embedding, and Retrieval


TL;DR: Retrieval-Augmented Generation is simple to prototype and easy to get wrong in production. The three biggest levers (chunking strategy, embedding model, and retrieval architecture) all interact, and a bad choice at the chunking stage poisons everything downstream. For local setups, the right default stack in 2026 is hierarchical chunking, nomic-embed-text v1.5, and a hybrid BM25 + vector retriever. Add a cross-encoder reranker only when precision becomes your measured bottleneck.

|  | Dense-only | Hybrid (BM25 + vectors) | Hybrid + reranking |
| --- | --- | --- | --- |
| Best for | Prototypes, uniform prose | Mixed content, production | High-stakes or domain-specific retrieval |
| Query latency (p50) | ~5ms | ~15–25ms | ~80–150ms |
| Infrastructure | Vector DB only | Vector DB + BM25 index | + cross-encoder on CPU/GPU |
| The catch | Misses exact-term queries | More moving parts | Latency grows linearly with candidates |

Honest take: Start with hierarchical chunking + nomic-embed-text + Chroma. Add hybrid search and a BGE reranker when retrieval quality is your measured bottleneck — not because the architecture diagram looks more impressive.


What a RAG Pipeline Actually Does

RAG has three distinct phases that most introductions blur together:

  1. Ingestion — parse documents, split into chunks, embed each chunk, store vectors in a database
  2. Retrieval — given a user query, embed the query, find the closest chunks, optionally rerank
  3. Generation — pass the top chunks as context to an LLM, generate an answer

The LLM is the least interesting part of the pipeline. Every significant quality problem in RAG traces back to ingestion or retrieval — usually ingestion. A 2025 analysis of production RAG failures found that over 80% of errors originated at the chunking and indexing stage, not from the LLM hallucinating. Getting chunks right matters more than model size.

The reason this matters for local setups specifically: you can’t compensate for bad retrieval with a bigger model when you’re capped at 8B or 13B parameters. Cloud RAG pipelines mask chunking problems because a GPT-4-class model is good enough to reason across partially relevant context. A Llama 3 8B model is not.


Ingestion: Where Most Pipelines Break

Fixed-Size Chunking

The default in most tutorials: split text every N tokens, with K tokens of overlap. Fast to implement, bad in practice. A 512-token fixed-size chunk on a legal document will split mid-clause. A 512-token chunk on a codebase will split mid-function. Overlap doesn’t fix this — it just duplicates the broken boundary.

Fixed-size chunking works acceptably on one type of content: short, homogeneous paragraphs (FAQs, product descriptions, customer support logs). Everywhere else it degrades retrieval quality.

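# Fixed-size: fast but brittle on structured documents
# (import path for recent LangChain releases; older ones use langchain.text_splitter)
from langchain_text_splitters import TokenTextSplitter

# document_text: the raw string of one parsed document
splitter = TokenTextSplitter(chunk_size=512, chunk_overlap=64)
chunks = splitter.split_text(document_text)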

Semantic Chunking

Split on sentence-similarity boundaries: group sentences with cosine similarity above a threshold, break when similarity drops. This keeps coherent topic sections together and avoids mid-sentence cuts.

Slower at ingestion time (you embed sentences to find boundaries) and requires a sensible threshold. A 2026 systematic evaluation found semantic chunking significantly outperforms fixed-size on long-form prose and technical documentation, with modest gains on conversational data. Use it when your documents are long-form articles, research papers, or mixed-topic reports.
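To make the boundary rule concrete, here is a minimal sketch of similarity-based splitting. It is not a production chunker: `toy_embed` is a hypothetical bag-of-words stand-in for a real sentence-embedding model, and the 0.2 threshold is an arbitrary value chosen for this toy data.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Start a new chunk whenever adjacent-sentence similarity drops below threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, curr in zip(sentences, sentences[1:]):
        if cosine(toy_embed(prev), toy_embed(curr)) >= threshold:
            chunks[-1].append(curr)   # same topic: extend the current chunk
        else:
            chunks.append([curr])     # similarity dropped: open a new chunk
    return chunks


sentences = [
    "BM25 ranks documents by term frequency.",
    "BM25 also normalizes term frequency by document length.",
    "Qdrant stores vectors with payload metadata.",
    "Qdrant filters on payload metadata before vector search.",
]
chunks = semantic_chunks(sentences)
```

With a real embedding model the structure stays the same; only `toy_embed` and the threshold change, and the threshold needs tuning per corpus.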

Hierarchical / Parent-Document Chunking

This resolves the fundamental retrieval tension: small chunks match queries precisely, while large chunks give the LLM enough context to reason over.

The mechanism: split each document into small “child” chunks (~150–200 tokens) and larger “parent” chunks (~512–1024 tokens). Embed and index the child chunks. At retrieval time, when a child chunk matches, return its parent to the LLM instead. You get needle-level precision at search, paragraph-level context at generation.

This is the most widely adopted production pattern for 2025–2026 in both LlamaIndex and LangChain deployments. For local setups it requires more preprocessing time but the retrieval quality gain is significant.

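# LlamaIndex hierarchical retrieval
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core.retrievers import AutoMergingRetriever

# Three levels: 1024-token root, 512-token parent, 128-token leaf
parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[1024, 512, 128]
)
nodes = parser.get_nodes_from_documents(documents)

# Keep every level in the docstore, but embed and index only the leaves
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)
index = VectorStoreIndex(get_leaf_nodes(nodes), storage_context=storage_context)

# When enough sibling leaves match, the retriever returns their parent instead
retriever = AutoMergingRetriever(index.as_retriever(similarity_top_k=6), storage_context)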
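Stripped of any framework, the child-to-parent lookup is just a mapping from each small chunk to the larger chunk it came from. The sketch below is illustrative only: it splits on whitespace words rather than model tokens, and every function name in it is hypothetical.

```python
def split_words(text: str, size: int) -> list[str]:
    """Split text into windows of `size` whitespace-delimited words (a stand-in for tokens)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_hierarchy(document: str, parent_size: int = 512, child_size: int = 128):
    """Return (parents, children) where each child records the index of its parent."""
    parents = split_words(document, parent_size)
    children = []
    for parent_id, parent in enumerate(parents):
        for child in split_words(parent, child_size):
            children.append({"text": child, "parent_id": parent_id})
    return parents, children


def expand_to_parents(matched_children: list[dict], parents: list[str]) -> list[str]:
    """Swap matched children for their parents, deduplicated, in first-match order."""
    seen, context = set(), []
    for child in matched_children:
        if child["parent_id"] not in seen:
            seen.add(child["parent_id"])
            context.append(parents[child["parent_id"]])
    return context


document = " ".join(f"w{i}" for i in range(1000))
parents, children = build_hierarchy(document)
# Pretend the vector search matched children 1 and 2 (same parent) and child 5.
context = expand_to_parents([children[1], children[2], children[5]], parents)
```

Only the children would be embedded and searched; the parents are stored as plain text and fetched by id, so the index stays small while the LLM still sees the wider passage.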

Practical guidance: Use fixed-size only for homogeneous short-paragraph corpora. Use semantic chunking for long-form prose. Use hierarchical for any mixed document set where answer quality matters more than ingestion speed.

One counterintuitive finding from a January 2026 systematic analysis: chunk overlap provided no measurable benefit in pipelines using SPLADE retrieval. If you’re running a hybrid pipeline, don’t assume overlap is free — it increases index size and ingestion time for potentially zero recall gain.


Embedding Models: The Local Options in 2026

The embedding model determines how well cosine similarity in vector space corresponds to actual query-document relevance. A mismatch between your domain and the embedding model’s training data produces retrieval failures that look like LLM problems.

nomic-embed-text v1.5

The default recommendation for local setups. 137M parameters, 274MB on disk, 8192-token context window. Runs on CPU without GPU. Scores 62.39 on MTEB (the standard embedding benchmark), which beats OpenAI’s text-embedding-ada-002 at zero API cost.

The v1.5 addition of Matryoshka Representation Learning lets you truncate embeddings to any dimension from 64 to 768 without retraining. At 512 dimensions it outperforms ada-002 while cutting memory usage by 33%. If vector DB storage is a constraint, this is a meaningful lever.
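The storage arithmetic behind that 33% figure is worth seeing once. The sketch below truncates a placeholder vector and computes raw float32 storage for one million vectors; the vector values and the one-million count are made up for illustration. It shows only the slice-and-renormalize step, so check the model card for any extra normalization the model expects before truncation.

```python
import math


def truncate_embedding(vector: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components and rescale to unit length."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head


def index_size_mb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw float32 storage for the vectors alone, ignoring index overhead."""
    return num_vectors * dims * bytes_per_value / 1_000_000


full = [math.sin(i) for i in range(1, 769)]   # placeholder 768-dim vector
short = truncate_embedding(full, 512)

full_mb = index_size_mb(1_000_000, 768)       # 3072.0 MB
short_mb = index_size_mb(1_000_000, 512)      # 2048.0 MB
saving = 1 - short_mb / full_mb               # one third
```

The saving scales linearly with the dimension you keep, so 256 dimensions would cut raw vector storage by two thirds, at whatever recall cost your own evaluation shows.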

Pull and use via Ollama alongside your chat model:

ollama pull nomic-embed-text
# 274MB, CPU-only inference, 8192 token context

BGE-M3

BAAI’s multilingual model: 100+ languages, 8192-token context, MIT license. It is heavier than nomic-embed-text (roughly 570M parameters, a bit over 1GB of weights at fp16), but it is the best open-source option for non-English corpora. On the MTEB multilingual benchmark it leads its size class.

BGE-M3 also outputs sparse embeddings alongside dense vectors — useful if you want to implement hybrid search within a single model inference rather than a separate BM25 index.

all-MiniLM-L6-v2

22M parameters, 80MB, 512-token context. The “fast on CPU” option. When you’re embedding millions of chunks at ingestion time on CPU hardware and can accept lower recall on long documents, MiniLM runs 4–5× faster than nomic-embed-text. Don’t use it as a default — use it when you’ve profiled ingestion speed as your actual bottleneck.

Embedding Model Comparison

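| Model | Params | Size | Context (tokens) | MTEB score | Best for |
| --- | --- | --- | --- | --- | --- |
| nomic-embed-text v1.5 | 137M | 274MB | 8192 | 62.39 | General English RAG default |
| BGE-M3 | ~570M | ~1.2GB (fp16) | 8192 | ~65+ | Multilingual, hybrid search |
| all-MiniLM-L6-v2 | 22M | 80MB | 512 | 56.26 | High-volume CPU ingestion |
| nomic-embed-text-v2-moe | ~256M | ~500MB | 8192 | ~63+ | Multilingual MoE variant |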

For most local RAG setups: nomic-embed-text v1.5 is the right starting point. Switch to BGE-M3 if you need multilingual support or plan to use its sparse output for hybrid search.


The Vector Store Decision

The vector database choice matters less than most tutorials suggest at prototype scale, and quite a bit at production scale.

Chroma runs embedded in your Python process — no separate service, no port to open, trivial to reset between experiments. For development and single-user setups under ~1M vectors it’s the practical default. LangChain and LlamaIndex both have deep Chroma integration with extensive code examples.

Qdrant is the production choice. It adds payload filtering (filter by metadata before vector search, not after), horizontal sharding, and a stable REST API. For a shared team RAG system or a pipeline that needs to filter by document type, date, or department before doing similarity search, Qdrant’s filtering outperforms Chroma’s post-retrieval metadata filtering significantly. Qdrant also has native sparse vector support, which simplifies hybrid BM25 + dense search into a single query.
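Why filtering before the search matters is easiest to see on a toy dataset. The sketch below uses brute-force search over made-up points (the ids, vectors, and `dept` field are all hypothetical) and is not the API of Qdrant or Chroma; it only contrasts the two orderings.

```python
def top_k(query: list[float], points: list[dict], k: int) -> list[dict]:
    """Brute-force nearest neighbours by dot product."""
    def score(point: dict) -> float:
        return sum(q * v for q, v in zip(query, point["vector"]))
    return sorted(points, key=score, reverse=True)[:k]


points = [
    {"id": 1, "vector": [1.0, 0.0], "dept": "legal"},
    {"id": 2, "vector": [0.9, 0.1], "dept": "legal"},
    {"id": 3, "vector": [0.8, 0.2], "dept": "legal"},
    {"id": 4, "vector": [0.2, 0.8], "dept": "hr"},
    {"id": 5, "vector": [0.1, 0.9], "dept": "hr"},
]
query = [1.0, 0.0]

# Post-filter: search first, then drop points whose metadata does not match.
post = [p for p in top_k(query, points, k=2) if p["dept"] == "hr"]

# Pre-filter: restrict the candidate set first, then search within it.
pre = top_k(query, [p for p in points if p["dept"] == "hr"], k=2)
```

Here the post-filtered result is empty because both nearest neighbours belong to the wrong department, while the pre-filtered search still returns two HR chunks. Raising k for the post-filter hides the problem but does not remove it.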

pgvector deserves consideration for teams already running PostgreSQL: it’s an extension, not a new service. Acceptable performance for under ~500k vectors with simple similarity queries, and zero new infrastructure to operate.

The decision tree: prototype → Chroma; production team deployment → Qdrant; existing Postgres shop under 500k vectors → pgvector. For a deeper comparison of these options with performance numbers, see the Chroma vs Qdrant vs Weaviate comparison.


Retrieval: Dense, Sparse, and Hybrid

Dense Retrieval

The default: embed the query, find the top-k chunks by cosine similarity. Works well when the user’s query is semantically similar to the stored content.

Where it fails: exact-term queries on specialized vocabulary. “What is our policy on FMLA leave?” fails on dense search if your documents say “Family and Medical Leave Act” without the acronym. Dense retrieval finds conceptual matches; it misses lexical ones. This is a significant problem for legal, financial, and technical corpora where exact terminology matters.
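For reference, dense retrieval itself is only a few lines once the vectors exist. This is a brute-force sketch with hypothetical 3-dimensional embeddings and made-up chunk ids; a vector database does the same ranking with an approximate nearest-neighbour index.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dense_top_k(query_vec: list[float], chunk_vecs: dict, k: int = 2) -> list[tuple[str, float]]:
    """Rank chunk ids by cosine similarity to the query vector."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical 3-dim embeddings; a real model would produce hundreds of dimensions.
chunk_vecs = {
    "leave-policy": [0.9, 0.1, 0.0],
    "expense-policy": [0.1, 0.9, 0.0],
    "onboarding": [0.4, 0.4, 0.2],
}
results = dense_top_k([1.0, 0.0, 0.0], chunk_vecs, k=2)
```

Nothing in this ranking looks at the words themselves, which is why a rare term the embedding model represents poorly can drop out of the top-k.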

Sparse Retrieval (BM25)

BM25 is a term-frequency algorithm that’s been the backbone of search engines for decades. It excels at exact-term matching, runs on CPU with no GPU required, and handles zero-shot domains (financial jargon, legal terms, proprietary product names) that dense models handle poorly.

Its failure mode is the inverse of dense retrieval: “What are the regulations around remote work?” fails BM25 if your documents discuss “distributed workforce policy” without using “remote work.”

BM25 is fast and cheap: it handles billions of documents on commodity hardware with single-digit millisecond query latency, requires no GPU, no embedding model, and no approximate nearest neighbor index rebuilds.
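A compact scorer shows both the strength and the failure mode. This is a sketch of one common Okapi BM25 variant; the `k1` and `b` defaults are conventional values assumed here, and the three documents are invented. In practice you would use a library or your search engine’s built-in implementation.

```python
import math
from collections import Counter


def bm25_scores(query: list[str], docs: list[list[str]], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each tokenized document against the query with Okapi BM25."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # a term absent from the document contributes nothing
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "employees may request fmla leave after twelve months".split(),
    "distributed workforce policy covers home office equipment".split(),
    "expense reports are due monthly".split(),
]
exact = bm25_scores("fmla leave".split(), docs)
paraphrase = bm25_scores("remote work regulations".split(), docs)
```

The exact-term query scores only the first document, while the paraphrased query scores zero everywhere because no query word appears in any document. That zero is the gap dense retrieval fills.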

Hybrid Search (BM25 + Dense Vectors)

The production standard. Sparse and dense retrieval fail on opposite query types, so combining them produces consistently higher recall than either alone — across essentially every document domain.

The recommended fusion mechanism is Reciprocal Rank Fusion (RRF): rank both result lists independently, then combine ranks by reciprocal formula. No score calibration, no hyperparameter tuning, robust to domain shift.
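RRF is small enough to write out in full. In this sketch each document’s fused score is the sum of 1 / (k + rank) over the lists it appears in. The constant k = 60 is the conventional default and is assumed here, and the document ids are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked id lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_c", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Documents that both retrievers rank highly (`doc_a`, `doc_c`) rise above documents only one retriever found. Because only ranks are used, the raw BM25 and cosine scores never need to be put on a common scale.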

Benchmark numbers from a 2026 paper on text-and-table document corpora:

  • BM25 alone: Recall@5 = 0.644
  • Dense alone: Recall@5 = 0.587
  • Hybrid RRF: Recall@5 = 0.695
  • Hybrid RRF + neural reranking: Recall@5 = 0.816

Hybrid beats both single retrievers here: +0.108 Recall@5 over dense-only and +0.051 over BM25 alone. Reranking adds a further +0.121, the largest single gain, but it comes with a latency cost (covered next).
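To produce comparable numbers on your own corpus, Recall@k needs only the retrieved ids and a labeled set of relevant ids per query. The sketch below uses the standard definition, which may differ in detail from the cited paper’s; the chunk ids are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int = 5) -> float:
    """Average Recall@k over (retrieved, relevant) pairs, one pair per query."""
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)


runs = [
    (["c1", "c7", "c3", "c9", "c4", "c2"], {"c3", "c2"}),  # c2 is ranked 6th: 1 of 2 found
    (["c5", "c8", "c6", "c1", "c0"], {"c5"}),              # 1 of 1 found
]
score = mean_recall_at_k(runs, k=5)
```

Running the same labeled queries through dense-only, BM25-only, and hybrid retrievers gives you a like-for-like comparison before you commit to the extra infrastructure.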

LangChain’s EnsembleRetriever and LlamaIndex’s QueryFusionRetriever both implement hybrid retrieval on top of any vector store paired with a BM25 index:

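# LangChain hybrid retrieval (BM25Retriever needs the rank_bm25 package)
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Chroma

# docs: list of LangChain Document objects; embedding_fn: any LangChain Embeddings instance
bm25_retriever = BM25Retriever.from_documents(docs)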
bm25_retriever.k = 10

vector_retriever = Chroma.from_documents(
    docs, embedding_fn
).as_retriever(search_kwargs={"k": 10})

# EnsembleRetriever applies RRF internally
ensemble = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.5, 0.5]
)

Reranking: The Stage Most Pipelines Skip

Initial retrieval optimizes for recall — cast a wide net, don’t miss relevant chunks. Reranking optimizes for precision — reorder so the most relevant chunks end up in position 1–10, not buried at position 40.

A cross-encoder takes the query and a candidate chunk together, runs full attention over the pair, and produces a single relevance score. This is more expensive than the bi-encoder embedding approach used for retrieval, but produces significantly better relevance ordering.

The standard two-stage architecture:

  1. Hybrid retrieval → top-50 candidates
  2. Cross-encoder reranker → top-10 results
  3. Top-10 passed to LLM as context

For local inference: BAAI/bge-reranker-v2-m3 (Apache 2.0) matches Cohere Rerank quality at zero API cost. At roughly 8ms per query-document pair on CPU, reranking 50 candidates takes ~400ms on a modern CPU. On a GPU the full batch drops under 50ms.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

# candidates: list of Document objects from hybrid retrieval
pairs = [(query, doc.page_content) for doc in candidates[:50]]
scores = reranker.predict(pairs)

# Sort by score descending, return top 10
ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
top_chunks = [doc for doc, _ in ranked[:10]]

Don’t add reranking by default. Add it when your LLM produces answers citing the wrong sections, or when RAGAS context precision sits below 0.6. Skip it when you’re CPU-only and latency matters more than precision, or when your corpus is small enough that dense retrieval already returns the correct context.


A Reference Stack for Local RAG

For most local setups — single developer, 10k–500k document chunks, GPU or CPU machine:

| Layer | Recommendation | Notes |
| --- | --- | --- |
| Chunking | Hierarchical (128 child / 512 parent tokens) | LlamaIndex HierarchicalNodeParser |
| Embedding | nomic-embed-text v1.5 via Ollama | CPU-only, 8k context, 274MB |
| Vector store | Chroma (dev) / Qdrant (prod) | Both have LangChain + LlamaIndex connectors |
| Retrieval | Hybrid BM25 + dense, RRF fusion | EnsembleRetriever in LangChain |
| Reranking | bge-reranker-v2-m3 | Add when precision is your measured bottleneck |
| LLM | Llama 3 8B via Ollama | Llama 3.3 70B for multi-hop reasoning |

For teams without local GPU capacity for a 70B model, RunPod provides on-demand GPU inference at significantly lower cost than OpenAI for high-volume RAG workloads.

If you’re choosing between RAG frameworks: LlamaIndex is the better default in 2026 for document-heavy pipelines — it ships hierarchical retrieval, query rewriting, and reranking support without third-party plugins. LangChain is more flexible for multi-step agent workflows that chain retrieval with tool use. For a no-code RAG implementation with a UI, see the AnythingLLM review for a full capability assessment, and the AnythingLLM local RAG setup guide for the full workflow without writing Python.


When NOT to Use RAG

RAG is not the answer to every “I want my LLM to know about X” problem.

Use a longer context window instead when your documents are small enough to fit. With 128k-context models widely available in 2026, loading a 50-page manual directly into context often outperforms a chunked RAG pipeline on coherence and multi-hop reasoning. RAG introduces retrieval failure modes that full-context approaches avoid entirely. If your corpus is under ~100 pages, try full-context first.
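A back-of-the-envelope check tells you whether full-context is even an option. Every constant in this sketch is an assumption to replace with your own measurements: 500 words per page, 1.3 tokens per word, and 4,000 tokens reserved for the prompt and answer.

```python
def fits_in_context(pages: int, context_window: int = 128_000,
                    words_per_page: int = 500, tokens_per_word: float = 1.3,
                    reserved_for_answer: int = 4_000) -> tuple[int, bool]:
    """Estimate corpus tokens and whether they fit alongside the reserved budget."""
    estimated = round(pages * words_per_page * tokens_per_word)
    return estimated, estimated <= context_window - reserved_for_answer


tokens_50, ok_50 = fits_in_context(50)     # about 32,500 tokens: fits
tokens_250, ok_250 = fits_in_context(250)  # about 162,500 tokens: does not fit
```

For a real decision, count tokens with your model’s own tokenizer; dense tables and code tokenize far less efficiently than prose.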

Use fine-tuning instead when the knowledge is stable and well-structured, and the LLM needs to internalize reasoning patterns rather than recall specific facts. A customer support bot that needs to sound like your brand and handle product-specific edge cases is a fine-tuning candidate. RAG keeps knowledge external; fine-tuning bakes it into weights. They’re complementary for large, frequently updated corpora.

Don’t use RAG when your corpus changes faster than your indexing infrastructure can keep up. A RAG system on live operational data requires continuous ingestion pipelines with deduplication, chunk invalidation, and re-embedding on updates. If your data changes hourly, you need a proper search integration (Elasticsearch, database full-text search) rather than a batch-indexed vector store.

The honest limitation of local RAG with small models: quality degrades fast on complex multi-hop questions. A Llama 3 8B model will miss implicit references and struggle to reason across multiple retrieved chunks when the answer spans different sections. If you’re getting poor answers on questions that require connecting several facts, and the relevant chunks are actually being retrieved, the bottleneck is the LLM’s reasoning capability, not your retrieval pipeline. The fix is a larger model, not more retrieval complexity. See the quantization guide for how to fit larger models into limited VRAM.


Measuring Whether It’s Working

Don’t skip evaluation. A RAG system that feels like it’s working often isn’t when measured against real questions — especially on the edge cases users actually care about.

The RAGAS framework provides four metrics that cover the full pipeline:

  • Faithfulness (0–1): does the answer stay within what the retrieved context says? Catches hallucination — the LLM inventing facts not present in the chunks.
  • Answer relevancy (0–1): does the answer address the question? Catches tangential or incomplete responses.
  • Context precision (0–1): are the retrieved chunks the right ones? Directly scores your retrieval step.
  • Context recall (0–1): are all relevant facts present in the retrieved chunks? Catches gaps in coverage.

Build a labeled test set of 50–100 representative query-answer pairs before shipping. Context precision below 0.6 means your retrieval is broken. Faithfulness below 0.7 means your LLM is hallucinating past the context. Fixing the retrieval problem first is almost always higher leverage than prompt engineering.
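Those two thresholds translate directly into a triage rule. The sketch below assumes you already have the scores in a plain dictionary; the key names and action strings are hypothetical, and it does not call the RAGAS API.

```python
def triage(scores: dict[str, float]) -> list[str]:
    """Map metric scores to next actions, checking retrieval before generation."""
    actions = []
    if scores["context_precision"] < 0.6:
        actions.append("fix retrieval: chunking, embedding model, or hybrid search")
    if scores["faithfulness"] < 0.7:
        actions.append("fix generation: tighten the prompt or use a stronger model")
    return actions or ["no threshold breached"]


report = triage({"context_precision": 0.52, "faithfulness": 0.81})
```

Retrieval is checked first on purpose: a model given the wrong chunks will look unfaithful no matter how the prompt is worded.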

pip install ragas langfuse
# RAGAS integrates with LangSmith and Langfuse for continuous production monitoring

Run evaluations against your test set whenever you change chunking strategy, swap embedding models, or update your document corpus significantly. These changes invalidate previous quality measurements more often than developers expect.


Frequently Asked Questions

What chunk size should I start with for RAG? Start with hierarchical chunking using 128-token child chunks and 512-token parent chunks. This works well across most document types and is the most widely adopted production pattern in 2026. Avoid 512-token fixed-size chunks as a default — they break semantic context at arbitrary token boundaries.

Can I run a complete RAG pipeline without a GPU? Yes. nomic-embed-text v1.5 (274MB) runs on CPU, Chroma runs in-process with no separate service, and BM25 requires no GPU at all. The bottleneck is LLM inference. A 7B quantized model at Q4_K_M runs at 1–3 tokens/sec on a modern CPU, which is slow but functional for async or low-volume use. On CPU-only machines, Llama 3 8B at Q4_K_M is the practical ceiling.

How many documents can local RAG handle? Chroma handles up to ~1M vectors reliably in embedded mode. On a machine with 16GB RAM and an RTX 4070, both embedding and retrieval stay fast at that scale. Beyond 1M vectors, switch to Qdrant for better memory management, payload filtering, and horizontal scaling.

What’s the difference between RAG and fine-tuning for adding knowledge to an LLM? RAG injects knowledge at inference time by retrieving relevant chunks at query time. Fine-tuning bakes knowledge into model weights at training time. RAG is better for frequently changing information and exact document lookup. Fine-tuning is better for behavioral changes and reasoning patterns that need to be internalized. They complement each other for large, updatable corpora.

Why does my RAG pipeline answer questions using information not in my documents? The LLM is completing based on parametric knowledge (what it learned during training) rather than the retrieved context. Two fixes: (1) strengthen your system prompt — explicitly instruct the model to “answer only based on the provided context; if the answer is not in the context, say so”; (2) check context precision with RAGAS — if the wrong chunks are being retrieved, the model fills the gap with training data. Both problems often occur together.
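The first fix can be as simple as a prompt builder that always wraps the retrieved chunks in the same instruction. This is a minimal sketch: the function name, chunk numbering, and layout are illustrative choices, not a required format.

```python
def build_grounded_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a prompt that restricts the model to the retrieved chunks."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer only based on the provided context. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_grounded_prompt(
    "How long is parental leave?",
    ["Parental leave is 16 weeks.", "Leave requests go through HR."],
)
```

Numbering the chunks also lets you ask the model to cite which chunk supports each claim, which makes faithfulness failures easier to spot by hand.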

