Jun 2, 2026

pgvector vs Chroma vs Qdrant for Local RAG 2026

By AIFoss · 12 min read

TL;DR: If you already run PostgreSQL, pgvector (v0.8.2) is the right call — it’s one SQL command away and handles 10M vectors comfortably. Chroma (1.5.9) exists for rapid prototyping and nothing else; it falls apart under concurrent load. Qdrant (v1.17.1) is the choice when you need native payload filtering, scalar quantization, or sub-20ms p95 latency at scale.

	pgvector	Chroma	Qdrant
Best for	Existing Postgres stacks, <10M vectors	Prototypes, scripts, notebooks	Production RAG with filtering, >1M vectors
Setup cost	One `CREATE EXTENSION` if Postgres exists	`pip install chromadb`, 3 lines	One Docker command + client lib
The catch	Performance degrades above 50M vectors	Memory blowup, no sharding, prod risk	New infra to manage; overkill for small datasets
Scalar quantization	No (requires pgvectorscale add-on)	No	Yes — 4× memory savings
Filtered search	WHERE clause (post-filter, slower)	Python-side filtering	Native payload index (pre-filter, fast)

Honest take: Use pgvector if you have Postgres. Use Qdrant if you’re building anything that will see real users. Use Chroma only if you’re experimenting in a notebook and know you’ll replace it.

The Setup Reality

Every vector database tutorial starts with a pip install or docker run and immediately goes into embedding code. What they skip is the maintenance burden you’re signing up for.

pgvector is a PostgreSQL extension. If you already operate Postgres for your application, adding pgvector means one SQL command and no new infrastructure. Your existing backup strategy, connection pooling, monitoring, and access control all carry over. If you don’t have Postgres, you’re now standing up a relational database just to use it as a vector store — which rarely makes sense.

Chroma is a Python library that ships a lightweight embedded server. Zero configuration, zero ports to open, data persists to disk via SQLite + HNSW files. For a developer who wants to test embedding strategies in an afternoon, it genuinely is the fastest path from idea to working code.

Qdrant is a standalone vector database written in Rust. It runs as a separate process (Docker is the recommended path), exposes REST and gRPC APIs, and requires a client library. That’s one more moving part than pgvector and one more process to manage. In exchange, you get a purpose-built engine with features the other two simply don’t have.

pgvector v0.8.2: Zero New Infrastructure

pgvector v0.8.2 was released in February 2026, patching CVE-2026-3172 (a buffer overflow during parallel HNSW index builds that could leak data or crash Postgres). If you’re running an older version, upgrade before building any HNSW index in parallel.

Installation on an existing Postgres instance:

# In a shell, on Ubuntu/Debian with Postgres 16
sudo apt install postgresql-16-pgvector

-- Then in psql:
CREATE EXTENSION IF NOT EXISTS vector;

-- Create a table with a 1536-dim embedding column (OpenAI ada-002 / text-embedding-3-small)
CREATE TABLE documents (
    id     bigserial PRIMARY KEY,
    content text,
    embedding vector(1536)
);

-- Create HNSW index (recommended over IVFFlat for most workloads)
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

What to expect from CREATE INDEX: a plain CREATE INDEX blocks writes to the table until the build finishes, while CREATE INDEX CONCURRENTLY lets writes continue at the cost of a slower build. For 1M rows at 1536 dimensions, expect the index build to take 15–45 minutes on a mid-range server with maintenance_work_mem = 4GB.

The HNSW index memory trap: a 1M-row, 1536-dim HNSW index requires roughly 8–12GB of RAM to query efficiently. If your server doesn’t have that in free memory, Postgres will page the index from disk and your query latency jumps from ~30ms to 300ms+. When memory is tight, lower hnsw.ef_search (the default is 40) to trade recall for speed, for example by running SET LOCAL hnsw.ef_search = 20; inside the transaction that issues the query.

When pgvector wins: your team already knows SQL, your data lives in Postgres, and you’re not crossing the 50M vector threshold. Under that limit, HNSW in pgvector is competitive with purpose-built vector databases. The Timescale team’s benchmarks with pgvectorscale (an optional add-on) show 471 QPS at 99% recall on 50M vectors — though vanilla pgvector without pgvectorscale is slower.

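pgvector’s hard limits: above 50M vectors, expect index builds to take 2+ hours and p95 query latency to drift above 200ms. There’s no native INT8 scalar quantization: the default vector type stores float32, and the only built-in ways to shrink it are the half-precision halfvec type and binary quantization through expression indexes (both added in 0.7.0). Filtered similarity search (WHERE clause + <=> operator) applies the filter after the HNSW scan, which degrades recall significantly when your filter is selective; the iterative index scans added in 0.8.0 (hnsw.iterative_scan) can recover the missing rows at the cost of extra latency.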

Chroma 1.5.9: The Prototype Machine

Chroma 1.5.9 (May 2026) is the fastest way to get a RAG pipeline working. A few lines of Python:

import chromadb

client = chromadb.PersistentClient(path="./chroma_data")
collection = client.get_or_create_collection("docs")

# Add documents (Chroma can call your embedding model or accept pre-computed vectors)
collection.add(
    documents=["Self-hosted AI runs on your hardware", "No data leaves your machine"],
    ids=["doc1", "doc2"]
)

# Query
results = collection.query(query_texts=["local inference"], n_results=5)
print(results["documents"])
# [['Self-hosted AI runs on your hardware', 'No data leaves your machine']]

That works. It persists to disk. You can add metadata filters. For a weekend project or internal tool under 100k documents, Chroma is genuinely fine.

The problems start when you move beyond that:

Memory: Chroma stores vectors as float32 with no native quantization option. A collection of 10 million 1536-dimension vectors occupies roughly 61GB of RAM (10M × 1536 × 4 bytes, or about 57GiB). That number isn’t a gotcha; it’s straightforward float math. Qdrant’s INT8 scalar quantization brings the same dataset to ~15GB.
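The arithmetic is easy to reproduce. A minimal sketch, where vector_memory_bytes is a hypothetical helper (not part of Chroma or Qdrant) and the result covers raw vector storage only, ignoring HNSW graph links and metadata:

```python
def vector_memory_bytes(n_vectors: int, dim: int, bytes_per_value: int) -> int:
    """Raw storage for n_vectors embeddings of `dim` components each."""
    return n_vectors * dim * bytes_per_value


float32 = vector_memory_bytes(10_000_000, 1536, 4)  # float32: 4 bytes per component
int8 = vector_memory_bytes(10_000_000, 1536, 1)     # INT8: 1 byte per component

for label, size in (("float32", float32), ("int8", int8)):
    print(f"{label}: {size / 1e9:.1f} GB ({size / 2**30:.1f} GiB)")
```

The float32 case comes to about 61.4 GB in decimal units, which is the same quantity as 57.2 GiB in binary units, so check which unit your monitoring reports before sizing a server.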

Single-process architecture: as of 1.5.9, Chroma has no sharding and no multi-node support. Concurrent queries compete for the same Python process. Community reports from production deployments describe memory leaks and crashes under sustained load. Chroma’s own documentation recommends using the embedded mode for “development and testing.”

When Chroma is right: notebooks, scripts, local experiments, RAG demos you’re showing a colleague. The moment your app goes to more than one concurrent user, or your document count crosses 500k, reconsider. For something bigger, see AnythingLLM’s multi-user RAG setup or migrate to Qdrant.

Qdrant v1.17.1: Production-First Design

Qdrant v1.17.1 (March 2026) is the most mature local vector database for production RAG. The core setup:

# Pull and run — data persists to ./qdrant_storage
docker run -d \
  --name qdrant \
  --restart unless-stopped \
  -p 6333:6333 \
  -p 6334:6334 \
  -v $(pwd)/qdrant_storage:/qdrant/storage \
  qdrant/qdrant:v1.17.1

Then from Python using the official client:

from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, FieldCondition, Filter, MatchValue, PointStruct, Range,
    ScalarQuantization, ScalarQuantizationConfig, ScalarType, VectorParams,
)

client = QdrantClient("localhost", port=6333)

client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    # INT8 scalar quantization: ~4x memory savings, <1% recall loss
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(type=ScalarType.INT8, always_ram=True)
    ),
)

client.upsert(
    collection_name="docs",
    points=[
        PointStruct(
            id=1,
            vector=[0.1] * 1536,  # your actual embedding here
            payload={"source": "manual.pdf", "page": 3, "author": "alice"}
        )
    ]
)

# Filtered search: only docs from alice, on page > 2
# (query_points replaces the older, deprecated search method)
results = client.query_points(
    collection_name="docs",
    query=[0.1] * 1536,
    query_filter=Filter(
        must=[
            FieldCondition(key="author", match=MatchValue(value="alice")),
            FieldCondition(key="page", range=Range(gt=2)),
        ]
    ),
    limit=5,
).points

Payload filtering is Qdrant’s single biggest advantage over pgvector. It indexes metadata fields as a separate payload index, so filtered queries use pre-filtering rather than filtering on top of HNSW results. For a RAG app where users can only see their own documents, or where you filter by document type or date range, this matters. With pgvector’s WHERE-clause post-filtering, narrow filters can drop recall from 95% to 60%.
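To see why the order of filtering and ranking matters, here is a toy simulation. It is a sketch only: top_k is an exact brute-force search standing in for an ANN index, the tenant field, the 100-tenant split, and the candidate count of 40 are illustrative assumptions, and neither pgvector nor Qdrant works exactly like this internally.

```python
import random


def top_k(query, rows, k):
    """Exact nearest neighbours by squared Euclidean distance (stand-in for an ANN index)."""
    def dist(row):
        return sum((a - b) ** 2 for a, b in zip(query, row["vec"]))
    return sorted(rows, key=dist)[:k]


rng = random.Random(0)
rows = [
    {"id": i, "tenant": i % 100, "vec": [rng.random() for _ in range(8)]}
    for i in range(5000)
]
query = [0.5] * 8

# Post-filter: fetch 40 candidates first, then keep only tenant 7.
candidates = top_k(query, rows, k=40)
post = [r for r in candidates if r["tenant"] == 7][:5]

# Pre-filter: restrict to tenant 7 first, then rank what is left.
pre = top_k(query, [r for r in rows if r["tenant"] == 7], k=5)

print(f"post-filter returned {len(post)} rows, pre-filter returned {len(pre)}")
```

With a filter that matches 1% of rows, 40 candidates contain on average well under one match, so the post-filter path usually comes back short of the five results requested while the pre-filter path always fills them.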

Scalar quantization in production: enabling INT8 quantization (always_ram: true) keeps the quantized vectors in memory while the full-precision originals page to disk for re-scoring. At 1M vectors × 1536 dimensions, this takes you from ~6GB RAM for the index to ~1.5GB — the difference between needing a 32GB server and a 16GB server.
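The idea behind scalar quantization fits in a few lines. This is a simplified min/max sketch with hypothetical helper names, not Qdrant's implementation (which chooses its bounds differently); it only shows where the 4× saving and the small precision loss come from.

```python
import random
from array import array


def quantize_int8(vec):
    """Map each float onto one of 256 evenly spaced levels between min and max."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 for a constant vector
    codes = array("B", (round((x - lo) / scale) for x in vec))
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Approximate reconstruction of the original floats."""
    return [lo + c * scale for c in codes]


rng = random.Random(0)
vec = [rng.uniform(-1, 1) for _ in range(1536)]

codes, lo, scale = quantize_int8(vec)
restored = dequantize(codes, lo, scale)

float_bytes = len(vec) * array("f").itemsize  # 4 bytes per float32 component
int8_bytes = len(codes) * codes.itemsize      # 1 byte per quantized component
max_error = max(abs(a - b) for a, b in zip(vec, restored))
print(f"{float_bytes} -> {int8_bytes} bytes, max error {max_error:.4f}")
```

Each component can be off by at most half a quantization step, which is why re-scoring the top candidates against the full-precision originals recovers most of the lost recall.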

Distributed mode: Qdrant supports horizontal scaling via Raft-based consensus, replicated collections, and per-shard writes. You won’t need this for most local RAG deployments, but it means you can scale out later without rewriting your application layer.

When Qdrant is right: any app with real users, multi-tenant document access, metadata-heavy filtering, or more than 1M vectors. Also the right choice when you don’t have Postgres and don’t want to stand up a full relational database just for vector storage.

Head-to-Head Benchmarks

The following numbers are from community benchmarks across multiple 2025–2026 sources, using OpenAI text-embedding-3-small (1536 dimensions) on a dataset of 100k documents, running on an 8-core server with 32GB RAM.

Metric	pgvector (HNSW)	Chroma	Qdrant (INT8)
Setup time	~5 min (if Postgres exists)	~2 min	~3 min (Docker)
Ingestion, 10k docs	~20s (includes WAL + index update)	~5s	~8s
Ingestion, 100k docs	~4–8 min	~45–90s	~60–90s
Query p95, simple cosine	~35ms	~45ms	~15ms
Query p95, filtered	~60ms (post-filter)	~90ms (Python-side)	~22ms (pre-filter)
RAM at rest, 100k × 1536d	~600MB (float32 + index)	~620MB (float32)	~160MB (INT8 quant)
Disk footprint, 100k docs	~900MB (Postgres + WAL)	~560MB (SQLite + HNSW files)	~210MB (compressed segments)
Concurrent query support	Postgres MVCC (excellent)	Single-process (poor)	Per-collection locks (good)
Max tested scale (reliable)	~50M vectors	~500k vectors	100M+ vectors

Sources: Nirant Kasliwal’s 1M OpenAI benchmark, callsphere.ai 2026 vector DB benchmarks, Qdrant official benchmarks at qdrant.tech/benchmarks, Crunchy Data pgvector HNSW writeup. Numbers are approximate; your hardware and workload will differ.
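Because these figures will not transfer directly to your hardware, it is worth measuring p95 yourself. A minimal sketch, assuming a nearest-rank percentile and a hypothetical run_query callable that you replace with a real query against your own store:

```python
import time


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = -(-95 * len(ordered) // 100)  # ceil(0.95 * n) using integer math
    return ordered[rank - 1]


def measure_p95_ms(run_query, n_runs=200):
    """Call run_query() n_runs times and return the p95 wall-clock latency in ms."""
    latencies = []
    for _ in range(n_runs):
        start = time.perf_counter()
        run_query()
        latencies.append((time.perf_counter() - start) * 1000)
    return p95(latencies)


# Stand-in workload; swap the lambda for a real query call.
print(f"p95: {measure_p95_ms(lambda: sum(range(10_000))):.3f} ms")
```

Run it with your real embedding dimension, filter, and result limit, and discard the first few calls if you want to exclude cold-cache effects.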

The filtered query gap is the one that catches teams off guard. Going from 35ms to 60ms (pgvector) versus 15ms to 22ms (Qdrant) looks like a small absolute difference. At 200 concurrent users, though, each filtered pgvector query holds a connection almost three times as long as the Qdrant equivalent, so queues build up that much sooner.
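A back-of-the-envelope calculation shows how per-query latency turns into a throughput ceiling. It is a rough sketch under stated assumptions: max_qps is a hypothetical helper, each worker handles one query at a time, the worker count of 8 mirrors the 8-core test server, and the p95 figures from the table are used as if they were the service time, which is pessimistic.

```python
def max_qps(workers: int, latency_s: float) -> float:
    """Upper bound on throughput when each worker handles one query at a time."""
    return workers / latency_s


WORKERS = 8  # assumption: one in-flight query per core

for name, latency_ms in (("pgvector filtered", 60), ("Qdrant filtered", 22)):
    print(f"{name}: ~{max_qps(WORKERS, latency_ms / 1000):.0f} queries/s")
```

Under these assumptions the ceiling is roughly 133 queries per second at 60ms and 364 at 22ms; once arrivals exceed the ceiling, requests queue and observed latency grows well beyond the raw query time.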

What Each One Fails At

pgvector does not handle:

Datasets above 50M vectors (index builds become multi-hour operations)
High-cardinality metadata filtering (recall drops with selective WHERE filters on HNSW)
Memory-constrained environments (HNSW indexes must fit in RAM to perform)
Teams without Postgres operational experience

Chroma does not handle:

Concurrent production traffic (single-process, reports of memory leaks under load)
Large datasets (10M vectors require about 61GB of RAM in float32, with no quantization path)
Multi-user access control (no auth at the collection level in the open-source version)
Any horizontal scaling requirement

Qdrant does not handle:

Structured relational queries (you still need a regular database for your app data)
Extremely simple use cases (adding Docker to your stack for 5k documents is overkill)
In-process embedding (you call your embedding model separately; Qdrant stores vectors, not documents in the Chroma sense)
Teams unwilling to manage a separate service

For a full breakdown of the retrieval architectures behind all of these, see the RAG Architecture Deep Dive.

The Decision Algorithm

You have PostgreSQL running already → use pgvector. The zero-infra argument is real. Most RAG apps don’t need 100M vectors or sub-20ms filtered search. If your load grows, you can add pgvectorscale or migrate to Qdrant later.

You’re building a prototype, demo, or internal script → use Chroma. Fast iteration, no server to manage, runs in a Jupyter notebook. Swap it out before going to production.

You’re building a multi-tenant app, need metadata filtering, or are dealing with more than 1M docs → use Qdrant. The payload index and quantization features justify the extra Docker container.

You need GPU-accelerated inference alongside the vector store (e.g., running your own embedding model on a local GPU): the choice between these databases doesn’t change, but you’ll want to look at GPU server options on runaihome.com for hardware specs that support both the inference workload and the database memory requirements.
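The rules above can be condensed into a small function. This is a sketch: choose_vector_store and its parameters are hypothetical names, the thresholds are the ones quoted in this article, and the precedence between rules (selective filtering wins over an existing Postgres) is an assumption drawn from the FAQ rather than something the rules state explicitly.

```python
def choose_vector_store(
    has_postgres: bool,
    is_prototype: bool,
    n_vectors: int,
    selective_filters: bool,
) -> str:
    """Return 'chroma', 'pgvector', or 'qdrant' following the article's rules of thumb."""
    # Prototypes and notebooks below the ~500k-document comfort zone.
    if is_prototype and n_vectors < 500_000:
        return "chroma"
    # Multi-tenant or metadata-heavy filtering, or beyond pgvector's ~50M ceiling.
    if selective_filters or n_vectors >= 50_000_000:
        return "qdrant"
    # Existing Postgres and a dataset it can handle: no new infrastructure.
    if has_postgres:
        return "pgvector"
    # No Postgres to reuse: a dedicated vector store is the smaller addition.
    return "qdrant"


print(choose_vector_store(True, False, 2_000_000, False))
print(choose_vector_store(True, False, 2_000_000, True))
print(choose_vector_store(False, True, 20_000, False))
```

Treat the output as a starting point for discussion; team skills and existing operational tooling can reasonably override any of the branches.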

If you’re comparing how the RAG framework layer (LangChain, LlamaIndex, Haystack) integrates with these databases, that’s covered separately in the LangChain vs LlamaIndex vs Haystack comparison.

FAQ

Can pgvector replace Qdrant for a multi-tenant RAG app? Only if your filter cardinality is low and your dataset stays under 10M vectors. The WHERE-clause post-filtering on pgvector HNSW means selective filters (e.g., WHERE user_id = 12345) substantially reduce recall because the HNSW graph traversal happens before the filter is applied. Qdrant’s payload index handles this at the index level and maintains recall.

Is Chroma safe for production? Not for concurrent multi-user traffic. Chroma’s own documentation recommends the embedded mode for development. The server mode is more stable in 1.5.x than it was in 0.x, but it still lacks sharding, native auth, and horizontal scaling. Use it for internal tools where a crash and restart is acceptable, not for user-facing products.

Does Qdrant require a GPU? No. Qdrant is a pure vector database — it stores and retrieves embeddings, but doesn’t generate them. You generate embeddings with your model (which may use a GPU), then store the float vectors in Qdrant. The Qdrant process itself runs comfortably on CPU-only servers.

How does pgvector compare to dedicated vector databases at 1M vectors? Competitive, but with caveats. With an HNSW index, pgvector at 1M vectors delivers p95 latency in the 30–50ms range, which is close to Qdrant’s 15–30ms range. The gap widens significantly at 10M+ vectors and for filtered queries. The Timescale pgvectorscale extension improves throughput substantially at 50M vectors but requires an additional dependency.

Which one should I start with if I’m new to local RAG? Chroma for your first prototype: it’s the lowest-friction path to a working RAG pipeline. Once you understand what you’re building and how much data you’re working with, switch to pgvector or Qdrant based on the criteria above. Both have strong Python client libraries and integrate with LangChain and LlamaIndex.
