Vector Databases & Embeddings: The Memory Behind Private AI

When a private AI answers from your documents, two quiet pieces do the heavy lifting: embeddings, which turn your text into numbers that capture meaning, and a vector database, which stores those numbers and searches them by meaning instead of by keyword. This is the memory layer behind a private RAG system. Here is what each one actually does, plus an honest framework for choosing among pgvector, Qdrant, and Chroma, all of which run on hardware you own.


What an embedding is (plain English)

An embedding is a list of numbers that captures the meaning of a piece of text. An embedding model reads a sentence — or a paragraph from one of your PDFs — and turns it into a fixed-length list of numbers, often a few hundred to a few thousand of them. The trick is that text with similar meaning produces similar numbers, even when the words are completely different.

So "the invoice is overdue" and "this bill is past due" land close together, while "the invoice is overdue" and "schedule a team lunch" land far apart. The model has placed each passage at a point in a space where distance means difference in meaning. Once your documents live in that space as numbers, a computer can find the passages closest to a question — search by meaning, not by matching exact words. That is the whole foundation that makes a private assistant able to answer from your files.
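To make "close together" and "far apart" concrete, here is a minimal sketch using cosine similarity, one common way to compare embeddings. The three-number vectors are hand-made for illustration and did not come from any model; a real embedding model produces hundreds or thousands of numbers per passage.

```python
import math


def cosine_similarity(a, b):
    """Near 1.0 means the vectors point the same way; near 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", invented for this example only.
toy_embeddings = {
    "the invoice is overdue": [0.9, 0.1, 0.0],
    "this bill is past due": [0.8, 0.2, 0.1],
    "schedule a team lunch": [0.0, 0.2, 0.9],
}

query = toy_embeddings["the invoice is overdue"]
scores = {
    text: cosine_similarity(query, vector)
    for text, vector in toy_embeddings.items()
}

for text, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{score:.2f}  {text}")
```

With these made-up numbers, the two billing sentences score close to 1 against each other while the lunch sentence scores close to 0, which is the behavior a real model is trained to produce.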

What a vector database does — store, search, filter

A vector database is built to do three things well with those embeddings:

Store — hold millions of embedding vectors alongside the original text chunk and its metadata (which document, which page, which date).
Search by meaning — take a question, embed it, and return the closest chunks fast, using an approximate-nearest-neighbor index (commonly HNSW) so it stays quick even on a large collection.
Filter — narrow the search by metadata, so a query only looks inside one client's folder, one document type, or a date range.

That retrieval step is the engine room of RAG: pull the most relevant chunks, hand them to the local model, and get a grounded, citable answer. Strong metadata filtering is also what keeps a shared assistant from leaking one team's documents into another team's results. It is the in-house counterpart to the privacy you get from keeping the whole stack on-premises.
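The store, search, and filter jobs can be sketched in a few lines. This is a hypothetical in-memory stand-in (the `TinyVectorStore` class and its method names are invented here), not the API of pgvector, Qdrant, or Chroma. It scans every chunk by brute force, where a real database would use an approximate-nearest-neighbor index such as HNSW.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity: higher means closer in meaning."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


@dataclass
class Chunk:
    text: str
    vector: list
    metadata: dict = field(default_factory=dict)


class TinyVectorStore:
    """Hypothetical brute-force store, for illustration only."""

    def __init__(self):
        self.chunks = []

    def store(self, text, vector, **metadata):
        # Keep the vector next to its original text and metadata.
        self.chunks.append(Chunk(text, vector, metadata))

    def search(self, query_vector, top_k=3, where=None):
        where = where or {}
        # Filter first, so chunks outside the filter are never even scored.
        candidates = [
            c for c in self.chunks
            if all(c.metadata.get(key) == value for key, value in where.items())
        ]
        candidates.sort(key=lambda c: cosine(query_vector, c.vector), reverse=True)
        return candidates[:top_k]


db = TinyVectorStore()
db.store("Invoice 114 is 30 days overdue.", [0.9, 0.1, 0.0], client="acme", doc_type="invoice")
db.store("Team lunch is on Friday.", [0.0, 0.2, 0.9], client="acme", doc_type="memo")
db.store("Invoice 77 is past due.", [0.8, 0.2, 0.1], client="globex", doc_type="invoice")

# Stand-in for an embedded question about late invoices (toy numbers).
question_vector = [0.85, 0.15, 0.05]
hits = db.search(question_vector, top_k=2, where={"client": "acme"})
```

Because the filter runs before scoring, the globex invoice never appears in an acme query, even though its vector is a close match to the question.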

pgvector vs Qdrant vs Chroma — when to pick which

All three are open-source and self-hostable, so any of them can run on a server you own. They differ in how they fit your existing stack and how far they scale. Scale figures below are community/vendor estimates — directional, not guarantees, and worth verifying against your own data.

Vector store	Setup	Scale ceiling (approx.)	Filtering	Best for
pgvector	Extension on existing PostgreSQL	Comfortable into the low millions of vectors; verify per workload	Full SQL — filter and JOIN against your app data	Already on Postgres; one database for app + AI search
Qdrant	Dedicated service (Docker/binary)	Designed for large, growing indexes with quantization	Rich payload filtering built for vector search	Larger or fast-growing collections; strong price-performance self-host
Chroma	Lightweight, embeds in your app	Best for smaller sets and prototypes	Metadata filtering, simple API	Prototyping and smaller datasets; fast to start

Scale ceilings are approximate community/vendor estimates and depend on vector dimensions, filtering, and hardware. Re-verify against your own collection before treating any threshold as fixed.

The honest short version

If you are already running PostgreSQL, start with pgvector — keeping one database for both your application data and your AI search removes a whole moving part, and it handles most small-business document collections without breaking a sweat. Some comparisons suggest pgvector starts to strain in the low millions of vectors on a single instance, but that is an approximate, hardware-dependent figure to verify, not a hard wall.

Reach for Qdrant when the index is large or growing fast, you need heavy metadata filtering, or you want quantization to keep a big collection on affordable hardware. Pick Chroma when you want to prototype quickly or the dataset is small. There is no universally "best" choice — only the right one for your data volume and the systems you already run.

Self-hosted embedding models compared

The database stores the vectors; an embedding model creates them. These open models all run locally, so the text being embedded never leaves your network. Specs move quarter to quarter — confirm exact params, dimensions, and context on each model card at build time.

Model	Size (approx.)	Dimensions	Notable	Best for
nomic-embed-text	~100–140M params	768 (Matryoshka, can shrink)	Long context; CPU-feasible	A strong, light default; runs without a GPU
bge-m3	Larger; benefits from a GPU	1024	Multilingual; dense + sparse retrieval	Multilingual collections, hybrid search
mxbai-embed-large	Mid-size	1024	Balanced size vs quality	When you want more quality than a tiny model
all-mpnet-base-v2	Small, mature	768	Long-standing English baseline	Simple English-only sets; well-understood

Parameter counts, dimensions, and context lengths above are approximate and current to 2025–2026; verify on each model's card before deploying. A practical rule: the embedding model you index with and the one you query with must match, because each model places text in its own space and vectors from two different models are not comparable.
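One way to enforce the matching rule is to record which model built the index and check it before every query. The sketch below is an assumption about how such a guard might look; `EmbeddingSpec` and `assert_same_model` are hypothetical names, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpec:
    model_name: str
    dimensions: int


def assert_same_model(index_spec, query_spec):
    """Raise if the query would be embedded differently from the stored vectors."""
    if index_spec != query_spec:
        raise ValueError(
            f"index built with {index_spec.model_name} ({index_spec.dimensions} dims), "
            f"query embedded with {query_spec.model_name} ({query_spec.dimensions} dims)"
        )


index_spec = EmbeddingSpec("bge-m3", 1024)

# Same model on both sides: passes silently.
assert_same_model(index_spec, EmbeddingSpec("bge-m3", 1024))

try:
    # Same 1024 dimensions, different model: sizes line up, meanings do not.
    assert_same_model(index_spec, EmbeddingSpec("mxbai-embed-large", 1024))
    mismatch_caught = False
except ValueError:
    mismatch_caught = True
```

Comparing the model name as well as the dimension count matters here: two models in the table above both produce 1024 numbers, so a dimension check alone would let the mismatch through and the search would quietly return poor results.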

Quantization & scale — staying on hardware you own as you grow

Vectors take up memory. A collection that is comfortable at tens of thousands of chunks can get heavy at tens of millions, and the natural temptation is to rent a managed cloud service to absorb it. Quantization is what lets you avoid that and keep the index on your own box. It compresses each stored vector to use far less memory — for example with binary or scalar quantization — usually with a small, manageable hit to retrieval quality that good systems claw back by re-scoring the top candidates at full precision.
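The compress-then-re-score idea can be sketched as a two-stage search. This is a simplified illustration of sign-based binary quantization, not Qdrant's implementation; the function names and the `oversample` parameter are assumptions made for the example, and the dot product stands in for whatever full-precision score a real system uses.

```python
def binarize(vector):
    """Keep only the sign of each dimension: one bit instead of a full float."""
    return [1 if x > 0 else 0 for x in vector]


def hamming(a, b):
    """Number of positions where two bit lists differ (lower is closer)."""
    return sum(x != y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def search_with_rescoring(query, vectors, top_k=1, oversample=3):
    """Return indices of the best matches, best first."""
    query_bits = binarize(query)
    codes = [binarize(v) for v in vectors]  # a real system computes these once, at index time

    # Stage 1: cheap, coarse ranking on the compressed codes, keeping extra candidates.
    shortlist = sorted(
        range(len(vectors)), key=lambda i: hamming(query_bits, codes[i])
    )[: top_k * oversample]

    # Stage 2: re-score only the shortlist with the full-precision vectors.
    return sorted(shortlist, key=lambda i: dot(query, vectors[i]), reverse=True)[:top_k]


query = [0.9, 0.1, -0.2, 0.4]
vectors = [
    [0.1, 0.9, -0.1, 0.1],    # same sign pattern as the query, weak real match
    [0.8, 0.2, -0.3, 0.5],    # same sign pattern as the query, strong real match
    [-0.5, -0.5, 0.5, -0.5],  # opposite of the query
    [0.7, -0.1, -0.2, 0.3],   # one sign differs
]
best = search_with_rescoring(query, vectors, top_k=1)
```

The first two vectors are indistinguishable after compression, since both have the same sign pattern as the query. Re-scoring the shortlist at full precision is what separates them and puts the stronger match first.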

Qdrant in particular leans on quantization to keep large indexes on affordable hardware, and publishes specific compression figures (for instance, how few bytes a high-dimension vector can shrink to under binary quantization). Those exact byte numbers are vendor figures — we treat them as directional and verify against current Qdrant documentation before quoting them to a client, rather than repeating a precise claim that may have changed.

The practical upshot for an owner: growing your document set does not have to mean moving your data to someone else's cloud. The right combination of database, index, and quantization keeps a large, fast collection running on a local LLM server you control.
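To put rough numbers on the memory pressure described in this section, here is a back-of-envelope calculation. It is plain arithmetic rather than a vendor figure, and it assumes vectors are stored uncompressed as 32-bit floats, that scalar quantization uses one byte per dimension, and that binary quantization uses one bit per dimension. It counts the vector values only and ignores index structures, metadata, and the original text, so real footprints will be larger.

```python
# Assumed storage cost per dimension for each representation.
BITS_PER_DIMENSION = {
    "float32 (uncompressed)": 32,
    "scalar int8": 8,
    "binary": 1,
}


def raw_vector_bytes(num_vectors, dimensions, bits_per_dimension):
    """Bytes for the vector values alone: no index graph, metadata, or source text."""
    return num_vectors * dimensions * bits_per_dimension // 8


def footprint_table(num_vectors, dimensions):
    return {
        name: raw_vector_bytes(num_vectors, dimensions, bits)
        for name, bits in BITS_PER_DIMENSION.items()
    }


# Example collection size chosen for illustration: ten million 1024-dimension vectors.
for name, size in footprint_table(10_000_000, 1024).items():
    print(f"{name:>24}: {size / 1e9:.2f} GB")
```

Under those assumptions, ten million 1024-dimension vectors need roughly 41 GB uncompressed, about 10 GB at one byte per dimension, and about 1.3 GB at one bit per dimension. That gap is why quantization can decide whether an index fits in the memory of a server you already own.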

How we choose the stack for your data volume

There is no default we paste onto every project. We size the memory layer to your real documents by working through a short checklist.

1. How many documents, and how fast are they growing?

A few thousand files and a few million words fit comfortably in pgvector or Chroma; tens of millions of chunks with steady growth point toward a dedicated store with quantization.

2. Are you already running PostgreSQL?

If so, pgvector keeps one database for app data and AI search — fewer moving parts, easy SQL filtering. We only add a separate store when scale justifies it.

3. How much filtering do you need?

Per-client, per-document-type, or date-range filtering at scale favors Qdrant's payload filtering; lighter needs are fine in pgvector or Chroma.

4. One language or many?

Multilingual collections push us toward a model like bge-m3; English-only sets do well on a lighter model that can even run on CPU.

5. GPU or CPU for embedding?

A CPU-feasible model like nomic-embed-text keeps the bill of materials down; heavier models earn a GPU when quality demands it.

6. Where must the data live?

Always on hardware you own. Every option here is self-hosted, so the text is embedded and searched inside your building, never sent to a third party.
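The checklist above can be written down as a rough first-pass rule. This sketch is a deliberate simplification: the `Workload` fields, the `suggest_stack` function, and especially the `large_threshold` default are hypothetical placeholders, not measured limits, and a real sizing exercise weighs these answers together instead of following a fixed branch. Question 6 does not appear as a branch because every option returned is self-hosted.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    approx_vectors: int
    growing_fast: bool
    already_on_postgres: bool
    heavy_filtering: bool
    multilingual: bool
    has_gpu: bool
    needs_extra_quality: bool = False


def suggest_stack(w, large_threshold=1_000_000):
    """Return (vector_store, embedding_model) as a first guess, not a final answer.

    large_threshold is a placeholder assumption; verify against your own data.
    """
    # Questions 1-3: volume, growth, existing Postgres, filtering needs.
    if w.approx_vectors >= large_threshold or w.growing_fast or w.heavy_filtering:
        store = "Qdrant"
    elif w.already_on_postgres:
        store = "pgvector"
    else:
        store = "Chroma"

    # Questions 4-5: languages, and whether a GPU is available and justified.
    if w.multilingual:
        model = "bge-m3"
    elif w.needs_extra_quality and w.has_gpu:
        model = "mxbai-embed-large"
    else:
        model = "nomic-embed-text"

    return store, model


small_office = Workload(50_000, False, True, False, False, False)
growing_firm = Workload(20_000_000, True, True, True, True, True)

print(suggest_stack(small_office))
print(suggest_stack(growing_firm))
```

With these sample inputs, the small English-only collection already on PostgreSQL lands on pgvector with nomic-embed-text, and the large, fast-growing multilingual collection lands on Qdrant with bge-m3.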

This is the memory layer underneath the wider knowledge stack — pair it with RAG for business, a private AI chatbot, and document automation to put it to work.

We pick the database and the model, then build it here in Texas

You should not have to settle the pgvector vs Qdrant vs Chroma question on your own. We size the memory layer to your real document volume, pick the embedding model to match, and install it on a server you own, on-site across Houston, Katy, Sugar Land, and the Fort Bend area. After the install, we stay on call. See our Texas service areas.

Vector database & embedding questions

What is an embedding, in plain English?

An embedding is a list of numbers that captures the meaning of a piece of text. Two passages that mean similar things end up with similar numbers, so a computer can find related text even when the exact words differ — that is what makes search by meaning possible.

What does a vector database actually do?

It stores embeddings and searches them by meaning instead of by exact keyword. When a question comes in, it turns the question into an embedding and returns the chunks of your documents whose embeddings are closest — the retrieval step that grounds a RAG answer in your data.

Which vector database will you use for us: pgvector, Qdrant, or Chroma?

It depends on scale and what you already run. If you are already on PostgreSQL, pgvector keeps everything in one database and is plenty for most small-business collections. For larger or fast-growing indexes that need strong filtering and memory-saving quantization, we reach for Qdrant. Chroma is great for prototyping and smaller sets. We pick based on your data volume, not a default.

When is pgvector not enough?

pgvector is excellent until your collection grows large enough that index size and query latency start to strain a single Postgres instance. Some community comparisons put that loose threshold in the low millions of vectors, but it is approximate and depends on dimensions, filtering, and hardware — verify against your own data before assuming a number. Past that point a dedicated store like Qdrant usually scales more comfortably.

Can the vector database and embedding model run on hardware we own?

Yes — that is the whole point. pgvector, Qdrant, and Chroma are all open-source and self-hostable, and open embedding models like nomic-embed-text, bge-m3, and mxbai-embed-large run locally. The text never has to leave your building to be embedded or searched.

Does quantization hurt search quality?

Quantization compresses the stored vectors so a large index uses far less memory and stays on affordable hardware. There is usually a small, manageable quality trade-off, and good implementations re-score the top candidates at full precision to recover most of it. Qdrant publishes specific compression figures for binary quantization; treat any exact byte numbers as vendor figures to verify against current docs before quoting.

Next, see how the memory layer fits the whole pipeline in RAG for business, or go back to Business Automation.

Not sure which vector database your data needs?

Tell us how many documents you have and how they grow — we'll pick the database, the embedding model, and the hardware, and build a private memory layer you own outright.