pgvector + PostgreSQL: Production-Ready Semantic Search Without a Vector DB

Most teams jump to a dedicated vector database the moment someone says “semantic search.” I get why. It sounds modern, and it feels like the safe enterprise choice. But for a huge percentage of real apps, that move is premature.
If your product already runs on PostgreSQL, adding pgvector is often the fastest way to ship useful semantic search without adding another production system to babysit.
This isn’t a “toy demo” argument. This is a production argument: fewer moving parts, fewer points of failure, and better team velocity while you’re still finding product-market fit for search itself.
The Real Decision
The wrong question is: “What’s the most advanced vector architecture?”
The right question is: “What architecture gets us reliable retrieval quality this quarter, with the team we actually have?”
For many products, PostgreSQL + pgvector wins early because:
- You already have migrations, backups, monitoring, and access controls.
- Your team already knows how to run Postgres.
- Your app data and embeddings can live in one transactional system.
- You can avoid sync drift between primary DB and vector DB.
You’re trading some extreme-scale headroom for execution speed and operational simplicity. Early on, that’s usually the right trade.
Mental Model: Start Inside the House
Think of semantic search like adding a new room to a house you already live in.
- Postgres + pgvector: remodel one room, no new property taxes.
- Dedicated vector DB: buy a second house and manage two mortgages.
At very large scale, the second house might be worth it. But if you’re still validating relevance quality, query patterns, and user behavior, own less infrastructure first.
When pgvector Is a Great Fit
pgvector is usually strong when:
- Your corpus is in the low millions of documents (or less).
- You need hybrid retrieval (metadata filters + vector similarity).
- You care about transactional consistency between app writes and embeddings.
- Your platform team is small.
It’s less ideal when:
- You need ultra-low-latency search at very high concurrent query volume.
- Your embedding corpus is exploding rapidly into very large scale.
- You need highly specialized ANN tuning beyond what your team can maintain in Postgres.
That last bullet matters: a technically superior system is still a bad choice if nobody can operate it confidently.
Baseline Schema (Production-Friendly)
Use a table that keeps retrieval inputs and filtering metadata close together.
```sql
create extension if not exists vector;

create table documents (
  id uuid primary key default gen_random_uuid(),
  tenant_id uuid not null,
  source_type text not null,
  title text not null,
  body text not null,
  language text not null default 'en',
  published_at timestamptz,
  -- 1536 for text-embedding-3-small, adjust per model
  embedding vector(1536),
  created_at timestamptz not null default now(),
  updated_at timestamptz not null default now()
);

create index documents_tenant_idx on documents (tenant_id);
create index documents_source_idx on documents (source_type);
create index documents_published_idx on documents (published_at desc);
```
Why this shape works:
- Multi-tenant access stays explicit (`tenant_id`).
- You can pre-filter before vector ranking (faster and cleaner relevance).
- You avoid scattering retrieval-critical metadata across multiple joins.
Indexing Strategy Without Getting Fancy Too Early
Start with a straightforward ANN index and measure before tuning.
```sql
create index documents_embedding_ivfflat
  on documents using ivfflat (embedding vector_cosine_ops)
  with (lists = 100);
```
Practical guidance:
- `lists` is not “bigger is always better.” It’s a recall/latency dial.
- At query time, `ivfflat.probes` controls how many lists are scanned; raising it trades latency for recall.
- Revisit index params with real query logs, not guesswork.
- Tune in stages: baseline -> profile -> adjust.
If your team immediately starts hand-tuning ANN internals without quality benchmarks, you’re optimizing blind.
Write Path: Keep Embeddings Deterministic
A common mistake is treating embedding generation as an ad-hoc side effect. Treat it as a first-class pipeline.
```typescript
async function upsertDocument(input: {
  id?: string;
  tenantId: string;
  title: string;
  body: string;
  sourceType: "docs" | "faq" | "ticket";
  language?: string;
}) {
  const textForEmbedding = `${input.title}\n\n${input.body}`;
  const embedding = await embeddingClient.embed({
    model: "text-embedding-3-small",
    input: textForEmbedding,
  });

  // pgvector expects a '[x,y,z]' literal; a raw JS array would be sent
  // as a Postgres array ('{...}') and fail to cast to vector.
  const embeddingLiteral = `[${embedding.join(",")}]`;

  await db.query(
    `
    insert into documents (id, tenant_id, source_type, title, body, language, embedding)
    values (coalesce($1::uuid, gen_random_uuid()), $2, $3, $4, $5, coalesce($6, 'en'), $7::vector)
    on conflict (id)
    do update set
      title = excluded.title,
      body = excluded.body,
      language = excluded.language,
      embedding = excluded.embedding,
      updated_at = now()
    `,
    [
      input.id ?? null,
      input.tenantId,
      input.sourceType,
      input.title,
      input.body,
      input.language ?? "en",
      embeddingLiteral,
    ]
  );
}
```
Key production habits:
- Version your embedding model and store that version if you can.
- Re-embed consistently on material content changes.
- Queue retries for transient embedding API failures.
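The retry habit can live in small, boring helpers. A sketch assuming HTTP-style status codes from the embedding API; the base delay and cap are illustrative, not prescriptive:

```typescript
// 429 (rate limit) and 5xx are usually transient; other 4xx are not
// worth retrying, since the request itself is likely malformed.
function isTransient(status: number): boolean {
  return status === 429 || (status >= 500 && status < 600);
}

// Exponential backoff with a cap: 500ms, 1s, 2s, 4s, ... up to maxMs.
function backoffDelaysMs(attempts: number, baseMs = 500, maxMs = 8000): number[] {
  return Array.from({ length: attempts }, (_, i) => Math.min(baseMs * 2 ** i, maxMs));
}
```

Wire these into whatever queue or job runner you already have; the point is that retry policy is explicit code, not an accident of whichever caller happened to embed the document.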
Read Path: Filter First, Rank Second
Use metadata filters before similarity ranking when possible.
```sql
select
  id,
  title,
  source_type,
  1 - (embedding <=> $1::vector) as similarity
from documents
where tenant_id = $2
  and source_type = any($3::text[])
  and published_at is not null
order by embedding <=> $1::vector
limit 20;
```
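For intuition: `<=>` is cosine distance, which is why the query reports `1 - distance` as similarity. The same math in TypeScript (a standalone sketch, not a pgvector API):

```typescript
// Cosine distance, matching what pgvector's `<=>` operator computes
// for vector_cosine_ops.
function cosineDistance(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Similarity as reported by the query: 1 for identical direction,
// 0 for orthogonal, -1 for opposite.
const similarity = (a: number[], b: number[]) => 1 - cosineDistance(a, b);
```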
Then do lightweight reranking in application logic if needed:
- Promote freshness for time-sensitive content.
- Add business rules (e.g., official docs > forum posts).
- Collapse near-duplicate results.
That two-step pattern is usually enough before you need a separate retrieval stack.
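A minimal version of that reranking step. The source boosts and the freshness decay here are assumed placeholders to tune against your eval set, not recommended values:

```typescript
type Hit = { id: string; similarity: number; sourceType: string; publishedAt?: Date };

// Illustrative business-rule weights: official docs over forum-style content.
const SOURCE_BOOST: Record<string, number> = { docs: 0.1, faq: 0.05, ticket: 0 };

function rerank(hits: Hit[], now: Date = new Date()): Hit[] {
  const score = (h: Hit) => {
    const boost = SOURCE_BOOST[h.sourceType] ?? 0;
    // Small freshness bump that decays to zero over ~90 days;
    // undated content gets no bump.
    const ageDays = h.publishedAt ? (now.getTime() - h.publishedAt.getTime()) / 86_400_000 : 365;
    const freshness = Math.max(0, 0.05 * (1 - ageDays / 90));
    return h.similarity + boost + freshness;
  };
  return [...hits].sort((a, b) => score(b) - score(a));
}
```

Keeping this in application code (rather than SQL) makes the weights easy to iterate on during the eval phases described below.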
Retrieval Quality Guardrails (What Actually Improves UX)
Most semantic-search failures are not infra failures. They’re retrieval-quality failures.
Add these guardrails early:
- Query set for evaluation
  - Build 30-50 real user-style queries.
  - Track top-3 and top-10 relevance manually at first.
- Gold answers for critical queries
  - Define what a “good result” looks like for high-intent queries.
- Change budget
  - Don’t change embedding model + chunking + index params in one release.
  - Isolate variables so you can attribute regressions.
- Hard filters for trust
  - Don’t let semantically similar but unauthorized data leak across tenants.
Infrastructure is easy to argue about. Relevance quality is what users feel.
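The eval-set guardrail is cheap to automate. A sketch of a recall-at-k check, where `search` stands in for whatever function wraps your SQL retrieval and returns ranked document ids:

```typescript
type EvalCase = { query: string; expectedIds: string[] };

// Fraction of eval queries whose expected document shows up in the top-k
// results. Run this on every retrieval-affecting change.
function recallAtK(
  cases: EvalCase[],
  search: (q: string) => string[],
  k: number
): number {
  const hits = cases.filter((c) => {
    const topK = search(c.query).slice(0, k);
    return c.expectedIds.some((id) => topK.includes(id));
  });
  return hits.length / cases.length;
}
```

With 30-50 queries this runs in seconds, which is what makes the “change budget” rule enforceable: one variable per release, one recall number per variable.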
Production Concerns You Should Decide Up Front
1) Chunking policy
Semantic quality depends heavily on chunk boundaries.
- Documentation: chunk by headings + paragraph groups.
- Support tickets: chunk by message turns with speaker context.
- Avoid tiny fragments that lose meaning.
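A heading-based chunker for documentation can be quite small. This sketch assumes markdown-style `#` headings and merges fragments below a minimum size into the previous chunk, so tiny slivers never get embedded on their own:

```typescript
// Minimal heading-based chunker (illustrative, not production-hardened).
// Splits on `#`-prefixed headings, keeping each heading with its body.
function chunkByHeadings(doc: string, minChars = 200): string[] {
  const parts = doc.split(/\n(?=#{1,6} )/);
  const chunks: string[] = [];
  for (const part of parts) {
    const trimmed = part.trim();
    if (!trimmed) continue;
    // Merge undersized fragments into the previous chunk.
    if (chunks.length > 0 && trimmed.length < minChars) {
      chunks[chunks.length - 1] += "\n\n" + trimmed;
    } else {
      chunks.push(trimmed);
    }
  }
  return chunks;
}
```

The same shape works for ticket threads if you split on message turns instead of headings and prepend speaker context to each chunk.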
2) Re-embedding policy
Set explicit triggers:
- Major body edits -> re-embed required.
- Minor typo edits -> optional.
- Model upgrade -> scheduled bulk re-embed.
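Those triggers are easy to encode as an explicit policy function rather than tribal knowledge. The 20-character threshold below is an assumption; pick your own deliberately:

```typescript
type Change = { field: "title" | "body" | "language"; charsChanged: number };

// Encodes the triggers above: model upgrades always re-embed; material
// edits to retrieval-relevant text re-embed; minor typo fixes do not.
function shouldReembed(changes: Change[], modelUpgraded: boolean): boolean {
  if (modelUpgraded) return true;
  return changes.some(
    (c) => (c.field === "body" || c.field === "title") && c.charsChanged > 20
  );
}
```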
3) Observability
Track these metrics from day one:
- Query latency p50/p95
- Embedding generation latency and failure rate
- Recall proxy (how often expected docs appear in top-k)
- Click-through or downstream task success
If you only monitor infra metrics and ignore relevance outcomes, you won’t catch “looks healthy, feels bad” failures.
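For the latency metrics, a nearest-rank percentile over raw samples is enough to start; no metrics library required (sketch):

```typescript
// Nearest-rank percentile over recorded latency samples (ms).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}
```

Compute it over a sliding window of query and embedding latencies, and alert on p95 rather than the mean, which hides tail pain.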
What a Sensible Rollout Looks Like
Don’t big-bang semantic search.
Phase 1: Internal-only queries
- Use staff queries against staging data.
- Validate tenant filters and obvious relevance.
Phase 2: Read-only beta in production
- Show semantic results alongside existing keyword search.
- Compare click behavior.
Phase 3: Gradual primary ranking shift
- Route a small percentage of users to semantic-first ranking.
- Keep rollback simple.
Phase 4: Optimization only after behavior is clear
- Tune ANN settings.
- Add reranking logic.
- Evaluate whether you truly need external vector infra.
This sequence protects users while keeping momentum.
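Phase 3’s percentage routing should be deterministic per user, so the same person doesn’t flip between rankings across sessions. One sketch using a toy (non-cryptographic) string hash:

```typescript
// Deterministic rollout bucketing: hash the user id into [0, 100).
// Stable across sessions; rollback is just setting rolloutPercent to 0.
function inSemanticFirstCohort(userId: string, rolloutPercent: number): boolean {
  let hash = 0;
  for (const ch of userId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple multiplicative hash
  }
  return hash % 100 < rolloutPercent;
}
```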
When Not to Use pgvector
Be honest about limits.
Use a dedicated vector database sooner if:
- Query volume and latency SLOs are already strict.
- You have a large retrieval platform team.
- Your product is retrieval-heavy enough that vector infra is core, not supporting.
That’s not a failure of Postgres. It’s a maturity transition.
Final Checklist
Before shipping semantic search on Postgres, confirm:
- Tenant and permission filters are enforced in SQL.
- Embedding model and dimension are explicitly versioned.
- Re-embedding triggers are documented.
- Query-quality eval set exists.
- p95 latency and failure metrics are visible.
- Rollback path is tested.
If you can check these boxes, you’re not “hacking it together.” You’re running a disciplined retrieval system.
Closing
You probably don’t need Pinecone on day one. You need reliable retrieval that helps users find the right thing fast, and a system your team can operate without heroics.
For many teams, pgvector in PostgreSQL is exactly that system.
Ship it there first. Learn from real usage. Then earn the right to add infrastructure complexity when scale, latency, or product shape actually demands it.