Embeddings drift; treat them like any other dataset.
Look, I get it—shipping features is fun. But shipping correct pipelines is somehow considered optional.
The problem (a.k.a. where most designs go to die)
In production, data systems fail in boring ways:
- retries that create duplicates
- backfills that rewrite history (and then everyone argues about “the source of truth”)
- schema changes that arrive unannounced because someone added a column “real quick”
- caches (browser, CDN, query engine) that serve stale results while people swear “nothing changed”
If any of that sounds familiar, congratulations—you’re operating a real system.
A reference architecture that actually survives contact with reality
A reliable design usually has these properties:
- Deterministic inputs (you can describe exactly what this run consumes)
- Idempotent outputs (re-running produces the same end state)
- Versioned contracts at boundaries (schemas, keys, SLAs)
- Observable execution (metrics, lineage, logs with correlation)
Here’s the part everyone skips: you don’t get these by adding a dashboard. You get them by encoding invariants into the pipeline.
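To make "encoding invariants" concrete, here is a minimal sketch (the names Record and merge_batch are illustrative, not from any library): a merge that is last-writer-wins by (event time, sequence number), so re-applying the same batch leaves the end state unchanged.

# Invariant encoded in code: deterministic ordering + idempotent merge.
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    key: str          # primary key at the contract boundary
    event_time: int   # event time: the primary ordering rule
    seq: int          # source offset, used as a deterministic tie-breaker
    value: str

def merge_batch(state: dict[str, Record], batch: list[Record]) -> dict[str, Record]:
    """Last-writer-wins by (event_time, seq); re-applying a batch is a no-op."""
    out = dict(state)
    for rec in batch:
        current = out.get(rec.key)
        if current is None or (rec.event_time, rec.seq) > (current.event_time, current.seq):
            out[rec.key] = rec
    return out

# Idempotency in one line: merging the same batch twice equals merging it once.
# state = merge_batch({}, batch); assert merge_batch(state, batch) == state

Running it twice changes nothing, which is exactly the property the "just rerun it" conversation depends on.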
Implementation notes (things you’ll thank yourself for later)
- Commit metadata: record batch identifiers, source offsets, and the code version that produced the output.
- Monotonic event ordering: define ordering rules (event time + tie-breakers) and enforce them in merges.
- Backfill isolation: run backfills in separate compute pools / queues with explicit throughput caps.
- Contract enforcement: validate schema + keys at ingestion; quarantine bad records (don’t “fix” them silently).
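As a sketch of that last point, assuming records arrive as plain dicts and the contract is just required fields and types (real contracts carry more, like key uniqueness and SLAs), ingestion-time validation with quarantine can be this small:

REQUIRED = {"key": str, "event_time": int, "value": str}  # assumed contract

def validate(record: dict) -> list[str]:
    """Return contract violations; an empty list means the record passes."""
    errors = []
    for field, typ in REQUIRED.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], typ):
            errors.append(f"bad type for {field}: {type(record[field]).__name__}")
    return errors

def ingest(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split a batch into (accepted, quarantined); nothing gets 'fixed' silently."""
    accepted, quarantined = [], []
    for rec in records:
        errs = validate(rec)
        if errs:
            quarantined.append({"record": rec, "errors": errs})
        else:
            accepted.append(rec)
    return accepted, quarantined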
Concrete example
# Example: deterministic batch id + exactly-once-ish writes
import hashlib

def batch_id(run_date: str, dataset: str) -> str:
    """Stable id for (dataset, run_date): the same inputs always produce the same id."""
    return hashlib.sha256(f"{dataset}:{run_date}".encode()).hexdigest()

# Use batch_id as part of output path or commit metadata.
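And a usage sketch building on batch_id above (field names here are illustrative, not a standard): fold the batch id, source offsets, and code version into a single commit record, per the implementation notes.

import json

def commit_metadata(run_date: str, dataset: str, offsets: dict, code_version: str) -> str:
    """One JSON commit record per run; assumes batch_id from the example above."""
    return json.dumps({
        "batch_id": batch_id(run_date, dataset),
        "dataset": dataset,
        "run_date": run_date,
        "source_offsets": offsets,     # e.g. {"orders-topic-0": 41293}
        "code_version": code_version,  # e.g. a git commit SHA
    }, sort_keys=True)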
Failure modes you should plan for (because they will happen)
- Late data: decide what “late” means (a watermark or cutoff), and what you do when it arrives (drop, quarantine, or reprocess).
- Out-of-order updates: if you can’t deterministically reconcile, you’re not doing CDC—you’re doing vibes.
- Partial writes: either use atomic commits (preferred) or write-then-promote patterns; see the sketch after this list.
- Hot partitions: your hash key choices are architecture, not implementation detail.
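For the write-then-promote case, here is a minimal sketch for a POSIX filesystem. It leans on the fact that os.replace is atomic when source and destination are on the same filesystem, so readers see either the old file or the new one, never a torn write.

import os
import tempfile

def write_then_promote(final_path: str, payload: bytes) -> None:
    """Write to a temp file, fsync, then atomically rename into place."""
    directory = os.path.dirname(final_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)  # same filesystem as the target
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes are durable before promoting
        os.replace(tmp_path, final_path)  # the atomic "promote" step
    except BaseException:
        os.unlink(tmp_path)  # leave no partial output behind
        raise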
Observability checklist (minimal, not optional)
- A run id / batch id that propagates through logs (see the sketch after this list)
- Row counts in/out with anomaly detection
- Lag metrics (event-time lag for streaming; freshness for batch)
- Data quality checks that fail the run, not just “notify Slack politely”
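A minimal sketch of the first and last items, using only the standard logging module. The 0.5% row-loss threshold is an assumption; pick one that matches your data.

import logging
import uuid

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s run=%(run_id)s %(levelname)s %(message)s")

def run_pipeline(rows_in: int, rows_out: int) -> None:
    run_id = uuid.uuid4().hex[:12]  # propagate this through every log line
    log = logging.LoggerAdapter(logging.getLogger("pipeline"), {"run_id": run_id})
    log.info("rows_in=%d rows_out=%d", rows_in, rows_out)
    # Quality gate: fail the run on anomalous row loss instead of paging politely.
    if rows_in and (rows_in - rows_out) / rows_in > 0.005:
        raise RuntimeError(f"run {run_id}: row count anomaly ({rows_in} -> {rows_out})")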
Takeaways
- The fastest path is the one you can safely re-run.
- “Exactly once” is a design constraint, not a product checkbox.
- If you can’t explain your pipeline’s invariants in one page, you can’t operate it.
Next time someone says “just rerun it”, smile and ask: “Cool—what’s the idempotency key?”
