Embeddings drift; treat them like any other dataset.
Look, I get it—shipping features is fun. But shipping correct pipelines is somehow considered optional.
The problem (a.k.a. where most designs go to die)
In production, data systems fail in boring ways:
- retries that create duplicates
- backfills that rewrite history (and then everyone argues about “the source of truth”)
- schema changes that arrive unannounced because someone added a column “real quick”
- caches (browser, CDN, query engine) that serve stale results while people swear “nothing changed”
If any of that sounds familiar, congratulations—you’re operating a real system.
A reference architecture that actually survives contact with reality
A reliable design usually has these properties:
- Deterministic inputs (you can describe exactly what this run consumes)
- Idempotent outputs (re-running produces the same end state)
- Versioned contracts at boundaries (schemas, keys, SLAs)
- Observable execution (metrics, lineage, logs with correlation)
Here’s the part everyone skips: you don’t get these by adding a dashboard. You get them by encoding invariants into the pipeline.
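To make "encoding invariants" concrete, here is a minimal sketch (the names Record and merge_batch are illustrative, not from any library): a merge that is last-writer-wins by (event time, sequence number), so re-applying the same batch leaves the end state unchanged.

# Invariant encoded in code: deterministic ordering + idempotent merge.
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    key: str          # primary key at the contract boundary
    event_time: int   # event time: the primary ordering rule
    seq: int          # source offset, used as a deterministic tie-breaker
    value: str

def merge_batch(state: dict[str, Record], batch: list[Record]) -> dict[str, Record]:
    """Last-writer-wins by (event_time, seq); re-applying a batch is a no-op."""
    out = dict(state)
    for rec in batch:
        current = out.get(rec.key)
        if current is None or (rec.event_time, rec.seq) > (current.event_time, current.seq):
            out[rec.key] = rec
    return out

# Idempotency in one line: merging the same batch twice equals merging it once.
# state = merge_batch({}, batch); assert merge_batch(state, batch) == state

Running it twice changes nothing, which is exactly the property the "just rerun it" conversation depends on.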
Implementation notes (things you’ll thank yourself for later)
- Commit metadata: record batch identifiers, source offsets, and the code version that produced the output.
- Monotonic event ordering: define ordering rules (event time + tie-breakers) and enforce them in merges.
- Backfill isolation: run backfills in separate compute pools / queues with explicit throughput caps.
- Contract enforcement: validate schema + keys at ingestion; quarantine bad records (don’t “fix” them silently).
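As a sketch of that last point, assuming records arrive as plain dicts and the contract is just required fields and types (real contracts carry more, like key uniqueness and SLAs), ingestion-time validation with quarantine can be this small:

REQUIRED = {"key": str, "event_time": int, "value": str}  # assumed contract

def validate(record: dict) -> list[str]:
    """Return contract violations; an empty list means the record passes."""
    errors = []
    for field, typ in REQUIRED.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], typ):
            errors.append(f"bad type for {field}: {type(record[field]).__name__}")
    return errors

def ingest(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split a batch into (accepted, quarantined); nothing gets 'fixed' silently."""
    accepted, quarantined = [], []
    for rec in records:
        errs = validate(rec)
        if errs:
            quarantined.append({"record": rec, "errors": errs})
        else:
            accepted.append(rec)
    return accepted, quarantined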
Concrete example
# Example: deterministic batch id + exactly-once-ish writes
import hashlib

def batch_id(run_date: str, dataset: str) -> str:
    """Stable id for (dataset, run_date): the same inputs always produce the same id."""
    return hashlib.sha256(f"{dataset}:{run_date}".encode()).hexdigest()

# Use batch_id as part of output path or commit metadata.
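And a usage sketch building on batch_id above (field names here are illustrative, not a standard): fold the batch id, source offsets, and code version into a single commit record, per the implementation notes.

import json

def commit_metadata(run_date: str, dataset: str, offsets: dict, code_version: str) -> str:
    """One JSON commit record per run; assumes batch_id from the example above."""
    return json.dumps({
        "batch_id": batch_id(run_date, dataset),
        "dataset": dataset,
        "run_date": run_date,
        "source_offsets": offsets,     # e.g. {"orders-topic-0": 41293}
        "code_version": code_version,  # e.g. a git commit SHA
    }, sort_keys=True)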
Failure modes you should plan for (because they will happen)
- Late data: decide what “late” means (a watermark or cutoff), and what you do when it arrives (drop, quarantine, or reprocess).
- Out-of-order updates: if you can’t deterministically reconcile, you’re not doing CDC—you’re doing vibes.
- Partial writes: either use atomic commits (preferred) or write-then-promote patterns; see the sketch after this list.
- Hot partitions: your hash key choices are architecture, not implementation detail.
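For the write-then-promote case, here is a minimal sketch for a POSIX filesystem. It leans on the fact that os.replace is atomic when source and destination are on the same filesystem, so readers see either the old file or the new one, never a torn write.

import os
import tempfile

def write_then_promote(final_path: str, payload: bytes) -> None:
    """Write to a temp file, fsync, then atomically rename into place."""
    directory = os.path.dirname(final_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)  # same filesystem as the target
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes are durable before promoting
        os.replace(tmp_path, final_path)  # the atomic "promote" step
    except BaseException:
        os.unlink(tmp_path)  # leave no partial output behind
        raise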
Observability checklist (minimal, not optional)
- A run id / batch id that propagates through logs (see the sketch after this list)
- Row counts in/out with anomaly detection
- Lag metrics (event-time lag for streaming; freshness for batch)
- Data quality checks that fail the run, not just “notify Slack politely”
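A minimal sketch of the first and last items, using only the standard logging module. The 0.5% row-loss threshold is an assumption; pick one that matches your data.

import logging
import uuid

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s run=%(run_id)s %(levelname)s %(message)s")

def run_pipeline(rows_in: int, rows_out: int) -> None:
    run_id = uuid.uuid4().hex[:12]  # propagate this through every log line
    log = logging.LoggerAdapter(logging.getLogger("pipeline"), {"run_id": run_id})
    log.info("rows_in=%d rows_out=%d", rows_in, rows_out)
    # Quality gate: fail the run on anomalous row loss instead of paging politely.
    if rows_in and (rows_in - rows_out) / rows_in > 0.005:
        raise RuntimeError(f"run {run_id}: row count anomaly ({rows_in} -> {rows_out})")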
Takeaways
- The fastest path is the one you can safely re-run.
- “Exactly once” is a design constraint, not a product checkbox.
- If you can’t explain your pipeline’s invariants in one page, you can’t operate it.
Next time someone says “just rerun it”, smile and ask: “Cool—what’s the idempotency key?”
