It’s great until you run 5000 pods and discover quotas.
Look, I get it—shipping features is fun. But shipping correct pipelines is somehow considered optional.
The problem (a.k.a. where most designs go to die)
In production, data systems fail in boring ways:
- retries that create duplicates
- backfills that rewrite history (and then everyone argues about “the source of truth”)
- schema changes that arrive unannounced because someone added a column “real quick”
- caches (browser, CDN, query engine) that serve stale results while people swear “nothing changed”
If any of that sounds familiar, congratulations—you’re operating a real system.
A reference architecture that actually survives contact with reality
A reliable design usually has these properties:
- Deterministic inputs (you can describe exactly what this run consumes)
- Idempotent outputs (re-running produces the same end state)
- Versioned contracts at boundaries (schemas, keys, SLAs)
- Observable execution (metrics, lineage, logs with correlation)
Here’s the part everyone skips: you don’t get these by adding a dashboard. You get them by encoding invariants into the pipeline.
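To make that concrete, here's a minimal sketch of what "encoding an invariant" means in practice. All names here (`commit_batch`, `assert_unique_keys`) are illustrative, not from any particular framework: the point is that the pipeline refuses to commit output that violates primary-key uniqueness, instead of reporting it on a dashboard after the fact.

```python
# Minimal sketch: an invariant enforced at commit time, not on a dashboard.
# All names are illustrative, not from a specific framework.

class InvariantViolation(Exception):
    pass

def assert_unique_keys(rows: list[dict], key: str) -> None:
    """Fail the run if the primary key is not unique in this batch."""
    seen = set()
    for row in rows:
        k = row[key]
        if k in seen:
            raise InvariantViolation(f"duplicate primary key: {k!r}")
        seen.add(k)

def commit_batch(rows: list[dict], key: str) -> None:
    assert_unique_keys(rows, key)   # invariant checked before any write happens
    # ... atomic write would go here ...
    print(f"committed {len(rows)} rows")

commit_batch([{"order_id": "a1"}, {"order_id": "a2"}], key="order_id")
```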
Implementation notes (things you’ll thank yourself for later)
- Commit metadata: record batch identifiers, source offsets, and the code version that produced the output.
- Monotonic event ordering: define ordering rules (event time + tie-breakers) and enforce them in merges (see the sketch after this list).
- Backfill isolation: run backfills in separate compute pools / queues with explicit throughput caps.
- Contract enforcement: validate schema + keys at ingestion; quarantine bad records (don’t “fix” them silently).
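Here's a minimal sketch of the merge-ordering rule from the "monotonic event ordering" note: pick a total order (event time first, then a monotonic tie-breaker such as a source offset) and apply it consistently, so replays resolve identically no matter what order updates arrive in. Field names are illustrative.

```python
# Sketch: deterministic last-write-wins merge.
# Ordering key = (event_ts, source_offset): event time first, then a
# monotonic tie-breaker so replays always resolve the same way.

def merge_updates(current: dict, updates: list[dict]) -> dict:
    """Merge updates keyed by order_id; the highest ordering key wins."""
    state = dict(current)
    for u in sorted(updates, key=lambda u: (u["event_ts"], u["source_offset"])):
        prev = state.get(u["order_id"])
        if prev is None or (u["event_ts"], u["source_offset"]) >= (
            prev["event_ts"], prev["source_offset"]
        ):
            state[u["order_id"]] = u
    return state

updates = [
    {"order_id": "a1", "event_ts": 100, "source_offset": 7, "status": "paid"},
    {"order_id": "a1", "event_ts": 100, "source_offset": 5, "status": "pending"},
]
# Same result regardless of arrival order: offset 7 wins the tie.
print(merge_updates({}, updates)["a1"]["status"])  # -> "paid"
```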
Concrete example
```yaml
# Example: a minimal data contract / schema expectation
version: 1
entity: orders
primary_key: [order_id]
columns:
  - name: order_id
    type: string
    required: true
  - name: order_ts
    type: timestamp
    required: true
  - name: total_amount
    type: decimal(18,2)
    required: true
checks:
  - name: non_negative_amount
    expr: total_amount >= 0
```
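And here's a sketch of what enforcing that contract at ingestion could look like: valid rows pass through, invalid ones land in a quarantine list with a reason attached, and nothing gets silently "fixed". The checks are hand-rolled for illustration; a real deployment would typically lean on a schema/validation library.

```python
# Sketch: enforce the contract above at ingestion.
# Hand-rolled checks for illustration; bad records are quarantined, not patched.
from decimal import Decimal, InvalidOperation

REQUIRED = ["order_id", "order_ts", "total_amount"]

def validate(row: dict) -> str | None:
    """Return a rejection reason, or None if the row passes the contract."""
    for col in REQUIRED:
        if row.get(col) in (None, ""):
            return f"missing required column: {col}"
    try:
        amount = Decimal(str(row["total_amount"]))
    except InvalidOperation:
        return "total_amount is not a decimal"
    if amount < 0:
        return "check failed: non_negative_amount"
    return None

def ingest(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    accepted, quarantined = [], []
    for row in rows:
        reason = validate(row)
        if reason is None:
            accepted.append(row)
        else:
            quarantined.append({"row": row, "reason": reason})
    return accepted, quarantined

ok, bad = ingest([
    {"order_id": "a1", "order_ts": "2024-01-01T00:00:00Z", "total_amount": "19.99"},
    {"order_id": "a2", "order_ts": "2024-01-01T00:01:00Z", "total_amount": "-5"},
])
print(len(ok), bad[0]["reason"])  # 1 check failed: non_negative_amount
```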
Failure modes you should plan for (because they will happen)
- Late data: decide what “late” means (typically a watermark), and what you do when it shows up: drop it, route it to a side output, or trigger a reprocess.
- Out-of-order updates: if you can’t deterministically reconcile, you’re not doing CDC—you’re doing vibes.
- Partial writes: either use atomic commits (preferred) or write-then-promote patterns (sketched after this list).
- Hot partitions: your hash key choices are architecture, not implementation detail.
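For the partial-writes item, here's a minimal write-then-promote sketch for a local file: stage the output under a temporary name, then promote it with an atomic rename so readers see either the old file or the complete new one, never a half-written state. `os.replace` is atomic within a single filesystem; object stores need their own commit protocol.

```python
# Sketch: write-then-promote for a local file.
# os.replace is atomic on the same filesystem, so readers never
# observe a partial write.
import os
import tempfile

def write_then_promote(path: str, data: bytes) -> None:
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())    # ensure bytes hit disk before promoting
        os.replace(tmp_path, path)  # atomic promote
    except BaseException:
        os.unlink(tmp_path)         # clean up the staged file on failure
        raise

write_then_promote("orders_snapshot.json", b'{"rows": 42}')
```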
Observability checklist (minimal, not optional)
- A run id / batch id that propagates through logs (sketch after this checklist)
- Row counts in/out with anomaly detection
- Lag metrics (event-time lag for streaming; freshness for batch)
- Data quality checks that fail the run, not just “notify Slack politely”
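On the run-id item: here's one way to do the propagation with Python's stdlib `logging` plus a `ContextVar`, so every log line carries the id without threading it through every call. The `run_id` field name is this example's choice, not a standard.

```python
# Sketch: stamp every log line with the current run id via a logging.Filter.
import logging
import uuid
from contextvars import ContextVar

run_id: ContextVar[str] = ContextVar("run_id", default="-")

class RunIdFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        record.run_id = run_id.get()   # every record carries the run id
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s run=%(run_id)s %(message)s"))
handler.addFilter(RunIdFilter())
log = logging.getLogger("pipeline")
log.addHandler(handler)
log.setLevel(logging.INFO)

run_id.set(uuid.uuid4().hex[:12])   # set once at run start; flows through all logs
log.info("ingest started")
log.info("ingest finished, rows=1042")
```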
Takeaways
- The fastest path is the one you can safely re-run.
- “Exactly once” is a design constraint, not a product checkbox.
- If you can’t explain your pipeline’s invariants in one page, you can’t operate it.
Next time someone says “just rerun it”, smile and ask: “Cool—what’s the idempotency key?”
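And if you want a concrete answer ready for that question, one common shape (illustrative, not canonical) is a deterministic digest of exactly what the run consumed: source, offset range, and the code version that produced it.

```python
# Sketch: one possible idempotency key, a deterministic digest of the inputs.
# Re-running the same batch with the same code yields the same key,
# so the sink can detect and skip (or overwrite) duplicates.
import hashlib
import json

def idempotency_key(source: str, offsets: dict[str, int], code_version: str) -> str:
    payload = json.dumps(
        {"source": source, "offsets": offsets, "code_version": code_version},
        sort_keys=True,   # stable serialization -> stable key
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:16]

print(idempotency_key("orders_topic", {"0": 1200, "1": 980}, "git:4f2a9c1"))
```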
