basix
by basix
3 min read

Categories

  • data-engineering

Tags

  • serving performance

Realtime is a budget decision.

Look, I get it—shipping features is fun. But shipping correct pipelines is somehow considered optional.

The problem (a.k.a. where most designs go to die)

In production, data systems fail in boring ways:

  • retries that create duplicates
  • backfills that rewrite history (and then everyone argues about “the source of truth”)
  • schema changes that arrive unannounced because someone added a column “real quick”
  • caches (browser, CDN, query engine) that serve stale results while people swear “nothing changed”

If any of that sounds familiar, congratulations—you’re operating a real system.

A reference architecture that actually survives contact with reality

A reliable design usually has these properties:

  1. Deterministic inputs (you can describe exactly what this run consumes)
  2. Idempotent outputs (re-running produces the same end state)
  3. Versioned contracts at boundaries (schemas, keys, SLAs)
  4. Observable execution (metrics, lineage, logs with correlation)

Here’s the part everyone skips: you don’t get these by adding a dashboard. You get them by encoding invariants into the pipeline.

Implementation notes (things you’ll thank yourself for later)

  • Commit metadata: record batch identifiers, source offsets, and the code version that produced the output.
  • Monotonic event ordering: define ordering rules (event time + tie-breakers) and enforce them in merges.
  • Backfill isolation: run backfills in separate compute pools / queues with explicit throughput caps.
  • Contract enforcement: validate schema + keys at ingestion; quarantine bad records (don’t “fix” them silently).

Concrete example

-- Example: idempotent merge for CDC into a lake table
MERGE INTO silver.customer t
USING staging.customer_cdc s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN MATCHED AND s.op IN ('U','C') AND s.event_ts >= t.last_event_ts THEN
  UPDATE SET *
WHEN NOT MATCHED AND s.op IN ('C') THEN
  INSERT *;

Failure modes you should plan for (because they will happen)

  • Late data: decide what “late” means, and what you do when it happens.
  • Out-of-order updates: if you can’t deterministically reconcile, you’re not doing CDC—you’re doing vibes.
  • Partial writes: either use atomic commits (preferred) or write-then-promote patterns.
  • Hot partitions: your hash key choices are architecture, not implementation detail.

Observability checklist (minimal, not optional)

  • A run id / batch id that propagates through logs
  • Row counts in/out with anomaly detection
  • Lag metrics (event-time lag for streaming; freshness for batch)
  • Data quality checks that fail the run, not just “notify Slack politely”

Takeaways

  • The fastest path is the one you can safely re-run.
  • “Exactly once” is a design constraint, not a product checkbox.
  • If you can’t explain your pipeline’s invariants in one page, you can’t operate it.

Next time someone says “just rerun it”, smile and ask: “Cool—what’s the idempotency key?”