It’s great until you run 5000 pods and discover quotas.
Look, I get it—shipping features is fun. But shipping correct pipelines is somehow considered optional.
The problem (a.k.a. where most designs go to die)
In production, data systems fail in boring ways:
- retries that create duplicates
- backfills that rewrite history (and then everyone argues about “the source of truth”)
- schema changes that arrive unannounced because someone added a column “real quick”
- caches (browser, CDN, query engine) that serve stale results while people swear “nothing changed”
If any of that sounds familiar, congratulations—you’re operating a real system.
A reference architecture that actually survives contact with reality
A reliable design usually has these properties:
- Deterministic inputs (you can describe exactly what this run consumes)
- Idempotent outputs (re-running produces the same end state)
- Versioned contracts at boundaries (schemas, keys, SLAs)
- Observable execution (metrics, lineage, logs with correlation)
Here’s the part everyone skips: you don’t get these by adding a dashboard. You get them by encoding invariants into the pipeline.
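To make that concrete, here's a minimal sketch of what "encoding an invariant" means in practice. All names here (`commit_batch`, `assert_unique_keys`) are illustrative, not from any particular framework: the point is that the pipeline refuses to commit output that violates primary-key uniqueness, instead of reporting it on a dashboard after the fact.

```python
# Minimal sketch: an invariant enforced at commit time, not on a dashboard.
# All names are illustrative, not from a specific framework.

class InvariantViolation(Exception):
    pass

def assert_unique_keys(rows: list[dict], key: str) -> None:
    """Fail the run if the primary key is not unique in this batch."""
    seen = set()
    for row in rows:
        k = row[key]
        if k in seen:
            raise InvariantViolation(f"duplicate primary key: {k!r}")
        seen.add(k)

def commit_batch(rows: list[dict], key: str) -> None:
    assert_unique_keys(rows, key)   # invariant checked before any write happens
    # ... atomic write would go here ...
    print(f"committed {len(rows)} rows")

commit_batch([{"order_id": "a1"}, {"order_id": "a2"}], key="order_id")
```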
Implementation notes (things you’ll thank yourself for later)
- Commit metadata: record batch identifiers, source offsets, and the code version that produced the output.
- Monotonic event ordering: define ordering rules (event time + tie-breakers) and enforce them in merges (see the sketch after this list).
- Backfill isolation: run backfills in separate compute pools / queues with explicit throughput caps.
- Contract enforcement: validate schema + keys at ingestion; quarantine bad records (don’t “fix” them silently).
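Here's a minimal sketch of the merge-ordering rule from the "monotonic event ordering" note: pick a total order (event time first, then a monotonic tie-breaker such as a source offset) and apply it consistently, so replays resolve identically no matter what order updates arrive in. Field names are illustrative.

```python
# Sketch: deterministic last-write-wins merge.
# Ordering key = (event_ts, source_offset): event time first, then a
# monotonic tie-breaker so replays always resolve the same way.

def merge_updates(current: dict, updates: list[dict]) -> dict:
    """Merge updates keyed by order_id; the highest ordering key wins."""
    state = dict(current)
    for u in sorted(updates, key=lambda u: (u["event_ts"], u["source_offset"])):
        prev = state.get(u["order_id"])
        if prev is None or (u["event_ts"], u["source_offset"]) >= (
            prev["event_ts"], prev["source_offset"]
        ):
            state[u["order_id"]] = u
    return state

updates = [
    {"order_id": "a1", "event_ts": 100, "source_offset": 7, "status": "paid"},
    {"order_id": "a1", "event_ts": 100, "source_offset": 5, "status": "pending"},
]
# Same result regardless of arrival order: offset 7 wins the tie.
print(merge_updates({}, updates)["a1"]["status"])  # -> "paid"
```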
Concrete example
```yaml
# Example: a minimal data contract / schema expectation
version: 1
entity: orders
primary_key: [order_id]
columns:
  - name: order_id
    type: string
    required: true
  - name: order_ts
    type: timestamp
    required: true
  - name: total_amount
    type: decimal(18,2)
    required: true
checks:
  - name: non_negative_amount
    expr: total_amount >= 0
```
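And here's a sketch of what enforcing that contract at ingestion could look like: valid rows pass through, invalid ones land in a quarantine list with a reason attached, and nothing gets silently "fixed". The checks are hand-rolled for illustration; a real deployment would typically lean on a schema/validation library.

```python
# Sketch: enforce the contract above at ingestion.
# Hand-rolled checks for illustration; bad records are quarantined, not patched.
from decimal import Decimal, InvalidOperation

REQUIRED = ["order_id", "order_ts", "total_amount"]

def validate(row: dict) -> str | None:
    """Return a rejection reason, or None if the row passes the contract."""
    for col in REQUIRED:
        if row.get(col) in (None, ""):
            return f"missing required column: {col}"
    try:
        amount = Decimal(str(row["total_amount"]))
    except InvalidOperation:
        return "total_amount is not a decimal"
    if amount < 0:
        return "check failed: non_negative_amount"
    return None

def ingest(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    accepted, quarantined = [], []
    for row in rows:
        reason = validate(row)
        if reason is None:
            accepted.append(row)
        else:
            quarantined.append({"row": row, "reason": reason})
    return accepted, quarantined

ok, bad = ingest([
    {"order_id": "a1", "order_ts": "2024-01-01T00:00:00Z", "total_amount": "19.99"},
    {"order_id": "a2", "order_ts": "2024-01-01T00:01:00Z", "total_amount": "-5"},
])
print(len(ok), bad[0]["reason"])  # 1 check failed: non_negative_amount
```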
Failure modes you should plan for (because they will happen)
- Late data: decide what “late” means (typically a watermark), and what you do when it shows up: drop it, route it to a side output, or trigger a reprocess.
- Out-of-order updates: if you can’t deterministically reconcile, you’re not doing CDC—you’re doing vibes.
- Partial writes: either use atomic commits (preferred) or write-then-promote patterns (sketched after this list).
- Hot partitions: your hash key choices are architecture, not implementation detail.
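For the partial-writes item, here's a minimal write-then-promote sketch for a local file: stage the output under a temporary name, then promote it with an atomic rename so readers see either the old file or the complete new one, never a half-written state. `os.replace` is atomic within a single filesystem; object stores need their own commit protocol.

```python
# Sketch: write-then-promote for a local file.
# os.replace is atomic on the same filesystem, so readers never
# observe a partial write.
import os
import tempfile

def write_then_promote(path: str, data: bytes) -> None:
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())    # ensure bytes hit disk before promoting
        os.replace(tmp_path, path)  # atomic promote
    except BaseException:
        os.unlink(tmp_path)         # clean up the staged file on failure
        raise

write_then_promote("orders_snapshot.json", b'{"rows": 42}')
```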
Observability checklist (minimal, not optional)
- A run id / batch id that propagates through logs (sketch after this checklist)
- Row counts in/out with anomaly detection
- Lag metrics (event-time lag for streaming; freshness for batch)
- Data quality checks that fail the run, not just “notify Slack politely”
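On the run-id item: here's one way to do the propagation with Python's stdlib `logging` plus a `ContextVar`, so every log line carries the id without threading it through every call. The `run_id` field name is this example's choice, not a standard.

```python
# Sketch: stamp every log line with the current run id via a logging.Filter.
import logging
import uuid
from contextvars import ContextVar

run_id: ContextVar[str] = ContextVar("run_id", default="-")

class RunIdFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        record.run_id = run_id.get()   # every record carries the run id
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s run=%(run_id)s %(message)s"))
handler.addFilter(RunIdFilter())
log = logging.getLogger("pipeline")
log.addHandler(handler)
log.setLevel(logging.INFO)

run_id.set(uuid.uuid4().hex[:12])   # set once at run start; flows through all logs
log.info("ingest started")
log.info("ingest finished, rows=1042")
```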
Takeaways
- The fastest path is the one you can safely re-run.
- “Exactly once” is a design constraint, not a product checkbox.
- If you can’t explain your pipeline’s invariants in one page, you can’t operate it.
Next time someone says “just rerun it”, smile and ask: “Cool—what’s the idempotency key?”
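And if you want a concrete answer ready for that question, one common shape (illustrative, not canonical) is a deterministic digest of exactly what the run consumed: source, offset range, and the code version that produced it.

```python
# Sketch: one possible idempotency key, a deterministic digest of the inputs.
# Re-running the same batch with the same code yields the same key,
# so the sink can detect and skip (or overwrite) duplicates.
import hashlib
import json

def idempotency_key(source: str, offsets: dict[str, int], code_version: str) -> str:
    payload = json.dumps(
        {"source": source, "offsets": offsets, "code_version": code_version},
        sort_keys=True,   # stable serialization -> stable key
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:16]

print(idempotency_key("orders_topic", {"0": 1200, "1": 980}, "git:4f2a9c1"))
```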
