data-engineering
The Annoying Truth About ‘Serverless’ Data
Serverless mostly means ‘someone else runs the servers’. You still pay.
Lambda Architecture Without the Trauma
Hybrid batch+streaming can work—if you pick a single source of truth and stop duplicating business logic.
Vector Search Pipelines: Embeddings Are Data Engineering Too
Embeddings drift; treat them like any other dataset.
Feature Stores: Centralize Reuse, Decentralize Blame
A feature store is a contract system with extra steps.
Observability: Trace IDs for Data Pipelines (Yes, It Works)
Correlate events across ingest → transform → serve. Debugging gets boring.
Serving Layers: Materialized Views, Caches, and the Myth of ‘Realtime’
Realtime is a budget decision.
Metadata-Driven Pipelines: Dynamic Doesn’t Mean Uncontrolled
Drive config from metadata, but validate like a paranoid adult.
Bronze Table Quality Gates: Yes, Even Bronze
If you ingest garbage, you’ll analyze garbage. That’s not ‘agile’.
Kubernetes for Data Jobs: The Part Where YAML Becomes a Lifestyle
It’s great until you run 5000 pods and discover quotas.
Change Data Capture on Azure: Event Hubs, Debezium, and Reality
Azure can do CDC fine—if you respect throughput units and partition keys.
Stateful Streaming: Size Your State, or Your State Sizes You
State management is capacity planning disguised as code.
JSON in the Lake: Sure, If You Hate Yourself
Store JSON raw, but normalize early if you like money.
The Right Way to Version Datasets (Hint: Not Folder Names)
Versioning is lineage plus immutability plus access patterns.
Batch Windows: Stop Using ‘Daily’ as a Data Type
Time zones and DST are not edge cases; they’re the default.
Multi-tenant Data Platforms: Isolation or Incident
Noisy neighbors aren’t cute when they burn your CPU budget.
Object Storage Consistency: Read-After-Write Is Not a Love Language
Assume eventual consistency somewhere; design around it.
Indexing Isn’t Just for OLTP: Metastore Design Matters
If your metastore is slow, everything is slow. Welcome.
dbt Snapshots: Helpful, Until You Pretend They’re CDC
Snapshots are a modeling tool, not a streaming system.
Cost Engineering for Data: Because Finance Eventually Finds You
Treat cost as a first-class metric, or it will treat you as a line item.
Survivable Pipelines: Replay, Reprocess, Reconcile
If you can’t replay, you can’t recover.
The Case for Immutable Data: Mutable Data Is How You Get Ghost Bugs
Immutability makes pipelines debuggable and reproducible.
Watermark Strategy: You Can’t Outrun Late Data
Pick a lateness budget; enforce it; move on.
Spark Shuffle Tuning: Because Default Settings Love Chaos
Tune shuffles with evidence, not folklore.
The One True Pattern for ‘Delete’ Events in Append-Only Lakes
Soft deletes are easy; correct deletes are architecture.
Medallion Layering with Real Contracts (Not Just Folder Names)
A folder is not a boundary. A validated schema is.
Streaming Joins: Where Innocence Goes to Die
State grows, watermarks slip, and suddenly you’re paging at 2am.
Data Mesh: The Part Where You Rebuild Central Governance Anyway
Decentralize ownership; centralize standards. Yes, both.
Lakehouse Tables vs Warehouse Tables: It’s About Guarantees
Pick the engine based on constraints and workload, not hype.
SLA vs SLO vs ‘We Promise’—Define It Like You Mean It
If it’s not measured, it’s not real. If it’s not alerting, it’s a chart.
Fan-out Pipelines: Congratulations, You Built a DoS Machine
A little buffering and quotas save a lot of pain.
Time Travel Isn’t Backup: Snapshots, Retention, and Regret
Retention policies are not optional; your lawyers will agree.
Row-Level Security: The Part Where Governance Meets Reality
Security models must be enforceable at query time—period.
Compression Codecs: ZSTD Is Great Until Your CPU Says No
Compression trades bytes for cycles; measure both.
Great Expectations Isn’t Enough: Make Tests Fail the Build
Data tests that don’t block deploys are motivational posters.
Dimensional Modeling in 2025: Still Works, Still Misused
Facts and dimensions aren’t the problem; your misuse of them is.
Terraforming Data Platforms: Drift Happens. Manage It.
If your data platform is click-ops, your incident schedule will reflect that.
OpenLineage + DataHub: Observability for People Who Like Sleep
Lineage is a debugging primitive, not just a compliance checkbox.
Column Pruning and Predicate Pushdown: The Two Free Lunches
If you’re scanning all columns, you’re doing it wrong. Yes, you.
Exactly-Once: The Marketing Term You Can Actually Implement
Exactly-once is achievable in narrow pipelines if you respect the rules.
Slowly Changing Dimensions Without Slowly Changing Sanity
SCD patterns are deterministic—your tooling should be too.
Micro-batch vs True Streaming: The Cost of Pretending
Micro-batch is great until your latency SLO is tighter than your batch interval.
Dedup at Scale: Hashes, Windows, and the Art of Not Double-Charging
Dedup is easy until you need correctness and speed at the same time.
Data Contracts: Because ‘It Worked Yesterday’ Isn’t a Spec
Define expectations at the boundary and enforce them automatically.
The Practical Guide to Backfills Without Setting the Warehouse on Fire
Backfills are load tests you run in production. Plan accordingly.
Event Time vs Processing Time: Choose Wrong, Debug Forever
Late data is inevitable; your design should plan for it.
Delta Lake OPTIMIZE Isn’t a Spell. It’s a Bill.
Compaction improves reads; it also writes a lot. Surprise.
Lakehouse Medallion Layers: Bronze/Silver/Gold… or Just Bronze Forever
Layers only help if you enforce invariants at each boundary.
Kafka Compaction: The Feature You Ignore Until It Eats Your Cluster
Compaction is incredible—unless you mis-key and turn history into confetti.
Schema Evolution with Iceberg: The Grown-Up Way to Break Things
Evolving schemas safely requires contracts, not vibes.
Idempotency: Because Retries Are Not a Personality Trait
If your pipeline breaks on retry, it’s not ‘flaky’—it’s wrong.
Partitioning: Stop Worshipping the Date Column
Partitioning is an I/O strategy, not a religious practice.
Your CDC Pipeline Is Lying (and Your SCD2 Table Knows It)
Designing CDC with exactly-once semantics is less about magic and more about being painfully explicit.