Data substrate

Lakehouse Architecture

An architecture that combines data-lake economics (cheap object storage, open file formats) with warehouse guarantees (ACID transactions, schema evolution, time travel) so analytics, AI retrieval, and machine learning all read from the same trusted tables.

What it is

The term, popularized by Databricks, describes a stack where Parquet files in object storage (S3, GCS, Azure Blob) are managed by an open table format (Apache Iceberg, Delta Lake, or Apache Hudi) that adds transactions, schema evolution, partition evolution, time travel, and concurrent writes. The result has the cost and openness of a data lake with the correctness guarantees of a warehouse; a minimal code sketch follows the list below.

  • Object storage + Parquet for cost and openness
  • Iceberg, Delta, or Hudi for ACID and schema evolution
  • Pluggable compute: Spark, Trino, DuckDB, Snowflake, BigQuery
  • Time travel for reproducible reads; partition pruning for query performance
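
Here is a minimal sketch of those pieces working together, using PySpark with Delta Lake as the table format. The bucket path, data, and Spark configuration are illustrative assumptions rather than a production setup; it presumes the delta-spark package is installed.

```python
# Sketch only: Parquet in object storage managed by Delta Lake.
# Assumes the delta-spark package is installed; the bucket path and
# data are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakehouse-sketch")
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "s3://demo-bucket/events"  # hypothetical path

# ACID write: readers see either the old snapshot or the new one,
# never a half-written directory of files.
spark.range(1000).withColumnRenamed("id", "event_id") \
    .write.format("delta").mode("append").save(path)

# Schema evolution: a new column merges into the table schema.
spark.range(10).selectExpr("id AS event_id", "'web' AS channel") \
    .write.format("delta").mode("append") \
    .option("mergeSchema", "true").save(path)

# Time travel: read the table exactly as it stood at version 0.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
print(v0.count())  # 1000: the second commit is not visible at v0
```

Each save above is one atomic commit in the Delta transaction log; that log, not the Parquet files themselves, is what the versionAsOf read resolves against.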

Why it matters for AI

Retrieval pipelines, analytics dashboards, and ML training read from the same tables. There is no longer a 'warehouse copy' and an 'AI copy' that drift apart. When a number on the executive dashboard differs from what the agent cited, the disagreement is a real bug, not a sync delay.
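
To make that concrete, here is a sketch of a second engine reading the very same table, with no export or sync job in between. It assumes DuckDB with its delta and httpfs extensions and S3 credentials already in the environment; the path is the hypothetical table from the sketch above.

```python
# A second engine queries the same committed table, not a copy.
# Assumes DuckDB's delta and httpfs extensions are installable and
# S3 credentials are configured; the path is hypothetical.
import duckdb

con = duckdb.connect()
for ext in ("delta", "httpfs"):
    con.execute(f"INSTALL {ext}")
    con.execute(f"LOAD {ext}")

# The dashboard query and the retrieval pipeline read the same
# snapshot, so any disagreement between them is a real bug.
rows = con.execute(
    "SELECT channel, count(*) AS n "
    "FROM delta_scan('s3://demo-bucket/events') "
    "GROUP BY channel"
).fetchall()
print(rows)
```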

What we evaluate

Table format choice depends on your existing stack and write patterns: Iceberg has the broadest engine support and the cleanest spec; Delta is the obvious choice on Databricks; Hudi shines on streaming upserts with frequent change data capture (CDC) ingestion. Engagements often start with a small migration pilot to surface the actual operational differences before committing.
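
As one flavor of what such a pilot exercises, here is a sketch of catalog-mediated access with PyIceberg, if Iceberg were the candidate format. The REST catalog endpoint and table name are hypothetical, and the pyiceberg package is assumed.

```python
# Sketch: engine-agnostic reads through an Iceberg REST catalog.
# Assumes the pyiceberg package; endpoint and table are hypothetical.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "pilot",
    type="rest",
    uri="http://localhost:8181",  # hypothetical catalog endpoint
)
table = catalog.load_table("demo.events")

# Any engine implementing the Iceberg spec resolves the same snapshot
# list from the catalog; here the current snapshot becomes Arrow.
print(table.scan().to_arrow().num_rows)
```

Running the same reads and a handful of writes through two candidate formats in a pilot like this tends to surface the operational differences (catalog behavior, credential plumbing, compaction defaults) faster than any spec comparison.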
