Data substrate

Lakehouse

An architecture that combines cheap object storage with warehouse-grade table guarantees — the substrate where analytics, AI retrieval, and ML training read from the same governed tables.

Operating principle

Production AI is not a prompt. It is a system of context, tools, permissions, traces, evals, and feedback loops.

What it is

A lakehouse is data-lake economics with data-warehouse correctness. Data is stored as files, typically Parquet, on cheap object storage (S3, GCS, Azure Blob Storage) and managed by an open table format (Apache Iceberg, Delta Lake, Apache Hudi) that adds ACID transactions, schema evolution, and time travel. One substrate, queried by many engines.
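To make the table-format idea concrete, here is a minimal in-memory sketch in Python. It is a toy model, not the API of Iceberg, Delta Lake, or Hudi: the names ToyTable, Snapshot, commit, and scan are hypothetical, the "object store" is a plain dict, and the file paths are labels only (nothing is written as Parquet). It assumes a single writer, so the "atomic commit" is just one list append. What it shows is the shape of the mechanism: data files are immutable, each commit publishes a new snapshot that lists the files it covers, and readers pick a snapshot to read.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """One committed table version: a schema plus the data files it references."""
    snapshot_id: int
    schema: tuple[str, ...]
    files: tuple[str, ...]


@dataclass
class ToyTable:
    """Toy stand-in for a table format's metadata layer (hypothetical names)."""
    object_store: dict[str, list[dict]] = field(default_factory=dict)  # path -> rows
    snapshots: list[Snapshot] = field(default_factory=list)

    def commit(self, rows, schema):
        """Write one immutable data file, then publish a new snapshot."""
        current = self.snapshots[-1] if self.snapshots else None
        # Assumption for this sketch: schema evolution may only add columns.
        if current and not set(current.schema) <= set(schema):
            raise ValueError("schema evolution here may only add columns")
        next_id = len(self.snapshots) + 1
        path = f"data/file-{next_id}.parquet"  # label only; rows stay in memory
        self.object_store[path] = list(rows)
        old_files = current.files if current else ()
        # Readers only see the new file once this snapshot is appended.
        self.snapshots.append(Snapshot(next_id, tuple(schema), old_files + (path,)))
        return next_id

    def scan(self, snapshot_id=None):
        """Read the latest snapshot, or an older one for time travel."""
        if not self.snapshots:
            return []
        snap = self.snapshots[-1] if snapshot_id is None else self.snapshots[snapshot_id - 1]
        # Project every row onto the snapshot's schema; missing columns read as None.
        return [
            {col: row.get(col) for col in snap.schema}
            for path in snap.files
            for row in self.object_store[path]
        ]


table = ToyTable()
v1 = table.commit([{"id": 1, "amount": 10}], schema=("id", "amount"))
v2 = table.commit([{"id": 2, "amount": 5, "region": "eu"}], schema=("id", "amount", "region"))

latest = table.scan()                 # both rows, with the added "region" column
as_of_v1 = table.scan(snapshot_id=v1)  # time travel: only the first row, old schema
```

In this sketch the first row reads back with region set to None under the latest snapshot, which mirrors how added columns are commonly surfaced as nulls for older files. Real table formats add the parts left out here, such as concurrent-writer conflict handling, deletes, and durable metadata.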

Why it matters

Before lakehouses, organizations duplicated their data: a warehouse for analytics, a separate lake for ML, a third copy for AI. The copies drifted. The numbers disagreed. The lakehouse removes the duplicates — analytics and AI read from the same trusted tables.

How it works

See the dedicated Lakehouse Architecture article for the full mechanism.

Related resources

Lakehouse Architecture

An architecture that combines data-lake economics (cheap object storage, open file formats) with warehouse guarantees (ACID transactions, schema evolution, time travel) so analytics, AI retrieval, and machine learning all read from the same trusted tables.

Data Lake

A capability in the Group e-media information AI stack. This resource connects the subject to data substrate, agent runtime, evals, and operations.

Data Quality

Contracts, validation, lineage, freshness, and ownership for the data your AI reads from. This is not a one-time cleanup project but an ongoing operating discipline.
