Data Lake & Lakehouse
The storage and ingestion sub-service under Data Foundations. A lakehouse on open table formats with streaming and CDC ingestion, lineage, dead-letter handling, and retrieval indexes. One substrate for analytics, AI, and operational tools.
What it is
Data Lake & Lakehouse is the storage and ingestion sub-service under Data Foundations. It is the substrate where your tables live: Parquet files on cloud object storage (S3, GCS, or Azure Blob), managed by an open table format (Iceberg, Delta Lake, or Hudi) that adds ACID transactions, schema evolution, and time travel while staying open to any compute engine. It is also the ingestion plumbing that fills those tables: batch loads, streaming sources, change-data-capture from operational databases, and the contracts that keep all of it honest.
Why a lakehouse, not a warehouse or a raw lake
A raw lake (files dumped on object storage with no schema discipline) tends to become a swamp, often within a year: nobody can find anything, queries are slow, and nothing is consistent. A traditional warehouse gives you correctness but ties you to one vendor's compute and struggles with the unstructured content AI needs. A lakehouse keeps the cheap, open storage of a lake and adds the warehouse's correctness on top: ACID transactions, schema evolution, time travel, and column statistics. The same files serve a BI tool, a notebook, and an agent's retrieval without copies or drift.
What we build
An open table format on your chosen object storage, sized to your workload. Ingestion paths that match the data's cadence: streaming via Kafka, Kinesis, or Pulsar; CDC via Debezium for operational databases; batch loads for daily sources. Each path runs through ingestion contracts (input shape, freshness target, owner, error budget) with quality gates that route bad records to a dead-letter table. Column-level lineage. A catalog (Unity Catalog, Polaris, DataHub, or OpenMetadata) so the substrate is discoverable. Retrieval-ready transformations (chunking, embeddings, hybrid BM25 + dense indexes) sit alongside the analytical tables, so one source of truth powers a dashboard and a citation.
- Open table format (Iceberg, Delta, or Hudi) on object storage
- Streaming (Kafka, Kinesis, Pulsar) and CDC (Debezium) ingestion
- Ingestion contracts with named owners and freshness SLOs
- Column- and row-level lineage with impact analysis
- Quality gates at ingestion with dead-letter tables
- Hybrid retrieval indexes (BM25 + dense embeddings)
- Catalog and discovery (Unity Catalog, Polaris, DataHub, or OpenMetadata)
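As an illustration of how an ingestion contract and a quality gate fit together, here is a minimal Python sketch using only the standard library. Every name in it (`IngestionContract`, `validate`, `run_gate`, the `orders_cdc` source, the `payments-team` owner, the field names) is hypothetical and chosen for the example; a real deployment would express the same idea in whatever pipeline framework and table format the engagement uses.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Any


@dataclass(frozen=True)
class IngestionContract:
    """Hypothetical contract: input shape, freshness target, owner, error budget."""
    source: str
    owner: str
    required_fields: dict[str, type]  # input shape: field name -> expected type
    freshness_target: timedelta       # maximum acceptable record age
    error_budget: float               # tolerated fraction of rejected records


def validate(record: dict[str, Any], contract: IngestionContract, now: datetime) -> list[str]:
    """Return the contract violations for one record (empty list means clean)."""
    problems = []
    for name, expected in contract.required_fields.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            problems.append(f"bad type for {name}: {type(record[name]).__name__}")
    event_time = record.get("event_time")
    if isinstance(event_time, datetime) and now - event_time > contract.freshness_target:
        problems.append("stale record")
    return problems


def run_gate(records: list[dict[str, Any]], contract: IngestionContract, now: datetime):
    """Split a batch into accepted rows and dead-letter rows, then check the error budget."""
    accepted, dead_letter = [], []
    for record in records:
        problems = validate(record, contract, now)
        if problems:
            # Keep the original record plus the reasons, so a rejection is attributable.
            dead_letter.append({
                "source": contract.source,
                "owner": contract.owner,
                "record": record,
                "reasons": problems,
                "rejected_at": now,
            })
        else:
            accepted.append(record)
    reject_rate = len(dead_letter) / len(records) if records else 0.0
    return accepted, dead_letter, reject_rate <= contract.error_budget


# Example batch: one clean record, one with a wrong type, one that is stale.
contract = IngestionContract(
    source="orders_cdc",
    owner="payments-team",
    required_fields={"order_id": str, "amount": float, "event_time": datetime},
    freshness_target=timedelta(hours=1),
    error_budget=0.05,
)
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
batch = [
    {"order_id": "A1", "amount": 19.99, "event_time": now - timedelta(minutes=5)},
    {"order_id": "A2", "amount": "oops", "event_time": now - timedelta(minutes=5)},
    {"order_id": "A3", "amount": 5.0, "event_time": now - timedelta(hours=3)},
]
accepted, dead_letter, within_budget = run_gate(batch, contract, now)
```

In this sketch the clean record would be appended to the analytical table, the two rejected records would land in a dead-letter table with their reasons and the contract owner attached, and `within_budget` signals whether the batch stayed inside the error budget. Whether a blown budget pauses the pipeline or only raises an alert is a policy choice the contract owner makes.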
How it sits under Data Foundations
Data Foundations is the umbrella: the strategy, the source graph, the operating discipline. The Lakehouse is one of two delivery sub-services (the other is the LLM-Ready Knowledge Base). A common engagement needs both. The lakehouse holds structured truth (transactions, accounts, telemetry, events), the knowledge base holds unstructured truth (documents, contracts, tickets, conversations), and they share the lineage and access controls defined at the Foundations layer. Pointing AI at a lakehouse without a knowledge base, or vice versa, is a common reason teams hit a ceiling on what their agents can do.
Where we draw the line
Not a vendor pitch. Iceberg, Delta, and Hudi fit different access patterns; we recommend based on your reads, writes, and compute. Open table formats reduce lock-in: the same Parquet files can be queried by Spark, Trino, DuckDB, Snowflake, BigQuery, or Databricks, though support varies by format and engine. Switching engines is still adaptation work, not a one-line change. Ingestion contracts and quality gates are a continuous discipline, not a project that finishes.
When you should start
Signals:
- Analytical workloads sprawled across a warehouse, a few notebooks, and ad-hoc S3 buckets with no governance
- AI pilots each spinning up their own pipelines because no governed table substrate exists
- Streaming sources (events, IoT, app telemetry) that are not yet first-class citizens
- Premium warehouse storage paid for data that belongs on object storage
Common starting points:
- Consolidate analytical workloads onto a lakehouse so retrieval and BI share a source of truth
- Add ingestion contracts and lineage to an existing warehouse pipeline
- Run a 4-week proof on one high-value source migrated end-to-end with contracts, quality gates, and a retrieval index
Related learning
A navigable map of every system your data lives in — schemas, documents, code, tickets, events, owners, and permissions — so an AI agent can find the right source and respect the right access boundary.
Contracts, validation, lineage, freshness, and ownership for the data your AI reads from: an ongoing operating discipline, not a one-time cleanup project.
How an AI agent finds the right document, chunk, or row to ground its answer in — and why the part that matters is the pipeline around the database, not the database itself.
An architecture that combines data-lake economics (cheap object storage, open file formats) with warehouse guarantees (ACID transactions, schema evolution, time travel) so analytics, AI retrieval, and machine learning all read from the same trusted tables.
Explicit agreements between a data source and the systems that depend on it — what shape, how fresh, who owns it, what counts as broken — so pipeline failures become attributable instead of mysterious.