Service

Data Lake & Lakehouse

The storage and ingestion sub-service under Data Foundations. A lakehouse on open table formats with streaming and CDC ingestion, lineage, dead-letter handling, and retrieval indexes. One substrate for analytics, AI, and operational tools.

Productized as

Library Q&A
Library MCP
Library Watch

Same engine — packaged for a faster start

What it is

Data Lake & Lakehouse is the substrate where your tables live: Parquet files on cloud object storage (S3, GCS, or Azure Blob Storage), managed by an open table format (Iceberg, Delta Lake, or Hudi) that adds ACID transactions, schema evolution, and time travel while staying readable by any compute engine that supports the format. It is also the ingestion plumbing that fills those tables: batch loads, streaming sources, change data capture (CDC) from operational databases, and the contracts that keep all of it honest.
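As a rough illustration of what a table format adds on top of bare Parquet files, the toy model below keeps an append-only log of snapshots, each pointing at an immutable set of data files. A commit adds a new snapshot in one step, and time travel means reading an older one. The class, method, and file names are hypothetical. This is not the Iceberg, Delta Lake, or Hudi API; real formats persist this metadata on object storage and coordinate concurrent writers.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    schema: tuple[str, ...]        # column names valid for this snapshot
    data_files: frozenset[str]     # immutable Parquet file paths


@dataclass
class ToyTable:
    """Hypothetical in-memory stand-in for a table format's metadata log."""
    snapshots: list[Snapshot] = field(default_factory=list)

    def commit(self, add_files=(), remove_files=(), add_columns=()):
        # Build the next snapshot from the current one. Earlier snapshots
        # are never mutated, which is what makes time travel possible.
        current = self.snapshots[-1] if self.snapshots else Snapshot(0, (), frozenset())
        new = Snapshot(
            snapshot_id=current.snapshot_id + 1,
            schema=current.schema + tuple(add_columns),   # schema evolution: append columns
            data_files=(current.data_files - set(remove_files)) | set(add_files),
        )
        self.snapshots.append(new)  # one append stands in for an atomic commit
        return new.snapshot_id

    def files_as_of(self, snapshot_id=None):
        # Latest snapshot by default; an older id reads the table as it was.
        if snapshot_id is None:
            return self.snapshots[-1].data_files
        for snap in self.snapshots:
            if snap.snapshot_id == snapshot_id:
                return snap.data_files
        raise KeyError(f"unknown snapshot {snapshot_id}")


table = ToyTable()
v1 = table.commit(add_files=["part-0001.parquet"], add_columns=["id", "amount"])
v2 = table.commit(
    add_files=["part-0002.parquet"],
    remove_files=["part-0001.parquet"],
    add_columns=["currency"],
)
current_files = table.files_as_of()      # latest snapshot
earlier_files = table.files_as_of(v1)    # time travel to the first commit
```

Because data files are only ever added or dropped from a snapshot's file set, never edited in place, any engine that can read the metadata can read a consistent version of the table.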

Why a lakehouse, not a warehouse or a raw lake

A raw lake (files dumped on object storage with no schema discipline) tends to become a swamp, often within a year: nobody can find anything, queries are slow, and nothing is consistent. A traditional warehouse gives you correctness but ties you to one vendor's compute and struggles with the unstructured content AI needs. A lakehouse keeps the cheap, open storage of a lake and adds the warehouse's correctness on top: ACID transactions, schema evolution, time travel, and column statistics. The same files serve a BI tool, a notebook, and an agent's retrieval without copies or drift.

What we build

An open table format on your chosen object storage, sized to your workload. Ingestion paths that match the data's cadence: streaming via Kafka, Kinesis, or Pulsar; CDC via Debezium for operational databases; batch loads for daily sources. Each path runs through ingestion contracts (input shape, freshness target, owner, error budget) with quality gates that route bad records to a dead-letter table. Column- and row-level lineage with impact analysis. A catalog (Unity Catalog, Polaris, DataHub, or OpenMetadata) so the substrate is discoverable. Retrieval-ready transformations (chunking, embeddings, hybrid BM25 + dense indexes) sit alongside the analytical tables, so one source of truth powers both a dashboard and a citation.
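One way a hybrid BM25 + dense index can combine its two result lists is reciprocal rank fusion (RRF), sketched below. RRF is our choice for illustration; the text above does not say which fusion method is used, and the chunk IDs are made up. The constant k=60 is a commonly used default, not a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first lists of document IDs into one ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank), so items ranked highly
            # by both retrievers accumulate the largest scores.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["chunk-7", "chunk-2", "chunk-9"]    # hypothetical keyword results
dense_hits = ["chunk-2", "chunk-4", "chunk-7"]   # hypothetical embedding results
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
# fused == ["chunk-2", "chunk-7", "chunk-4", "chunk-9"]
```

RRF works on ranks rather than raw scores, so BM25 scores and embedding similarities do not need to be normalized onto a shared scale before fusing.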

Open table format (Iceberg, Delta, or Hudi) on object storage
Streaming (Kafka, Kinesis, Pulsar) and CDC (Debezium) ingestion
Ingestion contracts with named owners and freshness SLOs
Column- and row-level lineage with impact analysis
Quality gates at ingestion with dead-letter tables
Hybrid retrieval indexes (BM25 + dense embeddings)
Catalog and discovery (Unity Catalog, Polaris, DataHub, or OpenMetadata)
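The contract fields named above (input shape, freshness target, owner, error budget) and the dead-letter routing can be sketched in a few lines. This is a minimal, in-memory illustration under stated assumptions: the `IngestionContract` class, its field names, and the `orders_db` example are all hypothetical, and a real gate would run inside the streaming or batch pipeline and write rejects to a dead-letter table rather than a Python list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class IngestionContract:
    """Hypothetical contract shape; the field names are illustrative."""
    source: str
    owner: str
    required_fields: dict          # field name -> expected Python type
    event_time_field: str          # must be a datetime entry in required_fields
    freshness_target: timedelta    # max tolerated age of the newest record
    error_budget: float            # max tolerated fraction of rejected records


def shape_violation(contract, record):
    """Return a reason string if the record breaks the input shape, else None."""
    for name, expected_type in contract.required_fields.items():
        if name not in record:
            return f"missing field: {name}"
        if not isinstance(record[name], expected_type):
            return f"wrong type for {name}"
    return None


def apply_quality_gate(contract, records, now):
    """Split a batch into accepted rows and dead-letter rows, plus a report."""
    accepted, dead_letter = [], []
    for record in records:
        reason = shape_violation(contract, record)
        if reason is None:
            accepted.append(record)
        else:
            # Bad records are kept with a reason instead of being dropped.
            dead_letter.append({"source": contract.source, "reason": reason, "record": record})
    rejected = len(dead_letter) / len(records) if records else 0.0
    newest = max((r[contract.event_time_field] for r in accepted), default=None)
    report = {
        "owner": contract.owner,
        "within_error_budget": rejected <= contract.error_budget,
        "fresh": newest is not None and now - newest <= contract.freshness_target,
    }
    return accepted, dead_letter, report


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
contract = IngestionContract(
    source="orders_db",
    owner="payments-team",
    required_fields={"order_id": int, "amount": float, "updated_at": datetime},
    event_time_field="updated_at",
    freshness_target=timedelta(minutes=15),
    error_budget=0.5,
)
batch = [
    {"order_id": 1, "amount": 9.5, "updated_at": now - timedelta(minutes=5)},
    {"order_id": "2", "amount": 3.0, "updated_at": now},   # wrong type
    {"amount": 1.0, "updated_at": now},                    # missing field
]
accepted, dead_letter, report = apply_quality_gate(contract, batch, now)
```

In this example one record is accepted and two land in the dead-letter list with a reason attached. The batch is fresh but exceeds its error budget, and the report names the owner who should hear about it.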

How it sits under Data Foundations

Data Foundations is the umbrella: the strategy, the source graph, the operating discipline. The Lakehouse is one of two delivery sub-services (the other is the LLM-Ready Knowledge Base). A common engagement needs both. The lakehouse holds structured truth (transactions, accounts, telemetry, events), the knowledge base holds unstructured truth (documents, contracts, tickets, conversations), and they share the lineage and access controls defined at the Foundations layer. Pointing AI at a lakehouse without a knowledge base, or vice versa, is a common reason teams hit a ceiling on what their agents can do.

Where we draw the line

Not a vendor pitch. Iceberg, Delta, and Hudi fit different access patterns; we recommend based on your reads, writes, and compute. Open table formats reduce lock-in: the same Parquet files are queryable by Spark, Trino, DuckDB, Snowflake, BigQuery, or Databricks. Switching engines is still adaptation work, not a one-line change. Ingestion contracts and quality gates are a continuous discipline, not a project that finishes.

When you should start

Signals:

Analytical workloads sprawled across a warehouse, a few notebooks, and ad-hoc S3 buckets with no governance
AI pilots each spinning up their own pipelines because no governed table substrate exists
Streaming sources (events, IoT, app telemetry) that are not yet first-class citizens
Premium warehouse storage paid for data that belongs on object storage

Common starting points:

Consolidate analytical workloads onto a lakehouse so retrieval and BI share a source of truth
Add ingestion contracts and lineage to an existing warehouse pipeline
Run a 4-week proof on one high-value source migrated end-to-end with contracts, quality gates, and a retrieval index

Related learning

Source Graph

A navigable map of every system your data lives in — schemas, documents, code, tickets, events, owners, and permissions — so an AI agent can find the right source and respect the right access boundary.

Data Quality

Contracts, validation, lineage, freshness, and ownership for the data your AI reads from — not a one-time cleanup project, an ongoing operating discipline.

Vector Search

How an AI agent finds the right document, chunk, or row to ground its answer in — and why the part that matters is the pipeline around the database, not the database itself.

Lakehouse Architecture

An architecture that combines data-lake economics (cheap object storage, open file formats) with warehouse guarantees (ACID transactions, schema evolution, time travel) so analytics, AI retrieval, and machine learning all read from the same trusted tables.

Ingestion Contracts

Explicit agreements between a data source and the systems that depend on it — what shape, how fresh, who owns it, what counts as broken — so pipeline failures become attributable instead of mysterious.