Data substrate

Data Quality

Contracts, validation, lineage, freshness, and ownership for the data your AI reads from: an ongoing operating discipline, not a one-time cleanup project.

Operating principle

Production AI is not a prompt. It is a system of context, tools, permissions, traces, evals, and feedback loops.

The substrate determines the agent

If data is stale, ambiguous, unowned, or untrusted, agents amplify the problem at speed. A retrieval pipeline pulling from a table no one has owned in two years is a lawsuit waiting for a citation. Data quality is not cleanup work; it is operational risk management with a name on every source.

How we enforce it

Source contracts declare the input shape, freshness target, owner, and error budget for each ingestion path. Validation runs at write time: failed records land in a dead-letter table and trigger a notification instead of being silently dropped. Freshness, row-count, schema, and null-rate checks run on read at the lakehouse boundary. Quality regressions page the source owner, not the downstream consumer.
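A minimal sketch of the write-time half of that flow, assuming an in-memory stand-in for the dead-letter table. The names SourceContract, WriteResult, validate_on_write, and the notify callback are hypothetical and chosen for illustration; a real deployment would more likely express the same idea in the quality tooling already in use.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import timedelta
from typing import Any, Callable


@dataclass(frozen=True)
class SourceContract:
    """Hypothetical contract for one ingestion path."""
    source: str
    owner: str                          # who is notified when records fail
    required_fields: dict[str, type]    # input shape: field name -> expected type
    freshness_target: timedelta         # declared here, enforced by read-time checks
    error_budget: float                 # tolerated fraction of failed records per batch


@dataclass
class WriteResult:
    accepted: list[dict[str, Any]] = field(default_factory=list)
    dead_letter: list[dict[str, Any]] = field(default_factory=list)
    budget_exceeded: bool = False


def violations(record: dict[str, Any], contract: SourceContract) -> list[str]:
    """Return every way a record breaks the declared input shape."""
    problems = []
    for name, expected in contract.required_fields.items():
        if record.get(name) is None:
            problems.append(f"missing {name}")
        elif not isinstance(record[name], expected):
            problems.append(f"{name}: expected {expected.__name__}")
    return problems


def validate_on_write(
    records: list[dict[str, Any]],
    contract: SourceContract,
    notify: Callable[[str, str], None],
) -> WriteResult:
    """Split a batch into accepted rows and dead-letter rows; never drop silently."""
    result = WriteResult()
    for record in records:
        problems = violations(record, contract)
        if problems:
            # Keep the failed record together with the reasons it failed.
            result.dead_letter.append(
                {"source": contract.source, "record": record, "problems": problems}
            )
        else:
            result.accepted.append(record)

    if result.dead_letter:
        failed_fraction = len(result.dead_letter) / len(records)
        result.budget_exceeded = failed_fraction > contract.error_budget
        notify(
            contract.owner,
            f"{len(result.dead_letter)} of {len(records)} records from "
            f"{contract.source} failed validation",
        )
    return result
```

The failed record is stored next to the reasons it failed, so the owner can triage from the dead-letter table alone. Whether an exceeded error budget should block the whole batch is a policy choice this sketch leaves open.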

Source contracts with owner and freshness SLO
Schema, null-rate, and distribution checks
Dead-letter tables for failed records
Lineage that names the source owner on failure
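The read-time checks can be sketched in a few lines, assuming rows arrive as plain dictionaries and the load timestamp is known. ReadChecks, run_read_checks, and page_owner are hypothetical names, the thresholds are illustrative, and distribution checks are left out for brevity.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class ReadChecks:
    """Hypothetical read-time thresholds for one source."""
    source: str
    owner: str
    expected_columns: frozenset[str]
    max_age: timedelta        # freshness SLO
    min_rows: int
    max_null_rate: float      # per column, as a fraction of rows


def run_read_checks(
    rows: list[dict[str, Any]],
    last_loaded_at: datetime,
    checks: ReadChecks,
    now: Optional[datetime] = None,
) -> list[str]:
    """Return the names of failed checks; an empty list means the read is clean."""
    now = now or datetime.now(timezone.utc)
    failures = []

    if now - last_loaded_at > checks.max_age:
        failures.append("freshness")

    if len(rows) < checks.min_rows:
        failures.append("row_count")

    if rows:
        seen = set().union(*(row.keys() for row in rows))
        if checks.expected_columns - seen:
            failures.append("schema")
        # Sorted so the failure list is stable from run to run.
        for column in sorted(checks.expected_columns & seen):
            nulls = sum(1 for row in rows if row.get(column) is None)
            if nulls / len(rows) > checks.max_null_rate:
                failures.append(f"null_rate:{column}")

    return failures


def page_owner(
    checks: ReadChecks, failures: list[str], pager: Callable[[str, str], None]
) -> None:
    """Route a regression to the source owner, not to whoever happened to read."""
    if failures:
        pager(checks.owner, f"{checks.source} failed: {', '.join(failures)}")
```

Because the owner lives on the same object as the thresholds, a failing check already knows whom to page without a separate lookup.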

Tools that fit

dbt or SQLMesh for transformations with built-in tests; Great Expectations, Soda, or Monte Carlo for data quality; OpenLineage / Marquez or DataHub for lineage and ownership. The choice depends on the stack already in flight; the principle — ownership and contracts on every source — does not.

What it works with

Sits inside Data Foundations. Enforces the contracts declared in Ingestion Contracts. Feeds the Source Graph with freshness telemetry. Surfaces quality regressions to the source owner. When quality breaks, Vector Search and Agent Workflows downgrade or skip the affected source until the owner clears the alert.
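One way to picture the downgrade-or-skip behaviour is a small health registry that retrieval consults before ranking. Everything here is an assumption for illustration: the SourceHealth class, the two severities, the rule that "critical" means skip while anything else means downgrade, and the 0.5 score penalty.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    USE = "use"
    DOWNGRADE = "downgrade"
    SKIP = "skip"


@dataclass
class SourceHealth:
    """Hypothetical registry of open quality alerts, keyed by source name."""
    open_alerts: dict[str, str] = field(default_factory=dict)  # source -> severity

    def raise_alert(self, source: str, severity: str) -> None:
        self.open_alerts[source] = severity

    def clear_alert(self, source: str) -> None:
        # Called when the source owner resolves the regression.
        self.open_alerts.pop(source, None)

    def action_for(self, source: str) -> Action:
        severity = self.open_alerts.get(source)
        if severity is None:
            return Action.USE
        # Assumed policy: critical alerts remove the source, others demote it.
        return Action.SKIP if severity == "critical" else Action.DOWNGRADE


def filter_candidates(
    candidates: list[tuple[str, str, float]],
    health: SourceHealth,
    penalty: float = 0.5,
) -> list[tuple[str, str, float]]:
    """Take (source, doc_id, score) candidates and return them re-ranked by health."""
    kept = []
    for source, doc_id, score in candidates:
        action = health.action_for(source)
        if action is Action.SKIP:
            continue
        if action is Action.DOWNGRADE:
            score *= penalty
        kept.append((source, doc_id, score))
    return sorted(kept, key=lambda c: c[2], reverse=True)
```

Once the owner clears the alert, the source returns to normal ranking on the next query with no change on the consumer side.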

When you need it

Signals: dashboards regularly disagreeing with source systems; AI assistants citing stale documents; a quality issue diagnosed by Slack archaeology a week later; the same number appearing differently in three reports. Data quality is not cleanup — it is the operating discipline that keeps the substrate trustworthy.

Related resources

Source Graph

A navigable map of every system your data lives in — schemas, documents, code, tickets, events, owners, and permissions — so an AI agent can find the right source and respect the right access boundary.

Vector Search

How an AI agent finds the right document, chunk, or row to ground its answer in — and why the part that matters is the pipeline around the database, not the database itself.

Closed-Loop Knowledge

How an AI system gets durably better at its job — not by being smarter, but by routing every production failure into either a knowledge update, an eval case, a workflow patch, or a documented exception with a named owner.

Ingestion Contracts

Explicit agreements between a data source and the systems that depend on it — what shape, how fresh, who owns it, what counts as broken — so pipeline failures become attributable instead of mysterious.

Lakehouse Architecture

An architecture that combines data-lake economics (cheap object storage, open file formats) with warehouse guarantees (ACID transactions, schema evolution, time travel) so analytics, AI retrieval, and machine learning all read from the same trusted tables.