Data Quality
Contracts, validation, lineage, freshness, and ownership for the data your AI reads from. This is an ongoing operating discipline, not a one-time cleanup project.
Production AI is not a prompt. It is a system of context, tools, permissions, traces, evals, and feedback loops.
The substrate determines the agent
If data is stale, ambiguous, unowned, or untrusted, agents amplify the problem at speed. A retrieval pipeline pulling from a table no one has owned in two years is a lawsuit waiting for a citation. Data quality is not cleanup work; it is operational risk management with a name on every source.
How we enforce it
Source contracts declare the input shape, freshness target, owner, and error budget for each ingestion path. Validation runs at write time: failed records land in a dead-letter table and trigger a notification instead of being silently dropped. Freshness, row-count, schema, and null-rate checks run on read at the lakehouse boundary. Quality regressions page the source owner, not the downstream consumer.
- Source contracts with owner and freshness SLO
- Schema, null-rate, and distribution checks
- Dead-letter tables for failed records
- Lineage that names the source owner on failure
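A minimal sketch of the write-time path described above, assuming a contract can be represented as a small Python object. The names here (`SourceContract`, `write_batch`, the `notify` callback, the `crm.accounts` source) are hypothetical illustrations, not the API of any tool named on this page; a real deployment would more likely express the contract in dbt, Soda, or Great Expectations.

```python
from dataclasses import dataclass, field
from datetime import timedelta
from typing import Any, Callable


@dataclass(frozen=True)
class SourceContract:
    """Hypothetical contract for one ingestion path."""
    source: str
    owner: str                         # who is notified when records fail
    required_fields: dict[str, type]   # input shape: field name -> expected type
    freshness_target: timedelta        # maximum acceptable age of the data
    error_budget: float                # tolerated fraction of failed records


@dataclass
class WriteResult:
    accepted: list[dict[str, Any]] = field(default_factory=list)
    dead_letter: list[dict[str, Any]] = field(default_factory=list)
    budget_exceeded: bool = False


def validate_record(contract: SourceContract, record: dict[str, Any]) -> list[str]:
    """Return a list of contract violations for one record (empty if valid)."""
    errors = []
    for name, expected in contract.required_fields.items():
        if record.get(name) is None:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            errors.append(f"wrong type for field: {name}")
    return errors


def write_batch(
    contract: SourceContract,
    records: list[dict[str, Any]],
    notify: Callable[[str, str], None],
) -> WriteResult:
    """Validate at write time; failed records go to dead-letter, never dropped."""
    result = WriteResult()
    for record in records:
        errors = validate_record(contract, record)
        if errors:
            # Keep the record and the reasons so the owner can diagnose it.
            result.dead_letter.append(
                {"source": contract.source, "record": record, "errors": errors}
            )
        else:
            result.accepted.append(record)
    if result.dead_letter:
        failed = len(result.dead_letter)
        result.budget_exceeded = failed / len(records) > contract.error_budget
        # The notification goes to the source owner, not the consumer.
        notify(
            contract.owner,
            f"{contract.source}: {failed} of {len(records)} records sent to dead-letter",
        )
    return result


contract = SourceContract(
    source="crm.accounts",
    owner="crm-team@example.com",
    required_fields={"account_id": str, "amount": float},
    freshness_target=timedelta(hours=24),
    error_budget=0.01,
)
result = write_batch(
    contract,
    [{"account_id": "a1", "amount": 10.0}, {"account_id": "a2"}],
    notify=lambda owner, message: print(f"notify {owner}: {message}"),
)
```

In this sketch the second record is missing `amount`, so it is kept in `result.dead_letter` with its error list while the first record is accepted. The error budget only decides whether the batch counts as a regression; it never decides whether a bad record is discarded.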
Tools that fit
dbt or SQLMesh for transformations with built-in tests; Great Expectations, Soda, or Monte Carlo for data quality; OpenLineage / Marquez or DataHub for lineage and ownership. The choice depends on the stack already in flight; the principle — ownership and contracts on every source — does not.
What it works with
Sits inside Data Foundations. Enforces the contracts declared in Ingestion Contracts. Feeds the Source Graph with freshness telemetry. Surfaces quality regressions to the source owner. When quality breaks, Vector Search and Agent Workflows downgrade or skip the affected source until the owner clears the alert.
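One way the read-side behaviour could look, sketched under assumptions: a check at the read boundary raises an alert that names the owner, and consumers skip any source with an uncleared alert. `QualityAlert`, `check_on_read`, and `usable_sources` are hypothetical names for illustration, and the thresholds are made up; this is not how Vector Search or Agent Workflows are actually implemented.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Any, Optional


@dataclass
class QualityAlert:
    source: str
    owner: str            # the person or team paged for this source
    reasons: list[str]
    cleared: bool = False  # set to True by the owner once fixed


def check_on_read(
    source: str,
    owner: str,
    rows: list[dict[str, Any]],
    last_loaded_at: datetime,
    freshness_target: timedelta,
    column: str,
    max_null_rate: float,
    now: Optional[datetime] = None,
) -> Optional[QualityAlert]:
    """Run freshness, row-count, and null-rate checks; return an alert on failure."""
    now = now or datetime.now(timezone.utc)
    reasons = []
    if now - last_loaded_at > freshness_target:
        reasons.append("stale: freshness target missed")
    if not rows:
        reasons.append("empty: row count is zero")
    else:
        null_rate = sum(1 for r in rows if r.get(column) is None) / len(rows)
        if null_rate > max_null_rate:
            reasons.append(f"null rate {null_rate:.2f} on {column}")
    return QualityAlert(source, owner, reasons) if reasons else None


def usable_sources(candidates: list[str], alerts: list[QualityAlert]) -> list[str]:
    """Skip any source with an alert the owner has not cleared yet."""
    blocked = {a.source for a in alerts if not a.cleared}
    return [s for s in candidates if s not in blocked]


alert = check_on_read(
    source="crm.accounts",
    owner="crm-team@example.com",
    rows=[{"email": None}, {"email": "a@example.com"}],
    last_loaded_at=datetime(2024, 1, 1, tzinfo=timezone.utc),
    freshness_target=timedelta(hours=6),
    column="email",
    max_null_rate=0.1,
    now=datetime(2024, 1, 2, tzinfo=timezone.utc),
)
alerts = [alert] if alert else []
sources = usable_sources(["crm.accounts", "billing.invoices"], alerts)
```

Here `crm.accounts` fails both the freshness and null-rate checks, so only `billing.invoices` remains usable until the owner sets `cleared` on the alert. Skipping is the simplest policy; a downgrade (lower retrieval weight, or an answer flagged as possibly stale) would follow the same gate.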
When you need it
Signals: dashboards regularly disagreeing with source systems; AI assistants citing stale documents; a quality issue diagnosed by Slack archaeology a week later; the same number appearing differently in three reports. Data quality is not cleanup — it is the operating discipline that keeps the substrate trustworthy.
Related resources
A navigable map of every system your data lives in — schemas, documents, code, tickets, events, owners, and permissions — so an AI agent can find the right source and respect the right access boundary.
How an AI agent finds the right document, chunk, or row to ground its answer in — and why the part that matters is the pipeline around the database, not the database itself.
How an AI system gets durably better at its job — not by being smarter, but by routing every production failure into either a knowledge update, an eval case, a workflow patch, or a documented exception with a named owner.
Explicit agreements between a data source and the systems that depend on it — what shape, how fresh, who owns it, what counts as broken — so pipeline failures become attributable instead of mysterious.
An architecture that combines data-lake economics (cheap object storage, open file formats) with warehouse guarantees (ACID transactions, schema evolution, time travel) so analytics, AI retrieval, and machine learning all read from the same trusted tables.