Data substrate

Data Quality

Contracts, validation, lineage, freshness, and ownership for the data your AI reads from — not a one-time cleanup project, an ongoing operating discipline.

Operating principle

Production AI is not a prompt. It is a system of context, tools, permissions, traces, evals, and feedback loops.

The substrate determines the agent

If data is stale, ambiguous, unowned, or untrusted, agents amplify the problem at speed. A retrieval pipeline pulling from a table no one has owned in two years is a lawsuit waiting for a citation. Data quality is not cleanup work; it is operational risk management with a name on every source.

How we enforce it

Source contracts declare the input shape, freshness target, owner, and error budget for each ingestion path. Validation runs at write time: failed records hit a dead-letter table with a notification, not silent drops. Freshness, row-count, schema, and null-rate checks run on read at the lakehouse boundary. Quality regressions page the source owner, not the downstream consumer.

  • Source contracts with owner and freshness SLO
  • Schema, null-rate, and distribution checks
  • Dead-letter tables for failed records
  • Lineage that names the source owner on failure

Tools that fit

dbt or SQLMesh for transformations with built-in tests; Great Expectations, Soda, or Monte Carlo for data quality; OpenLineage / Marquez or DataHub for lineage and ownership. The choice depends on the stack already in flight; the principle — ownership and contracts on every source — does not.

What it works with

Sits inside Data Foundations. Enforces the contracts declared in Ingestion Contracts. Feeds the Source Graph with freshness telemetry. Surfaces quality regressions to the source owner. When quality breaks, Vector Search and Agent Workflows downgrade or skip the affected source until the owner clears the alert.

When you need it

Signals: dashboards regularly disagreeing with source systems; AI assistants citing stale documents; a quality issue diagnosed by Slack archaeology a week later; the same number appearing differently in three reports. Data quality is not cleanup — it is the operating discipline that keeps the substrate trustworthy.

Related resources