Data Foundations
One trusted set of data — for your dashboards, your AI, and your agents — with clear owners, freshness, and access controls from the moment data lands to the moment it's read.
What it is
Data Foundations is the work of giving your business one trustworthy set of data that every system can read — your dashboards, your AI assistants, your agents, your reports, and your operational tools. Today most organizations have the same information stored in three or four places: a warehouse for analytics, a separate copy for AI experiments, a few spreadsheets nobody owns, and the original systems where the data was created. Those copies drift, the numbers disagree, and nobody can say with confidence which one is right. We replace that with a single governed substrate — and the operating discipline that keeps it accurate, fresh, owned, and properly access-controlled. Once it exists, every future AI use case stops needing its own data pipeline.
Why it matters
AI is only as honest as the data underneath it. When an agent retrieves from a stale document, cites a number that doesn't match the dashboard, or reads from a table that the requesting user is not allowed to see, the fault is not in the model; it is in the data substrate. The most common reason AI pilots fail to reach production is not the model. It is that the team building the agent does not have a clean, governed, retrieval-ready source to point it at, so the pilot succeeds on a curated demo dataset and breaks on real operations. Data Foundations is what closes the pilot-to-production gap.
The umbrella, and its two sub-services
Data Foundations is the strategic umbrella. Two concrete delivery surfaces sit underneath it. Data Lake & Lakehouse is the storage and ingestion substrate — the open-format tables, the streaming pipelines, the contracts that keep data honest as it lands. LLM-Ready Knowledge Base is the governed corpus layer — the chunked, embedded, permission-aware retrieval surface that agents actually read from. Most engagements involve both: the lakehouse holds your structured truth, the knowledge base holds your documents, and they share lineage, ownership, and access controls so an agent or analyst points at one substrate, not five.
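To make "permission-aware" concrete, here is a minimal sketch in Python of how a document's access list might be copied onto every chunk and then enforced at read time. The `Chunk` shape, the group names, and the fixed-size splitter are illustrative assumptions, not a description of any particular product; embedding and ranking are left out so the permission step stays visible.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """One retrievable piece of a document (hypothetical shape)."""
    doc_id: str
    text: str
    allowed_groups: frozenset[str]  # copied from the source document's access list


def chunk_document(doc_id: str, text: str, allowed_groups: set[str],
                   size: int = 40) -> list[Chunk]:
    """Split text into fixed-size pieces; every piece inherits the document's access list."""
    acl = frozenset(allowed_groups)
    return [Chunk(doc_id, text[i:i + size], acl)
            for i in range(0, len(text), size)]


def readable_chunks(chunks: list[Chunk], user_groups: set[str]) -> list[Chunk]:
    """Keep only chunks the requesting user may read, before any ranking happens."""
    return [c for c in chunks if c.allowed_groups & user_groups]


# Illustrative corpus: one HR-only document and one readable by all staff.
corpus = (chunk_document("hr-policy", "x" * 100, {"hr"})
          + chunk_document("handbook", "y" * 50, {"hr", "staff"}))
staff_view = readable_chunks(corpus, {"staff"})
```

Filtering before ranking is the design choice that matters here: a chunk the user cannot read never becomes a candidate, so it cannot leak into a citation.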
- Source graph — who owns what, where it flows, who depends on it
- Operating discipline — ingestion contracts, freshness SLOs, named owners
- Lineage and quality — column-level lineage, automated impact analysis
- Retrieval-readiness — chunks, embeddings, permission propagation
- Two sub-services: Data Lake & Lakehouse, LLM-Ready Knowledge Base
- One substrate, queried by analytics, AI, and operational tools alike
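As a rough illustration of what a source graph with automated impact analysis can look like, the sketch below stores "who reads from whom" as a plain dictionary and walks it breadth-first. The node names and owners are invented for the example, and it works at table level; column-level lineage would use the same traversal over finer-grained nodes.

```python
from collections import deque

# Hypothetical source graph: each node maps to the consumers that read from it.
DOWNSTREAM = {
    "crm.accounts": ["lakehouse.accounts"],
    "lakehouse.accounts": ["dashboard.revenue", "kb.account_notes"],
    "kb.account_notes": ["agent.support"],
    "dashboard.revenue": [],
    "agent.support": [],
}

# Hypothetical named owners, one per node.
OWNERS = {
    "crm.accounts": "sales-ops",
    "lakehouse.accounts": "data-platform",
    "dashboard.revenue": "finance",
    "kb.account_notes": "knowledge-team",
    "agent.support": "support-eng",
}


def impacted(graph: dict[str, list[str]], changed: str) -> list[str]:
    """Return every downstream consumer of `changed`, nearest first."""
    seen = {changed}
    order: list[str] = []
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for consumer in graph.get(node, []):
            if consumer not in seen:
                seen.add(consumer)
                order.append(consumer)
                queue.append(consumer)
    return order


def owners_to_notify(graph: dict[str, list[str]], changed: str) -> set[str]:
    """Owners of everything that depends on the changed source."""
    return {OWNERS[node] for node in impacted(graph, changed)}
```

Calling `impacted(DOWNSTREAM, "crm.accounts")` lists the lakehouse table, the dashboard, the knowledge-base index, and the agent, which is the question a schema change in the CRM should answer before it ships.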
How it actually works
Every engagement starts with an inventory. We walk the operational systems already in use (warehouses, OLTP databases, document stores, ticketing, code repositories, observability) and produce a source graph: a navigable map of where information actually lives, who owns it, how it flows, and which downstream consumers depend on it. From that map we collapse duplicate ingestion paths, write source contracts (input shape, freshness target, owner, error budget, breakage policy), and decide which sub-services to stand up first. Quality gates run at ingestion; failed records go to a dead-letter table with an automatic notification to the source owner rather than being silently dropped. Retrieval-ready transformations sit alongside the analytical tables, so the same source of truth powers both the BI dashboard and the agent's citation.
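The contract and quality-gate steps above can be sketched in a few lines of Python. Everything named here (`SourceContract`, `run_gate`, `is_fresh`, the field names, the notification text) is a hypothetical illustration of the idea, not a specific tool's API; breakage policy is omitted, and a real gate would write to an actual dead-letter table and paging system rather than in-memory lists.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class SourceContract:
    source: str
    owner: str
    required_fields: dict[str, type]  # input shape: field name -> expected type
    freshness_target: timedelta       # maximum acceptable age of the last load
    error_budget: float               # maximum fraction of failed records per batch


@dataclass
class GateResult:
    accepted: list[dict] = field(default_factory=list)
    dead_letter: list[dict] = field(default_factory=list)
    notifications: list[str] = field(default_factory=list)


def run_gate(contract: SourceContract, records: list[dict]) -> GateResult:
    """Validate a batch; failures are kept in dead_letter, never dropped."""
    result = GateResult()
    for rec in records:
        problems = [name for name, typ in contract.required_fields.items()
                    if not isinstance(rec.get(name), typ)]
        if problems:
            result.dead_letter.append({"record": rec, "problems": problems})
        else:
            result.accepted.append(rec)
    if result.dead_letter:
        result.notifications.append(
            f"{contract.owner}: {len(result.dead_letter)} record(s) from "
            f"{contract.source} failed the contract")
    if records and len(result.dead_letter) / len(records) > contract.error_budget:
        result.notifications.append(
            f"{contract.owner}: error budget exceeded for {contract.source}")
    return result


def is_fresh(contract: SourceContract, last_loaded_at: datetime,
             now: datetime) -> bool:
    """True when the last successful load is within the freshness target."""
    return now - last_loaded_at <= contract.freshness_target


# Illustrative batch: one good record, one wrong type, one missing field.
contract = SourceContract("crm.accounts", "sales-ops",
                          {"id": int, "name": str}, timedelta(hours=1), 0.1)
batch = [{"id": 1, "name": "Acme"}, {"id": "2", "name": "Beta"}, {"id": 3}]
outcome = run_gate(contract, batch)
```

The point of the sketch is attribution: every rejected record carries the reason it failed, and every notification names the owner from the contract.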
What it works with
Data Foundations is the bottom of the stack — everything else reads from it. The AI Platform points its model gateway, retrieval pipelines, and MCP tool registry at the foundations to serve agents. Agent Workflows use it to gather context before they act, and write durable outcomes back to it. Conversation Intelligence lands its captured threads in the foundations so signals stay governed. Closed-Loop Knowledge updates flow back into the foundations as new source-graph nodes and refreshed indexes. Workflow Evals build their gold and regression sets from production traces stored here. If any of those layers is going to be reliable, this layer has to exist first.
What this is not
- Not a one-time migration project. Source ownership, freshness contracts, and quality gates are an operating discipline: we set the discipline up and transfer it; we don't run it forever.
- Not a data lake in the older sense: a lake without contracts and lineage becomes a swamp within a year.
- Not vendor lock-in: open table formats and a model-agnostic catalog mean you can change compute engines without rewriting the substrate.
- Not an AI-only project: the same foundations serve finance reporting, operations dashboards, and analyst self-service. The AI use cases just happen to be the ones that catch the failures first.
When you should start
Concrete signals that you need this work:
- A number on the executive dashboard regularly disagrees with the number in the underlying system.
- AI pilots succeed on demo data and stall when pointed at production.
- New AI use cases each require a net-new pipeline because no governed substrate exists.
- Data quality issues are diagnosed by Slack archaeology instead of by an owned alert.
- A regulator or customer has asked who can read which data and the answer is 'we'll check'.
Common starting points are standing up the lakehouse sub-service so analytics and AI share a source of truth, building the knowledge-base sub-service so unstructured content joins the same governance, or starting with the source graph alone, a 4-week diagnostic that maps the current state before deciding what to consolidate first.
Related learning
A capability in the Group e-media information AI stack. This resource connects the subject to data substrate, agent runtime, evals, and operations.
A company knowledge base built so an AI system can cite real answers from it — sourced from documents, tickets, code, conversations, and structured records; chunked, embedded, permissioned, evaluated, and kept fresh on AWS.
A navigable map of every system your data lives in — schemas, documents, code, tickets, events, owners, and permissions — so an AI agent can find the right source and respect the right access boundary.
Contracts, validation, lineage, freshness, and ownership for the data your AI reads from — not a one-time cleanup project, an ongoing operating discipline.
Explicit agreements between a data source and the systems that depend on it — what shape, how fresh, who owns it, what counts as broken — so pipeline failures become attributable instead of mysterious.