Learn

Closed-Loop Knowledge

How an AI system gets durably better at its job — not by being smarter, but by routing every production failure into either a knowledge update, an eval case, a workflow patch, or a documented exception with a named owner.

What it is

Closed-Loop Knowledge is the discipline that turns every production failure of an AI system into a tracked improvement instead of a forgotten anecdote. A failed answer is either evidence to update the knowledge base, a regression case for the eval set, a candidate workflow patch, or a documented exception with a named owner — and the failure rate becomes a metric the team manages over time. Without the loop, the same mistake happens twice, twenty times, indefinitely; with the loop, the system compounds: it gets quietly, measurably better each week.

Why it matters

Most AI systems don't improve in production. They improve when a model upgrade ships, or when an engineer does a quarterly tune-up. Between those events, they drift in whatever direction usage pushes them — usually toward edge cases the test set never covered. A closed loop changes the slope. The team's craft, the corner cases users actually encounter, the language they actually use, the documents they actually need — all of it becomes durable improvement rather than evaporated knowledge.

How it actually works

A weekly or daily review surfaces clusters of related failures: recurring questions the agent fails on, slow tool calls, low-confidence outputs that humans corrected, retrieved chunks that turned out to be the wrong source. Each cluster has a named owner from product, engineering, support, or ops, and the owner makes a decision: knowledge update (write or correct the underlying source), workflow change (a routing or prompt fix that goes through Benchmarks), eval addition (lock the case in so it cannot regress silently), or documented exception (we know about it; here's why we won't fix it now). Every decision is logged. The metric is the rate at which new failure clusters appear minus the rate at which existing clusters close, measured over the same review window; when that number stays negative, the backlog of open clusters is shrinking.
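As a rough illustration of the review record and the metric described above, here is a minimal Python sketch. The names (`Decision`, `FailureCluster`, `net_cluster_rate`), the fields, and the sample clusters are all hypothetical and not part of any existing system. The sketch also assumes that a documented exception counts as closing a cluster, which the text does not state.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import List, Optional


class Decision(Enum):
    """The four outcomes an owner can choose for a failure cluster."""
    KNOWLEDGE_UPDATE = "knowledge_update"
    WORKFLOW_CHANGE = "workflow_change"
    EVAL_ADDITION = "eval_addition"
    DOCUMENTED_EXCEPTION = "documented_exception"


@dataclass
class FailureCluster:
    """One logged cluster: what failed, who owns it, what was decided."""
    name: str
    owner: str                      # named owner (product, engineering, support, ops)
    opened: date
    decision: Optional[Decision] = None
    closed: Optional[date] = None   # None while the cluster is still open


def net_cluster_rate(clusters: List[FailureCluster], start: date, end: date) -> int:
    """Clusters opened minus clusters closed within [start, end].

    A negative result means more clusters closed than appeared.
    """
    opened = sum(1 for c in clusters if start <= c.opened <= end)
    closed = sum(
        1 for c in clusters
        if c.closed is not None and start <= c.closed <= end
    )
    return opened - closed


# Hypothetical sample data for one review week.
clusters = [
    FailureCluster("refund answers cite a stale policy doc", "support",
                   date(2024, 1, 2), Decision.KNOWLEDGE_UPDATE, date(2024, 1, 5)),
    FailureCluster("slow inventory tool call", "engineering",
                   date(2024, 1, 3)),
    FailureCluster("questions about a retired plan", "product",
                   date(2023, 12, 20), Decision.DOCUMENTED_EXCEPTION, date(2024, 1, 4)),
]

weekly_rate = net_cluster_rate(clusters, date(2024, 1, 1), date(2024, 1, 7))
```

In this sample week two clusters open and two close, so `weekly_rate` is 0: the backlog is holding steady rather than shrinking. Counting clusters per window is only one way to read "rate"; a team might prefer to weight clusters by the number of affected threads.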

What it works with

Reads input from Conversation Intelligence (clusters of bad threads), Workflow Evals (failures that block promotion), Observability (slow or unreliable spans), and Agent Workflows (human corrections in approval inboxes). Writes output back to Data Foundations (knowledge base updates, new sources), Benchmarks (regression cases), Agent Workflows (workflow changes), and Conversation Intelligence (resolved clusters are marked closed). The loop touches every other layer; it is the operational discipline that makes the rest of the system compound.
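The reads and writes above can be pictured as a small routing table. This is a sketch only: the layer names come from the text, while the dictionary keys, the `destination` helper, and the choice to leave documented exceptions unmapped are assumptions made for illustration.

```python
from typing import Optional

# Signals the loop reads, keyed by the layer that produces them.
READS_FROM = {
    "Conversation Intelligence": "clusters of bad threads",
    "Workflow Evals": "failures that block promotion",
    "Observability": "slow or unreliable spans",
    "Agent Workflows": "human corrections in approval inboxes",
}

# Where each kind of outcome is written back (hypothetical keys).
WRITES_TO = {
    "knowledge_update": "Data Foundations",
    "eval_addition": "Benchmarks",
    "workflow_change": "Agent Workflows",
    "cluster_closed": "Conversation Intelligence",
}


def destination(outcome: str) -> Optional[str]:
    """Return the layer an outcome is written to, or None if unmapped.

    The text names no destination layer for documented exceptions,
    so this sketch leaves them unmapped rather than guessing one.
    """
    return WRITES_TO.get(outcome)
```

For example, `destination("eval_addition")` returns `"Benchmarks"`, while `destination("documented_exception")` returns `None`, which a real system would presumably handle by recording the exception in its own decision log.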

Open questions we are studying

What is the smallest meaningful unit of a 'cluster' — a single thread, a hand-curated group, an embedding-similarity blob? When a fix lands, how do we verify it actually closed the cluster rather than displacing the failure elsewhere? How do we attribute compounding gains to the loop versus to model upgrades that happened in the same week? At what scale of organization does the loop need to fragment into per-team loops with their own owners?

Prior art and adjacent work

This work is adjacent to RLHF data pipelines and the broader work on production-derived eval sets. It engages with continuous-evaluation practice and the test-driven discipline of software engineering, and it borrows ownership-graph patterns from SRE incident response. Related research covers automated eval generation, regression-set growth, and online learning under safety constraints.