Evaluation

Self-Optimizing Agents

AI workflows that propose, score, and promote their own variants — prompts, models, retrieval policies, tool budgets, generated code — under measurable constraints instead of intuition or vendor leaderboards.

Operating principle

Production AI is not a prompt. It is a system of context, tools, permissions, traces, evals, and feedback loops.

How optimization works

An optimization loop proposes variants (swap a model, tighten retrieval, edit a prompt, restructure the graph, regenerate a code handler), runs them in parallel against the eval set, and scores each on the agreed axes. Candidates on the Pareto front, meaning no other variant is at least as good on every axis and strictly better on one, are surfaced for review. A human decides which to promote, and the result is a versioned change with traceable provenance.
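The selection step of that loop can be sketched in a few lines. This is a minimal illustration, not the product's implementation: the `Scored` record, its three axes (quality, latency, cost), and the sample numbers are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scored:
    """One scored variant. Field names and axes are illustrative assumptions."""
    name: str
    quality: float      # higher is better
    latency_ms: float   # lower is better
    cost_usd: float     # lower is better


def dominates(a: Scored, b: Scored) -> bool:
    """True if a is at least as good as b on every axis and strictly better on one."""
    no_worse = (
        a.quality >= b.quality
        and a.latency_ms <= b.latency_ms
        and a.cost_usd <= b.cost_usd
    )
    strictly_better = (
        a.quality > b.quality
        or a.latency_ms < b.latency_ms
        or a.cost_usd < b.cost_usd
    )
    return no_worse and strictly_better


def pareto_front(candidates: list[Scored]) -> list[Scored]:
    """Keep only the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Made-up scores for four hypothetical variants.
variants = [
    Scored("baseline", quality=0.82, latency_ms=1200, cost_usd=0.010),
    Scored("smaller-model", quality=0.80, latency_ms=700, cost_usd=0.004),
    Scored("tighter-retrieval", quality=0.85, latency_ms=1100, cost_usd=0.009),
    Scored("longer-prompt", quality=0.81, latency_ms=1400, cost_usd=0.012),
]

# These are the candidates a reviewer would see; promotion stays a human decision.
front = pareto_front(variants)
```

With these sample scores, the baseline and the longer prompt drop out because another variant beats them on every axis. The smaller model and the tighter retrieval policy both remain, since each wins on a different axis and choosing between them is a judgment call.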

What can change

Model routing, prompt policy, retrieval shape (chunking, hybrid weights, rerank depth, citation rules), tool budgets and timeouts, memory scope, node structure, and generated handler code. Each surface has its own search space and its own safety constraints.
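One way to picture "its own search space and its own safety constraints" is a table of allowed values per surface plus a list of predicates that every variant must pass before it is run. The surface names, values, and constraints below are hypothetical, picked only to make the sketch concrete.

```python
import itertools
from typing import Callable, Iterator

# Hypothetical search space: the values the optimizer may try on each surface.
SEARCH_SPACE: dict[str, list] = {
    "model": ["small", "large"],
    "rerank_depth": [5, 20],
    "tool_timeout_s": [5, 30, 120],
}

# Hypothetical safety constraints: a variant that fails any check is never run.
CONSTRAINTS: list[Callable[[dict], bool]] = [
    # Assumed operational limit on tool timeouts.
    lambda v: v["tool_timeout_s"] <= 60,
    # Assumed rule: the small model may not be paired with deep reranking.
    lambda v: not (v["model"] == "small" and v["rerank_depth"] > 10),
]


def enumerate_variants(
    space: dict[str, list],
    constraints: list[Callable[[dict], bool]],
) -> Iterator[dict]:
    """Yield every combination of surface values that satisfies all constraints."""
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        variant = dict(zip(keys, values))
        if all(check(variant) for check in constraints):
            yield variant


variants = list(enumerate_variants(SEARCH_SPACE, CONSTRAINTS))
```

Exhaustive enumeration only works for small spaces like this one. A real optimizer would more likely sample or search, but the constraint filter would sit in the same place: before any variant touches the eval set.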

Why it needs the eval set

Optimization without a stable eval set is drift with a budget attached. The system will improve on whatever it can measure and degrade on what it cannot. Building and maintaining the eval set — adding new regression cases, refreshing gold examples, recalibrating the LLM-as-judge — is the discipline that makes optimization safe.
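As one illustration of what recalibrating the LLM-as-judge can involve, the sketch below compares judge verdicts with human gold labels on the same cases and flags the judge when agreement falls below a threshold. The case ids, labels, and the 0.9 threshold are assumptions for the example, not recommended values.

```python
def judge_agreement(gold: dict[str, bool], judged: dict[str, bool]) -> float:
    """Fraction of shared cases where the judge verdict matches the human label."""
    shared = gold.keys() & judged.keys()
    if not shared:
        raise ValueError("no overlapping case ids between gold and judged labels")
    matches = sum(gold[case_id] == judged[case_id] for case_id in shared)
    return matches / len(shared)


# Hypothetical pass/fail labels: human reviewers (gold) vs. the LLM judge.
gold_labels = {"case-1": True, "case-2": False, "case-3": True, "case-4": True}
judge_labels = {"case-1": True, "case-2": True, "case-3": True, "case-4": True}

MIN_AGREEMENT = 0.9  # assumed threshold, tune per workflow

agreement = judge_agreement(gold_labels, judge_labels)
needs_recalibration = agreement < MIN_AGREEMENT
```

Here the judge passes a case the humans failed, so agreement is 3 out of 4 and the judge is flagged. Raw agreement is the simplest possible check; a team with imbalanced labels might prefer a chance-corrected statistic instead.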

Beyond prompt and program optimization

Prompt sweeps, few-shot bootstrapping, and program-structure search are the well-studied half of the problem. The surfaces we add to the search space are the ones that move latency, reliability, and dollars in production: graph shape, retrieval policy (chunking, hybrid weights, rerank depth), tool budgets and timeouts, model routing per step, and generated handler code — each governed by the same promotion gates as a hand-authored change.
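A promotion gate can be read as a set of thresholds that a variant's measured metrics must clear, whether the variant was generated or hand-authored. The sketch below assumes three axes and invented threshold values; a real gate would likely cover more, such as memory and safety.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    """Measured results for one variant on the eval set (illustrative fields)."""
    quality: float
    p95_latency_ms: float
    cost_per_run_usd: float


@dataclass(frozen=True)
class Gate:
    """Thresholds a variant must clear (illustrative fields and values)."""
    min_quality: float
    max_p95_latency_ms: float
    max_cost_per_run_usd: float


def gate_failures(metrics: Metrics, gate: Gate) -> list[str]:
    """Return a human-readable reason for every threshold the variant misses."""
    failures: list[str] = []
    if metrics.quality < gate.min_quality:
        failures.append(
            f"quality {metrics.quality:.2f} < {gate.min_quality:.2f}"
        )
    if metrics.p95_latency_ms > gate.max_p95_latency_ms:
        failures.append(
            f"p95 latency {metrics.p95_latency_ms:.0f} ms > "
            f"{gate.max_p95_latency_ms:.0f} ms"
        )
    if metrics.cost_per_run_usd > gate.max_cost_per_run_usd:
        failures.append(
            f"cost ${metrics.cost_per_run_usd:.3f} > "
            f"${gate.max_cost_per_run_usd:.3f}"
        )
    return failures


# Assumed thresholds and an assumed candidate that is better on quality but too slow.
gate = Gate(min_quality=0.80, max_p95_latency_ms=1500, max_cost_per_run_usd=0.020)
candidate = Metrics(quality=0.84, p95_latency_ms=1650, cost_per_run_usd=0.011)

failures = gate_failures(candidate, gate)
promotable = not failures
```

Returning the list of reasons rather than a bare boolean keeps the rejection traceable, so the same record can explain why a variant was held back.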

What it works with

Reads from Workflow Evals: the eval set is the contract the optimizer is held to. Calls through the AI Platform: variants are measured at real production cost and latency. Writes through Agent Workflows: promoted variants become new versions of the workflow graph. Connects to Closed-Loop Knowledge: when an optimization reveals a systematic gap, the gap becomes a tracked issue.

When you need it

Signals: a high-volume AI workflow where token costs justify continuous tuning; a quarterly model-landscape review where 'we should reconsider X' keeps appearing without anything being decided; an existing eval set that the team is not yet using to systematically score variants.

Related resources

Workflow Evals

The test suite your AI workflows have to pass before any change reaches users — measuring quality, latency, cost, and safety on real production data instead of vibes.

Model Routing

How an AI system decides which model to call for each step — based on privacy, cost, latency, quality, and what happens when a provider goes down.

Agent Observability

Trace-level visibility into every model call, retrieval, tool invocation, decision, approval, and failure inside an AI workflow — the substrate every other discipline (evals, optimization, governance) reads from.

Prompt and Model Diffs

Side-by-side measurement of a candidate prompt or model against the current production version on the same eval set — the unit of safe change in a serious AI workflow.

Promotion Gates

The thresholds an AI change must clear before it reaches production — quality, latency, cost, memory, safety — enforced by CI, not by hope.