Self-Optimizing Agents

AI workflows that propose, score, and promote their own variants — prompts, models, retrieval policies, tool budgets, generated code — under measurable constraints instead of intuition or vendor leaderboards.

How optimization works

An optimization loop proposes variants (swapping a model, tightening retrieval, editing a prompt, restructuring the graph, regenerating a code handler), runs them in parallel against the eval set, and scores each on the agreed axes. Pareto-dominant candidates are surfaced for review; a human decides which to promote, and the result is a versioned change with traceable provenance.
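The selection step above can be sketched in a few lines of Python. This is a minimal illustration, not the product's implementation: the `Scored` record, its three axes (quality, cost, latency), and the function names are all assumptions chosen for the example. It keeps every candidate that no other candidate beats or ties on all axes while being strictly better on at least one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scored:
    """Hypothetical score record for one variant (axes are assumptions)."""
    name: str
    quality: float    # higher is better
    cost_usd: float   # lower is better
    latency_s: float  # lower is better


def dominates(a: Scored, b: Scored) -> bool:
    """True if `a` is no worse than `b` on every axis and better on one."""
    no_worse = (
        a.quality >= b.quality
        and a.cost_usd <= b.cost_usd
        and a.latency_s <= b.latency_s
    )
    strictly_better = (
        a.quality > b.quality
        or a.cost_usd < b.cost_usd
        or a.latency_s < b.latency_s
    )
    return no_worse and strictly_better


def pareto_front(candidates: list[Scored]) -> list[Scored]:
    """Return the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]
```

Including the current baseline in the candidate list is one simple way to make sure a variant is only surfaced if the baseline does not already dominate it. The front is a shortlist for the human reviewer, not an automatic promotion.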

What can change

Model routing, prompt policy, retrieval shape (chunking, hybrid weights, rerank depth, citation rules), tool budgets and timeouts, memory scope, node structure, and generated handler code. Each surface has its own search space and its own safety constraints.
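As a rough illustration of "its own search space and its own safety constraints", the sketch below enumerates variants that change one surface at a time from a baseline and drops any value that fails that surface's constraint. The surface names, option values, and the timeout floor are hypothetical placeholders, not real configuration keys.

```python
from typing import Callable, Iterator

# Hypothetical baseline configuration for one workflow.
BASELINE = {
    "model": "model-a",
    "rerank_depth": 10,
    "chunk_tokens": 512,
    "tool_timeout_s": 30,
}

# Hypothetical search space: the options allowed for each surface.
SEARCH_SPACE = {
    "model": ["model-a", "model-b"],
    "rerank_depth": [5, 10, 20],
    "chunk_tokens": [256, 512],
    "tool_timeout_s": [15, 30],
}

# Hypothetical per-surface safety constraints (surface -> predicate).
CONSTRAINTS: dict[str, Callable[[object], bool]] = {
    # Assumed floor: never propose a tool timeout under 20 seconds.
    "tool_timeout_s": lambda value: value >= 20,
}


def single_change_variants(
    baseline: dict, space: dict, constraints: dict
) -> Iterator[dict]:
    """Yield configs that differ from the baseline on exactly one surface."""
    for surface, options in space.items():
        allowed = constraints.get(surface, lambda value: True)
        for value in options:
            if value == baseline[surface] or not allowed(value):
                continue
            yield {**baseline, surface: value}
```

Changing one surface per variant keeps each score difference attributable to a single change, at the cost of missing interactions between surfaces. A fuller search would combine changes once single-surface effects are understood.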

Why it needs the eval set

Optimization without a stable eval set is drift with a budget attached. The system will improve on whatever it can measure and degrade on what it cannot. Building and maintaining the eval set — adding new regression cases, refreshing gold examples, recalibrating the LLM-as-judge — is the discipline that makes optimization safe.

Beyond prompt and program optimization

Prompt sweeps, few-shot bootstrapping, and program-structure search are the well-studied half of the problem. The surfaces we add to the search space are the ones that move latency, reliability, and dollars in production: graph shape, retrieval policy (chunking, hybrid weights, rerank depth), tool budgets and timeouts, model routing per step, and generated handler code — each governed by the same promotion gates as a hand-authored change.
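A promotion gate of the kind mentioned above might look like the following sketch. The specific checks (no failed regression cases, quality not below baseline, a recorded human approver) and every field name are assumptions made for illustration; the real gates are whatever the team already applies to hand-authored changes.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Candidate:
    """Hypothetical record for a variant awaiting promotion."""
    name: str
    quality: float
    baseline_quality: float
    failed_regression_cases: list[str] = field(default_factory=list)
    approved_by: Optional[str] = None  # human reviewer, if any


def gate_failures(c: Candidate, min_quality_delta: float = 0.0) -> list[str]:
    """Return the reasons this candidate may not be promoted (empty if none)."""
    failures = []
    if c.failed_regression_cases:
        failures.append(
            "regression cases failed: " + ", ".join(c.failed_regression_cases)
        )
    if c.quality - c.baseline_quality < min_quality_delta:
        failures.append("quality below baseline threshold")
    if not c.approved_by:
        failures.append("no human approval recorded")
    return failures


def can_promote(c: Candidate) -> bool:
    """A candidate is promotable only when every gate passes."""
    return not gate_failures(c)
```

Returning the list of failed gates, rather than a bare boolean, gives the reviewer and the change record a reason for each rejection.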

What it works with

Reads from Workflow Evals: the eval set is the contract the optimizer is held to. Calls through the AI Platform: variants are measured at real production cost and latency. Writes through Agent Workflows: promoted variants become new versions of the workflow graph. Connects to Closed-Loop Knowledge: when an optimization reveals a systematic gap, the gap becomes a tracked issue.

When you need it

Signals: a high-volume AI workflow where token costs justify continuous tuning; a quarterly model-landscape review where 'we should reconsider X' keeps appearing without anything being decided; an existing eval set that the team is not yet using to systematically score variants.