Self-Optimizing Agents
An optimization loop that proposes variants of your AI workflows — different prompts, models, retrieval, tool budgets, even generated code — and only promotes the ones that improve under quality, latency, cost, and safety gates.
What it is
Self-Optimizing Agents is an optimization loop that sits next to your production AI workflow and constantly looks for ways to make it better — cheaper, faster, more accurate, more reliable — without you having to manually run experiments. It proposes candidate variants (swap the model on one step, shorten a prompt, deepen retrieval, regenerate a piece of handler code), measures each against the eval set, and surfaces only the variants that improve on at least one axis without regressing on any other. You stay in control: every promotion is reviewed and approved, and every change is traceable. What goes away is the busywork of running experiments by hand.
Why it matters
The AI landscape moves fast. A model that was the best choice in March becomes the second-best in June, and the cheapest one for half the work by September. A prompt that was tuned six months ago is probably wasting tokens. A retrieval depth set on day one is probably either too shallow for the cases that matter or too deep for the cases that don't. Without an optimization loop, this drift is paid for in money and quality every month. With one, the team captures the improvements the landscape keeps offering — quietly, continuously, and on a leash.
What we build
An optimization controller that proposes variants across the surfaces that actually move production metrics: model routing per step, prompt policy, retrieval shape (chunking, hybrid weights, rerank depth, citation rules), tool budgets and timeouts, memory scope, workflow graph structure, and the generated handler code that sits inside steps. The controller runs candidates in parallel against the eval set the Benchmarks service maintains, scores each on quality, p95 latency, dollar cost, memory footprint, tool reliability, and safety, and reports only the survivors that Pareto-dominate the current production version: better on at least one axis and no worse on any other. A human approves which to promote; the promotion goes through the same gated path as a hand-authored change.
- Per-step model routing optimization
- Prompt and few-shot variant search
- Retrieval policy tuning (chunk, hybrid, rerank)
- Tool budget and timeout sweeps
- Workflow graph mutation
- Generated handler-code variants
- Pareto-dominant survivor reporting
- Human-approved promotion path
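As a rough illustration of the survivor filter described above, the sketch below checks whether a candidate Pareto-dominates the production baseline. It is a minimal sketch, not the controller's actual implementation: the `Scores` record, its field names, the candidate names, and all numbers are assumptions made for the example, and only four of the six scoring axes are shown to keep it short.

```python
from dataclasses import dataclass


# Hypothetical score record; field names are illustrative, not a real schema.
@dataclass(frozen=True)
class Scores:
    quality: float         # higher is better
    p95_latency_ms: float  # lower is better
    cost_usd: float        # lower is better
    safety: float          # higher is better


def as_maximize(s: Scores) -> tuple[float, ...]:
    """Flip lower-is-better axes so every axis reads 'higher is better'."""
    return (s.quality, -s.p95_latency_ms, -s.cost_usd, s.safety)


def dominates(candidate: Scores, baseline: Scores) -> bool:
    """True if candidate is no worse on every axis and strictly better on one."""
    c, b = as_maximize(candidate), as_maximize(baseline)
    return all(x >= y for x, y in zip(c, b)) and any(x > y for x, y in zip(c, b))


def survivors(baseline: Scores, candidates: dict[str, Scores]) -> list[str]:
    """Names of candidates that Pareto-dominate the production baseline."""
    return [name for name, s in candidates.items() if dominates(s, baseline)]


# Made-up numbers for illustration only.
baseline = Scores(quality=0.82, p95_latency_ms=1200, cost_usd=0.010, safety=1.0)
candidates = {
    "smaller-model": Scores(0.82, 900, 0.004, 1.0),   # same quality, faster, cheaper
    "longer-prompt": Scores(0.85, 1500, 0.012, 1.0),  # better quality, slower, pricier
    "deeper-rerank": Scores(0.84, 1200, 0.010, 1.0),  # better quality, rest unchanged
}
print(survivors(baseline, candidates))  # ['smaller-model', 'deeper-rerank']
```

In practice eval scores are noisy, so a real comparison would likely need a tolerance or a significance test on each axis rather than the exact comparisons used here.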
How the optimization surface expands
Prompt and program optimization is the visible half of the problem — sweeping wording, few-shot examples, and program structure against a quality metric. The half we add covers the surfaces prompt optimization alone does not touch: workflow shape (which steps run in what order), generated handler code, retrieval policy, tool budgets and timeouts, and model routing per step — all bounded by promotion gates that include latency, memory, cost, and safety, not just task quality. An optimizer is only as honest as the signal it optimizes against. Without a stable eval set, optimization drifts toward whatever looks good on the most recent inputs and degrades on the workloads no one happened to test.
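To make the idea of multi-axis promotion gates concrete, here is a minimal sketch of a gate check that compares a candidate to the production baseline on each gated metric. The `Gate` type, the metric names, the 5% tolerances, and the sample numbers are all assumptions chosen for illustration; actual thresholds would come from the team's own gate configuration.

```python
from typing import NamedTuple


class Gate(NamedTuple):
    metric: str
    higher_is_better: bool
    max_regression: float  # allowed relative regression vs. baseline (0.0 = none)


# Hypothetical gate set; the tolerances are illustrative assumptions.
GATES = [
    Gate("quality", True, 0.0),
    Gate("p95_latency_ms", False, 0.05),
    Gate("memory_mb", False, 0.05),
    Gate("cost_usd", False, 0.0),
    Gate("safety", True, 0.0),
]


def failed_gates(baseline: dict[str, float], candidate: dict[str, float]) -> list[str]:
    """Return the metrics on which the candidate regresses beyond tolerance."""
    failures = []
    for gate in GATES:
        base, cand = baseline[gate.metric], candidate[gate.metric]
        if gate.higher_is_better:
            ok = cand >= base * (1 - gate.max_regression)
        else:
            ok = cand <= base * (1 + gate.max_regression)
        if not ok:
            failures.append(gate.metric)
    return failures


# Made-up numbers: the candidate improves quality but blows the latency gate.
baseline = {"quality": 0.82, "p95_latency_ms": 1200, "memory_mb": 512,
            "cost_usd": 0.010, "safety": 1.0}
candidate = {"quality": 0.86, "p95_latency_ms": 1900, "memory_mb": 520,
             "cost_usd": 0.009, "safety": 1.0}
print(failed_gates(baseline, candidate))  # ['p95_latency_ms']
```

A candidate that wins on task quality alone is still blocked here, which is the point of gating on more than one axis.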
What it works with
- Reads from Benchmarks: the eval set is the contract the optimizer is held to.
- Writes through Agent Workflows: promoted variants become new versions of the workflow graph.
- Calls through the AI Platform: variants go through the same gateway as production traffic, so they are measured at real production cost and latency.
- Feeds Closed-Loop Knowledge: when an optimization run reveals a systematic gap, the gap becomes a tracked issue with an owner.
- Connects to Skill Distillation: the skill documents we write for fresh agents can themselves be optimization targets (for example, which phrasing of an instruction actually causes an apprentice to succeed).
Where we draw the line
Not unattended self-modification: every promotion requires human approval. The optimizer proposes, the team disposes. Not a fine-tuning service: we do not modify model weights; we change how the system uses the models. Not a magic improvement: optimization is only as good as the eval set it runs against, which is why Benchmarks is the prerequisite. Not a one-time campaign: the loop runs quarterly or continuously as the model landscape moves.
When you should start
Signals: a high-volume workflow where token costs are visible enough to optimize for; a workflow whose quality is plateauing and team experimentation has stalled; a quarterly model-landscape review where 'we should reconsider X' keeps appearing without anything being decided; a workflow with a clear eval set but no systematic way of testing variants against it. Common starting points are cost optimization on a high-volume workflow with an existing eval set (sweep models, prompts, and retrieval for the cheapest variant that still passes), prompt and retrieval tuning on a workflow whose current weakness is well-characterized, or a research engagement where we build the eval set first and then optimize across multiple workflow surfaces.
Related learning
A capability in the Group e-media information AI stack. This resource connects the subject to data substrate, agent runtime, evals, and operations.
The test suite your AI workflows have to pass before any change reaches users — measuring quality, latency, cost, and safety on real production data instead of vibes.
Side-by-side measurement of a candidate prompt or model against the current production version on the same eval set — the unit of safe change in a serious AI workflow.
The thresholds an AI change must clear before it reaches production — quality, latency, cost, memory, safety — enforced by CI, not by hope.
How an AI system decides which model to call for each step — based on privacy, cost, latency, quality, and what happens when a provider goes down.