Service

Self-Optimizing Agents

An optimization loop that proposes variants of your AI workflows — different prompts, models, retrieval, tool budgets, even generated code — and only promotes the ones that improve under quality, latency, cost, and safety gates.

What it is

Self-Optimizing Agents is an optimization loop that sits next to your production AI workflow and constantly looks for ways to make it better — cheaper, faster, more accurate, more reliable — without you having to manually run experiments. It proposes candidate variants (swap the model on one step, shorten a prompt, deepen retrieval, regenerate a piece of handler code), measures each against the eval set, and surfaces only the variants that improve on at least one axis without regressing on any other. You stay in control: every promotion is reviewed and approved, and every change is traceable. What goes away is the busywork of running experiments by hand.
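The acceptance rule described above (better on at least one axis, worse on none) amounts to a Pareto-dominance test against the current baseline. The following is a minimal sketch of that test, not the service's actual implementation. The metric names, their directions, and the `tolerance` parameter are assumptions made for illustration.

```python
from typing import Dict

# Hypothetical metric names. True means higher is better, False means lower is better.
HIGHER_IS_BETTER: Dict[str, bool] = {
    "quality": True,
    "p95_latency_ms": False,
    "cost_usd": False,
    "safety_pass_rate": True,
}


def dominates(candidate: Dict[str, float], baseline: Dict[str, float],
              tolerance: float = 0.0) -> bool:
    """Return True if candidate is no worse on every axis and better on at least one."""
    strictly_better = False
    for metric, higher in HIGHER_IS_BETTER.items():
        # Normalize so that a positive delta always means "candidate is better".
        delta = candidate[metric] - baseline[metric]
        if not higher:
            delta = -delta
        if delta < -tolerance:
            return False  # regression on this axis disqualifies the candidate
        if delta > tolerance:
            strictly_better = True
    return strictly_better
```

In practice a nonzero `tolerance` would likely be needed so that measurement noise on the eval set is not counted as either an improvement or a regression.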

Why it matters

The AI landscape moves fast. A model that was the best choice in March becomes the second-best in June, and the cheapest one for half the work by September. A prompt that was tuned six months ago is probably wasting tokens. A retrieval depth set on day one is probably either too shallow for the cases that matter or too deep for the cases that don't. Without an optimization loop, this drift is paid for in money and quality every month. With one, the team captures the improvements the landscape keeps offering — quietly, continuously, and on a leash.

What we build

An optimization controller that proposes variants across the surfaces that actually move production metrics: model routing per step, prompt policy, retrieval shape (chunking, hybrid weights, rerank depth, citation rules), tool budgets and timeouts, memory scope, workflow graph structure, and the generated handler code that sits inside steps. The controller runs candidates in parallel against the eval set the Benchmarks service maintains, scores each on quality, p95 latency, dollar cost, memory footprint, tool reliability, and safety, and reports only the survivors that Pareto-dominate the current baseline: better on at least one of those axes and worse on none. A human approves which to promote; the promotion goes through the same gated path as a hand-authored change.

  • Per-step model routing optimization
  • Prompt and few-shot variant search
  • Retrieval policy tuning (chunk, hybrid, rerank)
  • Tool budget and timeout sweeps
  • Workflow graph mutation
  • Generated handler-code variants
  • Pareto-dominant survivor reporting
  • Human-approved promotion path
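One way to picture variant proposal across these surfaces is as single-surface mutations of the current configuration. This is a sketch under stated assumptions: the surface names, the option values, and the one-change-at-a-time strategy are all hypothetical. The page does not describe how the controller actually searches.

```python
from typing import Any, Dict, Iterator, List

# Hypothetical search space: each surface maps to the values worth trying.
SEARCH_SPACE: Dict[str, List[Any]] = {
    "model": ["model-a", "model-b"],
    "prompt_version": ["v1", "v2-short"],
    "rerank_depth": [5, 10, 20],
    "tool_timeout_s": [10, 30],
}


def propose_variants(baseline: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield configs that differ from the baseline on exactly one surface."""
    for surface, options in SEARCH_SPACE.items():
        for value in options:
            if value != baseline[surface]:
                # Copy the baseline and override a single surface.
                yield {**baseline, surface: value}


baseline = {
    "model": "model-a",
    "prompt_version": "v1",
    "rerank_depth": 10,
    "tool_timeout_s": 30,
}
variants = list(propose_variants(baseline))
```

Changing one surface per candidate keeps each measured difference attributable to a single change, at the cost of missing improvements that only appear when two surfaces change together.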

How the optimization surface expands

Prompt and program optimization is the visible half of the problem — sweeping wording, few-shot examples, and program structure against a quality metric. The half we add covers the surfaces prompt optimization alone does not touch: workflow shape (which steps run in what order), generated handler code, retrieval policy, tool budgets and timeouts, and model routing per step — all bounded by promotion gates that include latency, memory, cost, and safety, not just task quality. An optimizer is only as honest as the signal it optimizes against. Without a stable eval set, optimization drifts toward whatever looks good on the most recent inputs and degrades on the workloads no one happened to test.
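The promotion gates mentioned above can be sketched as a set of hard thresholds that a variant must clear before it is even shown to a reviewer. The gate names and limits below are hypothetical placeholders. Real limits would come from the workflow's own requirements.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Gate:
    metric: str
    limit: float
    higher_is_better: bool

    def passes(self, value: float) -> bool:
        # A floor for "higher is better" metrics, a ceiling for the rest.
        return value >= self.limit if self.higher_is_better else value <= self.limit


# Hypothetical thresholds, for illustration only.
GATES = [
    Gate("quality", 0.90, True),
    Gate("p95_latency_ms", 1500.0, False),
    Gate("peak_memory_mb", 512.0, False),
    Gate("cost_usd", 0.02, False),
    Gate("safety_pass_rate", 1.0, True),
]


def failed_gates(scores: Dict[str, float]) -> List[str]:
    """Return the metrics whose gate the variant fails; an empty list means eligible for review."""
    return [g.metric for g in GATES if not g.passes(scores[g.metric])]
```

Returning the list of failed gates, rather than a single boolean, lets a report explain why a variant was rejected, for instance a quality gain that was paid for with a latency regression.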

What it works with

  • Reads from Benchmarks: the eval set is the contract the optimizer is held to.
  • Writes through Agent Workflows: promoted variants become new versions of the workflow graph.
  • Calls through the AI Platform: variants go through the same gateway as production traffic, so they are measured at real production cost and latency.
  • Feeds Closed-Loop Knowledge: when an optimization run reveals a systematic gap, the gap becomes a tracked issue with an owner.
  • Connects to Skill Distillation: the skill documents we write for fresh agents can themselves be optimization targets (for example, which phrasing of an instruction actually causes an apprentice to succeed).

Where we draw the line

Not unattended self-modification: every promotion requires human approval; the optimizer proposes, the team disposes. Not a fine-tuning service: we do not modify model weights, we change how the system uses the models. Not a magic improvement: optimization is only as good as the eval set it runs against, which is why Benchmarks is the prerequisite. Not a one-time campaign: the loop runs quarterly or continuously as the model landscape moves.

When you should start

Signals: a high-volume workflow where token costs are visible enough to optimize for; a workflow whose quality is plateauing and team experimentation has stalled; a quarterly model-landscape review where 'we should reconsider X' keeps appearing without anything being decided; a workflow with a clear eval set but no systematic way of testing variants against it. Common starting points are cost optimization on a high-volume workflow with an existing eval set (sweep models, prompts, and retrieval for the cheapest variant that still passes), prompt and retrieval tuning on a workflow whose current weakness is well-characterized, or a research engagement where we build the eval set first and then optimize across multiple workflow surfaces.
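The cost-optimization starting point (the cheapest variant that still passes) reduces to a filter followed by a minimum. A minimal sketch, assuming each evaluated variant is summarized as a record with hypothetical `name`, `quality`, and `cost_usd` fields, and that "passes" means meeting a single quality floor:

```python
from typing import Any, Dict, List, Optional


def cheapest_passing(results: List[Dict[str, Any]],
                     min_quality: float) -> Optional[Dict[str, Any]]:
    """Return the lowest-cost result that meets the quality floor, or None if none do."""
    passing = [r for r in results if r["quality"] >= min_quality]
    return min(passing, key=lambda r: r["cost_usd"], default=None)


# Hypothetical sweep results, for illustration only.
results = [
    {"name": "a", "quality": 0.92, "cost_usd": 0.030},
    {"name": "b", "quality": 0.91, "cost_usd": 0.010},
    {"name": "c", "quality": 0.80, "cost_usd": 0.005},
]
choice = cheapest_passing(results, min_quality=0.90)
```

Here variant `c` is the cheapest overall but falls below the quality floor, so the sketch selects `b`. A fuller version would apply the latency and safety gates as well before comparing cost.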
