Benchmarks
A test set built from your production traffic that catches AI regressions before they ship. Prompt or model changes have to pass quality, latency, cost, and safety thresholds before users see them. Requires production traces to draw from.
What it is
Benchmarks is the practice of building a test set from your real production traffic and running every candidate change against it before the change reaches users. If a developer wants to swap a model, edit a prompt, change how retrieval works, or modify a workflow, the change first has to pass a battery of measurements: did quality hold or improve, did latency stay within budget, did cost stay within budget, did the safety classifiers still pass, and did any previously fixed bug come back? Only changes that clear all of those gates are allowed to ship. This is the difference between an AI system that improves predictably and one that gets quietly worse every Tuesday.
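The gates described above can be sketched as a single function. This is a minimal sketch, not a definitive implementation: the `RunMetrics` fields, the budget values, and the quality tolerance are all assumptions chosen for illustration, and a real gate would read them from your own configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunMetrics:
    """Aggregate scores for one variant over the eval set (hypothetical schema)."""
    quality: float             # mean task-quality score, 0.0 to 1.0
    p95_latency_ms: float      # 95th percentile latency across eval cases
    cost_usd: float            # total cost of running the eval set
    safety_pass_rate: float    # fraction of cases passing safety classifiers
    regressions_reopened: int  # previously fixed bugs that fail again


def gate_failures(baseline: RunMetrics, candidate: RunMetrics,
                  latency_budget_ms: float = 2000.0,   # assumed budget
                  cost_budget_usd: float = 5.0,        # assumed budget
                  quality_tolerance: float = 0.01) -> list:
    """Return the reasons a candidate is blocked; an empty list means it may ship."""
    failures = []
    if candidate.quality < baseline.quality - quality_tolerance:
        failures.append("quality regressed")
    if candidate.p95_latency_ms > latency_budget_ms:
        failures.append("latency over budget")
    if candidate.cost_usd > cost_budget_usd:
        failures.append("cost over budget")
    if candidate.safety_pass_rate < baseline.safety_pass_rate:
        failures.append("safety pass rate dropped")
    if candidate.regressions_reopened > 0:
        failures.append("previously fixed bug came back")
    return failures


baseline = RunMetrics(0.91, 1400.0, 3.20, 1.0, 0)
cheaper_model = RunMetrics(0.86, 900.0, 1.10, 1.0, 2)
print(gate_failures(baseline, cheaper_model))
# ['quality regressed', 'previously fixed bug came back']
```

In this made-up example the cheaper model is faster and less expensive, but it is still blocked: being better on two axes does not offset a regression on another.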
Why it matters
Most AI quality regressions are invisible until a user complains. A small prompt edit that 'looked fine in testing' degrades a thousand workflows before anyone notices. A cheaper model that 'works in the demo' fails on edge cases the demo didn't include. A new retrieval policy that 'helped one example' breaks five others nobody re-checked. Without a test set built from real production data, every change is a small gamble. With one, changes either improve the metrics that matter or they don't ship.
What we build
We build three things. First, an eval set composed of gold examples drawn from production successes (with PII handling), regression cases seeded from production failures, synthetic adversarial cases for known edge classes, and a small human-rated calibration set used to validate the LLM-as-judge rubric. Second, a harness that runs candidate variants in parallel, scores each on multiple axes (task quality, p95 and p99 latency, token and dollar cost, memory footprint, tool reliability, safety classifier scores), and produces a diff against the current production version. Third, promotion gates that block any change that regresses, plus drift alerts that compare production behavior to the eval baseline on a schedule.
- Eval sets built from production traces and human-labeled cases
- Prompt, model (hosted, local, or self-hosted), retrieval, and workflow diff testing
- LLM-as-judge with calibration against human raters
- Quality, latency, cost, and safety scoring
- Promotion gates enforced in CI
- Drift alerts on production vs eval baseline
- Regression cases added on every triaged failure
- Per-route attribution for cost and quality
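One way to check an LLM-as-judge rubric against the human-rated calibration set is to compute chance-corrected agreement between judge and human labels on the same cases. The sketch below uses Cohen's kappa as an illustrative choice; the source does not say which agreement statistic is used, and the labels and the 0.6 acceptance threshold are assumptions.

```python
from collections import Counter


def cohens_kappa(human: list, judge: list) -> float:
    """Chance-corrected agreement between two raters over the same cases."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    # Fraction of cases where the two raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    human_counts, judge_counts = Counter(human), Counter(judge)
    expected = sum(human_counts[label] * judge_counts[label]
                   for label in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical calibration labels for eight cases.
human_labels = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
judge_labels = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass"]

MIN_KAPPA = 0.6  # assumed acceptance threshold, not a standard
kappa = cohens_kappa(human_labels, judge_labels)
print(round(kappa, 2), kappa >= MIN_KAPPA)  # 0.75 True
```

If agreement falls below the chosen threshold, the rubric is revised and re-checked before judge scores are allowed to drive a gate.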
How it actually works
Every production workflow is instrumented so that traces capture everything the eval needs: event payload, retrieved context, tool calls, model outputs, and final result. Traces are triaged into gold (clear successes), regressions (clear failures), and ambiguous cases that get human labels. The eval set is itself a versioned artifact, so adding a case goes through review. The set acts as a contract between developers and operations: a developer proposes a change, the harness measures it, and a human approves or rejects it based on the diff. We build on Inspect AI (UK AISI's eval framework), OpenAI Evals, and Promptfoo when they fit, and write custom harnesses when the workflow shape demands it.
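The triage step can be sketched as follows. The `Trace` fields mirror the items listed above, but the `outcome_score` signal and both thresholds are hypothetical: the source does not specify how clear successes and clear failures are detected.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Bucket(Enum):
    GOLD = "gold"
    REGRESSION = "regression"
    NEEDS_HUMAN_LABEL = "needs_human_label"


@dataclass
class Trace:
    trace_id: str
    event_payload: dict
    retrieved_context: list
    tool_calls: list
    model_outputs: list
    final_result: str
    # Hypothetical 0..1 outcome signal (for example, user feedback); may be absent.
    outcome_score: Optional[float] = None


def triage(trace: Trace, gold_at: float = 0.9, regression_at: float = 0.3) -> Bucket:
    """Route a trace to gold, regression, or human labeling (assumed thresholds)."""
    if trace.outcome_score is None:
        return Bucket.NEEDS_HUMAN_LABEL
    if trace.outcome_score >= gold_at:
        return Bucket.GOLD
    if trace.outcome_score <= regression_at:
        return Bucket.REGRESSION
    return Bucket.NEEDS_HUMAN_LABEL  # ambiguous middle band


traces = [
    Trace("t1", {"ticket": 1}, ["doc A"], [], ["draft"], "resolved", 0.97),
    Trace("t2", {"ticket": 2}, [], [{"tool": "search"}], ["draft"], "escalated", 0.10),
    Trace("t3", {"ticket": 3}, ["doc B"], [], ["draft"], "resolved", 0.55),
    Trace("t4", {"ticket": 4}, [], [], ["draft"], "unknown"),
]
buckets = {t.trace_id: triage(t).value for t in traces}
print(buckets)
```

Anything without a clear signal goes to human labeling rather than being guessed into a bucket, which keeps questionable cases out of the gold set.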
What it works with
Reads from Data Foundations (gold and regression cases live in governed tables alongside their traces). Runs through the AI Platform (the harness calls models through the same gateway production does, so cost and latency numbers are real). Gates Agent Workflows (a workflow change does not promote until the eval set passes). Feeds Self-Optimizing Agents (variants that the optimizer produces are scored by the same harness). Receives input from Conversation Intelligence (failure clusters from production conversations become regression cases). The whole system gets quieter when the eval set gets stronger.
Where we draw the line
We measure your workflow on your data, not model capability on a public leaderboard. Pre-promotion eval catches the regressions that an A/B test would surface only after the damage is done. A/B testing stays for measuring user-facing impact, but it is too slow and too risky as the primary release gate. And the eval set is never one-and-done: it grows with the system, and the gates tighten as confidence in the measurements grows.
When you should start
Signals: prompt or model changes that ship straight to production with no measurement; quality regressions that are discovered only when users complain; the same bug appearing twice because nobody added it to a regression set; an AI assistant whose 'getting worse' or 'getting better' is settled by team perception rather than data. Common starting points: standing up a first eval set from the last 60 days of production traces and running a baseline; introducing promotion gates so prompt and model changes can no longer ship without a passing diff; or building drift alerts that compare production quality to the eval baseline weekly.
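A weekly drift check of the kind mentioned above can be very small. This sketch assumes production quality is sampled as per-request scores on the same 0 to 1 scale as the eval baseline; the function name, the allowed drop, and the minimum sample size are illustrative assumptions.

```python
from statistics import mean
from typing import Optional


def drift_alert(baseline_quality: float, production_scores: list,
                max_drop: float = 0.05,             # assumed tolerance
                min_samples: int = 30) -> Optional[str]:
    """Return an alert message if production quality fell below the baseline."""
    if len(production_scores) < min_samples:
        return None  # too little data this week to say anything
    drop = baseline_quality - mean(production_scores)
    if drop > max_drop:
        return f"quality drift: production is {drop:.3f} below the eval baseline"
    return None


# Hypothetical weekly samples of production quality scores.
print(drift_alert(0.90, [0.80] * 40))
print(drift_alert(0.90, [0.88] * 40))
```

The first call returns an alert message and the second returns `None`. A production version would more likely compare distributions than a single mean, but the shape is the same: a scheduled comparison against a fixed baseline.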
Related learning
The test suite your AI workflows have to pass before any change reaches users — measuring quality, latency, cost, and safety on real production data instead of vibes.
AI workflows that propose, score, and promote their own variants — prompts, models, retrieval policies, tool budgets, generated code — under measurable constraints instead of intuition or vendor leaderboards.
How an AI system decides which model to call for each step — based on privacy, cost, latency, quality, and what happens when a provider goes down.
Side-by-side measurement of a candidate prompt or model against the current production version on the same eval set — the unit of safe change in a serious AI workflow.
The thresholds an AI change must clear before it reaches production — quality, latency, cost, memory, safety — enforced by CI, not by hope.