Evaluation

Promotion Gates

The thresholds an AI change must clear before it reaches production — quality, latency, cost, memory, safety — enforced by CI, not by hope.

Operating principle

Production AI is not a prompt. It is a system of context, tools, permissions, traces, evals, and feedback loops.

What gates encode

A gate encodes the conditions a change must meet: a passing diff on the eval set (no regressions, quality at or above baseline); p95 latency, cost per request, memory and tool reliability, and safety classifier scores all within budget; and a behavior shift small enough not to need a release note or, if larger, one that ships with a release note attached.

Quality: no regressions, baseline or better
Latency: p95 within the workflow's budget
Cost: per-request budget enforced
Safety: classifier scores and refusal-rate bounds
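
The four checks above could be expressed as a single function that CI calls on an eval report. This is a minimal sketch, not a reference implementation: the names `GateBudgets`, `EvalReport`, and `check_gates`, the report fields, and every threshold value are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateBudgets:
    # All thresholds are illustrative assumptions, not recommended values.
    max_p95_latency_ms: float = 2000.0
    max_cost_per_request_usd: float = 0.02
    max_safety_score: float = 0.10  # hypothetical classifier score, lower is safer
    refusal_rate_bounds: tuple = (0.01, 0.05)


@dataclass(frozen=True)
class EvalReport:
    regressions: int  # eval cases that pass on baseline and fail on the candidate
    quality: float  # aggregate quality score for the candidate
    baseline_quality: float
    p95_latency_ms: float
    cost_per_request_usd: float
    safety_score: float
    refusal_rate: float


def check_gates(report: EvalReport, budgets: GateBudgets) -> list[str]:
    """Return the names of failed gates; an empty list means the change may proceed."""
    failures = []
    # Quality: no regressions, baseline or better.
    if report.regressions > 0 or report.quality < report.baseline_quality:
        failures.append("quality")
    # Latency: p95 within the workflow's budget.
    if report.p95_latency_ms > budgets.max_p95_latency_ms:
        failures.append("latency")
    # Cost: per-request budget enforced.
    if report.cost_per_request_usd > budgets.max_cost_per_request_usd:
        failures.append("cost")
    # Safety: classifier score and refusal-rate bounds.
    low, high = budgets.refusal_rate_bounds
    if report.safety_score > budgets.max_safety_score or not (low <= report.refusal_rate <= high):
        failures.append("safety")
    return failures
```

In CI, a non-empty result would typically fail the job, for example by exiting with a non-zero status and printing the failed gate names.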

Gate vs experiment

Gates block changes that would degrade production. Experiments measure changes that are already safe enough to ship to a slice of traffic. A gate that fails sends the change back to development; an experiment that fails sends the change back through the gate with new evidence about why it should have failed earlier.
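
The routing described above can be sketched as a small state machine. The `Stage` names and both helper functions are hypothetical. Promoting a change to full production after a passing experiment is an assumption the text does not state, and folding an experiment's failing cases back into the eval set is one possible reading of "new evidence".

```python
from enum import Enum


class Stage(Enum):
    DEVELOPMENT = "development"
    GATE = "gate"
    EXPERIMENT = "experiment"
    PRODUCTION = "production"


def next_stage(stage: Stage, passed: bool) -> Stage:
    """Route a change after a gate run or an experiment."""
    if stage is Stage.GATE:
        # A failed gate sends the change back to development.
        return Stage.EXPERIMENT if passed else Stage.DEVELOPMENT
    if stage is Stage.EXPERIMENT:
        # A failed experiment sends the change back through the gate.
        # Assumption: a passing experiment promotes to full production.
        return Stage.PRODUCTION if passed else Stage.GATE
    raise ValueError(f"no pass/fail transition defined for {stage}")


def record_experiment_failure(eval_set: list[str], failing_cases: list[str]) -> list[str]:
    """Return a new eval set extended with the failing cases, skipping ones already present."""
    return eval_set + [case for case in failing_cases if case not in eval_set]
```

With this shape, a failed experiment both re-routes the change and leaves the gate stricter for the next attempt, since the re-run happens against the extended eval set.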

Tightening over time

Initial gates are usually loose because the eval set is small. As the eval set grows with cases drawn from production, gates tighten: more regression cases, lower latency budgets, tighter cost budgets. The gate is not the contract with users (the SLO is); the gate is the contract with the team shipping changes.
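
One way to make tightening mechanical is a ratchet: a budget moves toward observed performance plus some headroom and never loosens. The rule below is an illustrative assumption, not a policy this page prescribes. The function name `tighten` and the 20% default headroom are hypothetical.

```python
def tighten(current_budget: float, observed: float, headroom: float = 0.2) -> float:
    """Return a budget no looser than the current one.

    The candidate budget is the observed value plus a headroom fraction.
    Taking the minimum makes the budget a ratchet: it can tighten but never relax.
    """
    return min(current_budget, observed * (1 + headroom))
```

For example, with a 2000 ms latency budget and an observed p95 of 1200 ms, the new budget would be about 1440 ms. With an observed p95 of 1900 ms, the candidate (about 2280 ms) is looser than the current budget, so the budget stays at 2000 ms.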

Related resources

Prompt and Model Diffs

Side-by-side measurement of a candidate prompt or model against the current production version on the same eval set — the unit of safe change in a serious AI workflow.

Workflow Evals

The test suite your AI workflows have to pass before any change reaches users — measuring quality, latency, cost, and safety on real production data instead of vibes.

Self-Optimizing Agents

AI workflows that propose, score, and promote their own variants — prompts, models, retrieval policies, tool budgets, generated code — under measurable constraints instead of intuition or vendor leaderboards.