Evaluation

Promotion Gates

The thresholds an AI change must clear before it reaches production — quality, latency, cost, memory, safety — enforced by CI, not by hope.

Operating principle

Production AI is not a prompt. It is a system of context, tools, permissions, traces, evals, and feedback loops.

What gates encode

A gate encodes each requirement as a pass/fail check: a passing diff on the eval set (no regressions, quality at or above baseline); p95 latency within budget; cost per request within budget; memory and tool reliability within budget; safety classifier scores within budget; and a behavior shift small enough not to need a release note, or a release note attached if the shift is larger.

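  • Quality: no regressions, at or above baseline
  • Latency: p95 within the workflow's budget
  • Cost: per-request budget enforced
  • Memory and tools: reliability within budget
  • Safety: classifier scores and refusal-rate bounds
  • Behavior: shift small enough to skip a release note, or a release note attached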
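
The checks above can be sketched as a single function that a CI job calls. This is a minimal illustration, not a reference implementation: the class names, field names, and threshold values are all hypothetical, and it assumes a higher safety classifier score means more risk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budgets:
    """Thresholds for one workflow. All field names are hypothetical."""
    baseline_quality: float        # eval-set score of the production baseline
    p95_latency_ms: float
    cost_per_request_usd: float
    min_tool_success_rate: float
    max_safety_score: float        # assumption: higher score means more risk
    max_unannounced_shift: float   # behavior shift allowed without a release note


@dataclass(frozen=True)
class Candidate:
    """Metrics measured for the change under review."""
    quality: float
    regressions: int               # eval cases that passed before and fail now
    p95_latency_ms: float
    cost_per_request_usd: float
    tool_success_rate: float
    safety_score: float
    behavior_shift: float
    has_release_note: bool = False


def gate_failures(c: Candidate, b: Budgets) -> list[str]:
    """Return the name of every gate the candidate fails; empty means promote."""
    checks = [
        ("quality: regressions on the eval set", c.regressions == 0),
        ("quality: below baseline", c.quality >= b.baseline_quality),
        ("latency: p95 over budget", c.p95_latency_ms <= b.p95_latency_ms),
        ("cost: per-request over budget",
         c.cost_per_request_usd <= b.cost_per_request_usd),
        ("tools: reliability under budget",
         c.tool_success_rate >= b.min_tool_success_rate),
        ("safety: classifier score over budget",
         c.safety_score <= b.max_safety_score),
        ("behavior: shift needs a release note",
         c.behavior_shift <= b.max_unannounced_shift or c.has_release_note),
    ]
    return [name for name, ok in checks if not ok]


# Illustrative numbers only.
budgets = Budgets(
    baseline_quality=0.82,
    p95_latency_ms=1500.0,
    cost_per_request_usd=0.02,
    min_tool_success_rate=0.99,
    max_safety_score=0.10,
    max_unannounced_shift=0.05,
)
candidate = Candidate(
    quality=0.84,
    regressions=0,
    p95_latency_ms=1720.0,         # over the latency budget
    cost_per_request_usd=0.015,
    tool_success_rate=0.995,
    safety_score=0.04,
    behavior_shift=0.08,           # over the silent limit, but a note is attached
    has_release_note=True,
)
failures = gate_failures(candidate, budgets)
```

A CI step would exit non-zero whenever `failures` is non-empty, which is what makes the gate enforced rather than advisory. Returning every failed check, instead of stopping at the first, gives the author the full list in one run.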

Gate vs experiment

Gates block changes that would degrade production. Experiments measure changes that are already safe enough to ship to a slice of traffic. A gate that fails sends the change back to development; an experiment that fails sends the change back through the gate with new evidence about why it should have failed earlier.
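
The routing described above can be written as a small state transition. The sketch below uses hypothetical names throughout, and it makes one assumption the text does not state outright: that the "new evidence" from a failed experiment takes the form of new eval cases, so the gate can catch the same problem next time.

```python
from enum import Enum


class Stage(Enum):
    DEVELOPMENT = "development"
    GATE = "gate"
    EXPERIMENT = "experiment"
    PRODUCTION = "production"


def next_stage(stage: Stage, passed: bool) -> Stage:
    """Where a change goes after a gate run or an experiment concludes."""
    if stage is Stage.GATE:
        # A failed gate sends the change back to development.
        return Stage.EXPERIMENT if passed else Stage.DEVELOPMENT
    if stage is Stage.EXPERIMENT:
        # A failed experiment sends the change back through the gate.
        return Stage.PRODUCTION if passed else Stage.GATE
    raise ValueError(f"no pass/fail decision is made at stage {stage.value!r}")


def absorb_experiment_failure(eval_set: list[str], failing_cases: list[str]) -> list[str]:
    """Return a new eval set extended with the cases the experiment exposed.

    Order is preserved and duplicates are dropped, so re-running the gate
    exercises each case once.
    """
    return list(dict.fromkeys(eval_set + failing_cases))


eval_set = ["refund-policy-01", "tool-timeout-03"]
eval_set = absorb_experiment_failure(eval_set, ["long-thread-17", "tool-timeout-03"])
stage = next_stage(Stage.EXPERIMENT, passed=False)
```

After a failed experiment the change is back at the gate, and the gate now has one more case to check it against.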

Tightening over time

Initial gates are usually loose because the eval set is small. As the eval set grows with cases drawn from production, gates tighten: more regression cases, a lower latency budget, a tighter cost budget. The gate is not the contract with users (the SLO is); the gate is the contract with the team shipping changes.
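
One way to tighten a budget without ever loosening it is a ratchet. The sketch below is an assumption about how a team might do this, not a rule from this page: the function name, the `headroom` parameter, and the numbers are hypothetical, and it applies only to metrics where lower is better, such as latency or cost.

```python
def ratchet(current_budget: float, observed: float, headroom: float = 1.25) -> float:
    """Tighten a lower-is-better budget toward observed performance.

    The new budget is the observed value plus headroom, but never more
    than the current budget, so repeated calls can only tighten the gate.
    """
    if headroom < 1.0:
        raise ValueError("headroom below 1.0 would set a budget the system already misses")
    return min(current_budget, observed * headroom)


# Observed p95 of 1200 ms with 25% headroom tightens a 2000 ms budget.
tightened = ratchet(2000.0, 1200.0)

# A worse observation leaves the budget where it was.
unchanged = ratchet(tightened, 1800.0)
```

Keeping the ratchet one-directional means a bad week in production cannot quietly relax the gate; loosening a budget stays a deliberate, reviewed change.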
