Promotion Gates
The thresholds an AI change must clear before it reaches production — quality, latency, cost, memory, safety — enforced by CI, not by hope.
Production AI is not a prompt. It is a system of context, tools, permissions, traces, evals, and feedback loops.
What gates encode
Each gate encodes a set of pass conditions: a passing diff on the eval set (no regressions, quality at or above baseline); p95 latency within budget; cost per request within budget; memory and tool reliability within budget; safety classifier scores within budget; and a behavior shift small enough not to need a release note or, if larger, shipped with a release note attached.
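- Quality: no regressions, baseline or better
- Latency: p95 within the workflow's budget
- Cost: per-request budget enforced
- Memory and tools: reliability within budget
- Safety: classifier scores and refusal-rate bounds
- Behavior shift: small enough to need no release note, or shipped with one attached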
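The quality, latency, cost, and safety checks above can be sketched as a single function that CI runs against a candidate's eval report. This is a minimal illustration, not a reference implementation: the names `Budgets`, `CandidateReport`, and `check_gates`, the field names, and every number are hypothetical, and the refusal-rate bound is treated as an upper limit only.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Budgets:
    """Thresholds a candidate must clear (all values are illustrative)."""
    baseline_quality: float      # mean eval score of the production version
    p95_latency_ms: float        # latency budget for this workflow
    cost_per_request_usd: float  # cost budget per request
    min_safety_score: float      # floor for the safety classifier score
    max_refusal_rate: float      # upper bound on the refusal rate


@dataclass(frozen=True)
class CandidateReport:
    """Metrics measured for the candidate on the eval set."""
    quality: float
    regressions: int             # cases that passed on baseline but fail now
    p95_latency_ms: float
    cost_per_request_usd: float
    safety_score: float
    refusal_rate: float


def check_gates(report: CandidateReport, budgets: Budgets) -> List[str]:
    """Return one message per failed gate; an empty list means the change passes."""
    failures: List[str] = []
    if report.regressions > 0:
        failures.append(f"quality: {report.regressions} regressed eval case(s)")
    if report.quality < budgets.baseline_quality:
        failures.append("quality: below baseline")
    if report.p95_latency_ms > budgets.p95_latency_ms:
        failures.append("latency: p95 over budget")
    if report.cost_per_request_usd > budgets.cost_per_request_usd:
        failures.append("cost: per-request cost over budget")
    if report.safety_score < budgets.min_safety_score:
        failures.append("safety: classifier score below floor")
    if report.refusal_rate > budgets.max_refusal_rate:
        failures.append("safety: refusal rate above bound")
    return failures


# Hypothetical numbers: the candidate improves quality but is too slow.
budgets = Budgets(
    baseline_quality=0.82,
    p95_latency_ms=1500.0,
    cost_per_request_usd=0.02,
    min_safety_score=0.95,
    max_refusal_rate=0.05,
)
candidate = CandidateReport(
    quality=0.85,
    regressions=0,
    p95_latency_ms=1720.0,
    cost_per_request_usd=0.018,
    safety_score=0.97,
    refusal_rate=0.03,
)
failures = check_gates(candidate, budgets)
print(failures or "all gates passed")
```

In a CI job, a non-empty `failures` list would typically translate into a non-zero exit code so the change cannot merge until the gate passes.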
Gate vs experiment
Gates block changes that would degrade production. Experiments measure changes that are already safe enough to ship to a slice of traffic. A gate that fails sends the change back to development; an experiment that fails sends the change back through the gate with new evidence about why it should have failed earlier.
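The routing described above can be sketched as a small decision function. This is one possible reading under stated assumptions: `NextStep` and `next_step` are hypothetical names, "new evidence" is modeled as failing production cases appended to the eval set, and promotion after a passing experiment is an assumption the text does not spell out.

```python
from enum import Enum
from typing import List, Optional


class NextStep(Enum):
    BACK_TO_DEVELOPMENT = "back to development"
    RUN_EXPERIMENT = "run experiment on a traffic slice"
    RERUN_GATE = "rerun gate with new evidence"
    PROMOTE = "promote"  # assumed outcome of a passing experiment


def next_step(
    gate_passed: bool,
    experiment_passed: Optional[bool],  # None means no experiment has run yet
    eval_set: List[str],
    failing_cases: List[str],
) -> NextStep:
    """Decide where a change goes next, given gate and experiment results."""
    if not gate_passed:
        # A failed gate blocks the change before it sees any traffic.
        return NextStep.BACK_TO_DEVELOPMENT
    if experiment_passed is None:
        return NextStep.RUN_EXPERIMENT
    if experiment_passed:
        return NextStep.PROMOTE
    # A failed experiment means the gate missed something: record the
    # failing cases in the eval set so the gate can catch them next time.
    for case in failing_cases:
        if case not in eval_set:
            eval_set.append(case)
    return NextStep.RERUN_GATE
```

The design point is that an experiment failure is treated as a gap in the gate, so the eval set grows before the change is evaluated again.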
Tightening over time
Initial gates are usually loose because the eval set is small. As the eval set grows with cases drawn from production, gates tighten: more regression cases, lower latency budgets, tighter cost budgets. The gate is not the contract with users (the SLO is); the gate is the contract with the team shipping changes.
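One way to tighten budgets without ever loosening them is a ratchet: each budget moves toward what production actually achieves, plus some headroom, and never moves back up. The sketch below assumes that policy; the `ratchet` function, the 1.25 headroom factor, and all numbers are hypothetical and not taken from the text.

```python
def ratchet(current_budget: float, observed: float, headroom: float = 1.25) -> float:
    """Return a budget that is never looser than the current one.

    The new budget is the observed value times a headroom factor,
    capped at the current budget so the gate only ever tightens.
    """
    if headroom < 1.0:
        raise ValueError("headroom must be >= 1.0")
    return min(current_budget, observed * headroom)


# Hypothetical budgets and recent production measurements.
budgets = {"p95_latency_ms": 2000.0, "cost_per_request_usd": 0.05}
observed = {"p95_latency_ms": 1200.0, "cost_per_request_usd": 0.05}

tightened = {name: ratchet(budgets[name], observed[name]) for name in budgets}
print(tightened)
```

Here the latency budget drops because production is comfortably under it, while the cost budget stays put because observed cost already sits at the limit.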
Related resources
Side-by-side measurement of a candidate prompt or model against the current production version on the same eval set — the unit of safe change in a serious AI workflow.
The test suite your AI workflows have to pass before any change reaches users — measuring quality, latency, cost, and safety on real production data instead of vibes.
AI workflows that propose, score, and promote their own variants — prompts, models, retrieval policies, tool budgets, generated code — under measurable constraints instead of intuition or vendor leaderboards.