Prompt and Model Diffs
Side-by-side measurement of a candidate prompt or model against the current production version on the same eval set — the unit of safe change in a serious AI workflow.
Production AI is not a prompt. It is a system of context, tools, permissions, traces, evals, and feedback loops.
What a diff measures
A diff reports five things, measured for both versions on the same eval set (see the sketch below):
- Quality: rubric-scored or LLM-as-judge-scored against the gold set
- Regressions: any failure on the curated regression set that production currently passes
- Latency: the full distribution (p50, p95, p99), not just the mean
- Cost: input plus output tokens times price, per request
- Behavior shift: semantic similarity of outputs to the baseline; when outputs change shape, the change should be intentional
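A minimal sketch of how these five numbers might be computed from paired runs. The names here (`EvalResult`, `diff_report`, `cosine_similarity`, `price_per_token`) are illustrative assumptions, not any particular eval library's API.

```python
# Sketch: diff two eval runs (baseline = production, candidate = proposed change)
# over the same eval items. Field names are assumptions for illustration.
from dataclasses import dataclass


@dataclass
class EvalResult:
    item_id: str                    # same eval item, scored by both versions
    score: float                    # rubric or LLM-as-judge score against gold
    passed: bool                    # pass/fail on the curated regression set
    latency_ms: float
    input_tokens: int
    output_tokens: int
    output_embedding: list[float]   # used for behavior-shift similarity


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0


def percentile(values: list[float], p: float) -> float:
    # Simple nearest-rank percentile for p in {50, 95, 99}.
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1))))
    return ordered[k]


def diff_report(baseline: list[EvalResult], candidate: list[EvalResult],
                price_per_token: float) -> dict:
    # Pair results by item_id so both versions are measured on the same inputs.
    base_by_id = {r.item_id: r for r in baseline}
    pairs = [(base_by_id[c.item_id], c) for c in candidate if c.item_id in base_by_id]
    assert pairs, "both runs must cover the same eval set"

    cand_latencies = [c.latency_ms for _, c in pairs]
    return {
        "quality_delta": (
            sum(c.score for _, c in pairs) - sum(b.score for b, _ in pairs)
        ) / len(pairs),
        # An item that passes in production but fails in the candidate is a regression.
        "regressions": sum(1 for b, c in pairs if b.passed and not c.passed),
        "latency_ms": {p: percentile(cand_latencies, p) for p in (50, 95, 99)},
        "cost_per_request": sum(
            (c.input_tokens + c.output_tokens) * price_per_token for _, c in pairs
        ) / len(pairs),
        # Behavior shift: how far candidate outputs drift from baseline outputs.
        "behavior_shift": 1 - sum(
            cosine_similarity(b.output_embedding, c.output_embedding) for b, c in pairs
        ) / len(pairs),
    }
```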
Why diffs, not single scores
A score in isolation tells you 'better than nothing'. A diff against the current production version tells you whether to ship — including whether the new version is better in ways that matter and not worse in ways that hide. Diffs are the artifact a reviewer signs off on; scores are the input to the diff.
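A sketch of what "the diff decides whether to ship" can look like in CI, assuming the report shape from the hypothetical `diff_report` above. The function name and threshold defaults are assumptions; real gates belong in your own quality-gate config.

```python
# Sketch: a ship/no-ship decision driven by the diff, not a single score.
def should_ship(report: dict, baseline_cost: float = 0.0,
                max_regressions: int = 0,
                max_p95_latency_ms: float = 2000.0,
                max_cost_increase: float = 0.10) -> tuple[bool, list[str]]:
    reasons = []
    if report["quality_delta"] < 0:
        reasons.append("quality worse than production")
    if report["regressions"] > max_regressions:
        reasons.append(f"{report['regressions']} regressions on the curated set")
    if report["latency_ms"][95] > max_p95_latency_ms:
        reasons.append("p95 latency over budget")
    if baseline_cost and report["cost_per_request"] > baseline_cost * (1 + max_cost_increase):
        reasons.append("cost per request up more than 10%")
    return (not reasons, reasons)
```

The reviewer signs off on the reasons list, not on a single headline number.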
LLM-as-judge calibration
Judge models drift across versions and rubrics drift across reviewers. We calibrate the judge against a small human-rated set on every eval-set change, and we report inter-rater agreement so reviewers know how much to trust the score. A judge that disagrees with humans more than 15% of the time is not a judge; it is a tiebreaker that needs review itself.
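A minimal sketch of that calibration check, assuming binary pass/fail labels from both the judge and the human raters; the 15% threshold is the one stated above, and the function and key names are illustrative.

```python
# Sketch: calibrate an LLM judge against a small human-rated set.
def judge_agreement(human: list[bool], judge: list[bool]) -> dict:
    n = len(human)
    agreed = sum(1 for h, j in zip(human, judge) if h == j)
    agreement = agreed / n

    # Cohen's kappa corrects raw agreement for agreement expected by chance.
    p_chance = (sum(human) / n) * (sum(judge) / n) + \
               (1 - sum(human) / n) * (1 - sum(judge) / n)
    kappa = (agreement - p_chance) / (1 - p_chance) if p_chance < 1 else 1.0

    return {
        "agreement": agreement,
        "kappa": kappa,
        "trustworthy": (1 - agreement) <= 0.15,  # >15% disagreement: review the judge
    }
```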
Related resources
- Evals: the test suite your AI workflows have to pass before any change reaches users, measuring quality, latency, cost, and safety on real production data instead of vibes.
- Quality gates: the thresholds an AI change must clear before it reaches production (quality, latency, cost, memory, safety), enforced by CI, not by hope.
- Model routing: how an AI system decides which model to call for each step, based on privacy, cost, latency, quality, and what happens when a provider goes down.