Evaluation

Prompt and Model Diffs

Side-by-side measurement of a candidate prompt or model against the current production version on the same eval set — the unit of safe change in a serious AI workflow.
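
The core mechanic is a paired run: the same eval set goes through the current production version and the candidate, and every metric is computed over the pairs. A minimal sketch, assuming hypothetical EvalCase and Completion types and caller-supplied wrappers around the two versions:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    case_id: str
    prompt: str
    gold: str  # reference answer used for rubric or judge scoring

@dataclass
class Completion:
    output: str
    latency_s: float
    input_tokens: int
    output_tokens: int

def run_side_by_side(
    eval_set: list[EvalCase],
    baseline: Callable[[EvalCase], Completion],   # current production prompt/model
    candidate: Callable[[EvalCase], Completion],  # proposed change
) -> dict[str, tuple[Completion, Completion]]:
    """Run both versions over the same eval set so every metric is a paired comparison."""
    return {case.case_id: (baseline(case), candidate(case)) for case in eval_set}
```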

Operating principle

Production AI is not a prompt. It is a system of context, tools, permissions, traces, evals, and feedback loops.

What a diff measures

A diff measures five things, each against the current production baseline:

  • Quality on the gold set, rubric-scored or LLM-as-judge-scored
  • Regressions: any failure on the curated regression set
  • Latency distribution (p50, p95, p99)
  • Cost per request (input + output tokens times price)
  • Behavior shift: semantic similarity of outputs to the baseline; when outputs change shape, the change should be intentional

A sketch of computing these columns follows below.
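
Under the same assumptions as the paired run above, the five columns can be computed directly from its output. The token prices, the 1.0 pass threshold on the regression set, and the scorer and similarity callables are placeholders to swap for your own:

```python
import statistics
from statistics import quantiles

# Assumed per-token prices; substitute your model's actual rates.
PRICE_IN_USD = 3.0e-6
PRICE_OUT_USD = 1.5e-5

def latency_percentiles(latencies_s: list[float]) -> dict[str, float]:
    """p50 / p95 / p99 from per-request latencies in seconds."""
    cuts = quantiles(latencies_s, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

def diff_report(paired, gold, score, similarity, regression_ids):
    """Build the diff columns from the paired run sketched above.

    paired:         {case_id: (baseline_completion, candidate_completion)}
    gold:           {case_id: reference answer}
    score:          rubric or LLM-as-judge scorer, (output, gold) -> float in [0, 1]
    similarity:     semantic similarity of two outputs -> float in [0, 1]
    regression_ids: case ids from the curated regression set
    """
    base_scores = [score(b.output, gold[cid]) for cid, (b, _) in paired.items()]
    cand_scores = [score(c.output, gold[cid]) for cid, (_, c) in paired.items()]
    candidates = [c for _, c in paired.values()]
    return {
        # Quality: mean candidate score minus mean baseline score on the gold set.
        "quality_delta": statistics.mean(cand_scores) - statistics.mean(base_scores),
        # Regressions: any failure on the curated set (here, score below 1.0 counts as failure).
        "regressions": sum(
            1 for cid in regression_ids if score(paired[cid][1].output, gold[cid]) < 1.0
        ),
        "latency": latency_percentiles([c.latency_s for c in candidates]),
        # Cost: input + output tokens times price, averaged per request.
        "cost_per_request_usd": statistics.mean(
            c.input_tokens * PRICE_IN_USD + c.output_tokens * PRICE_OUT_USD
            for c in candidates
        ),
        # Behavior shift: low similarity to the baseline means outputs changed shape.
        "behavior_shift": statistics.mean(
            similarity(b.output, c.output) for b, c in paired.values()
        ),
    }
```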

Why diffs, not single scores

A score in isolation only tells you the candidate is better than nothing. A diff against the current production version tells you whether to ship: whether the new version is better in ways that matter and not worse in ways that are easy to miss. The diff is the artifact a reviewer signs off on; scores are the input to the diff.
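
Because the diff, not the score, is what gets signed off, it helps to reduce it to an explicit gate with named blockers. A sketch using the report structure above; the thresholds are illustrative assumptions, not universal defaults:

```python
def ship_gate(report: dict, baseline_p95_s: float, baseline_cost_usd: float):
    """Turn a diff report into a reviewable ship / no-ship decision with reasons."""
    blockers = []
    if report["regressions"] > 0:
        blockers.append(f"{report['regressions']} regression(s) on the curated set")
    if report["quality_delta"] < 0:
        blockers.append(f"quality delta {report['quality_delta']:+.3f} vs production")
    if report["latency"]["p95"] > 1.2 * baseline_p95_s:
        blockers.append("p95 latency more than 20% above production")
    if report["cost_per_request_usd"] > 1.2 * baseline_cost_usd:
        blockers.append("cost per request more than 20% above production")
    if report["behavior_shift"] < 0.8:
        blockers.append("outputs drifted far from baseline; confirm the shift is intentional")
    return len(blockers) == 0, blockers
```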

LLM-as-judge calibration

Judge models drift across versions and rubrics drift across reviewers. We calibrate the judge against a small human-rated set on every eval-set change, and we report inter-rater agreement so reviewers know how much to trust the score. A judge that disagrees with humans more than 15% of the time is not a judge; it is a tiebreaker that needs review itself.
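
One way to make the calibration step concrete: compare judge labels to the human-rated set and report raw agreement, flagging the judge when disagreement crosses the 15% line. A minimal sketch, assuming both sides label the same calibration cases with comparable values:

```python
def judge_agreement(human_labels: dict, judge_labels: dict) -> float:
    """Fraction of calibration cases where the LLM judge matches the human rating."""
    shared = human_labels.keys() & judge_labels.keys()
    return sum(human_labels[cid] == judge_labels[cid] for cid in shared) / len(shared)

def calibrate_judge(human_labels: dict, judge_labels: dict,
                    max_disagreement: float = 0.15) -> dict:
    """Report agreement on the human-rated set and flag a judge that disagrees too often."""
    agreement = judge_agreement(human_labels, judge_labels)
    return {
        "agreement": agreement,
        # A judge disagreeing with humans more than 15% of the time needs review itself.
        "trusted": (1.0 - agreement) <= max_disagreement,
    }
```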
