Evaluation

Prompt and Model Diffs

Side-by-side measurement of a candidate prompt or model against the current production version on the same eval set — the unit of safe change in a serious AI workflow.
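
The core mechanic is a paired run: the same eval set goes through the current production version and the candidate, and every metric is computed over the pairs. A minimal sketch, assuming hypothetical EvalCase and Completion types and caller-supplied wrappers around the two versions:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    case_id: str
    prompt: str
    gold: str  # reference answer used for rubric or judge scoring

@dataclass
class Completion:
    output: str
    latency_s: float
    input_tokens: int
    output_tokens: int

def run_side_by_side(
    eval_set: list[EvalCase],
    baseline: Callable[[EvalCase], Completion],   # current production prompt/model
    candidate: Callable[[EvalCase], Completion],  # proposed change
) -> dict[str, tuple[Completion, Completion]]:
    """Run both versions over the same eval set so every metric is a paired comparison."""
    return {case.case_id: (baseline(case), candidate(case)) for case in eval_set}
```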

Operating principle

Production AI is not a prompt. It is a system of context, tools, permissions, traces, evals, and feedback loops.

What a diff measures

A diff measures five things, each against the current production baseline:

  • Quality on the gold set, rubric-scored or LLM-as-judge-scored
  • Regressions: any failure on the curated regression set
  • Latency distribution (p50, p95, p99)
  • Cost per request (input + output tokens times price)
  • Behavior shift: semantic similarity of outputs to the baseline; when outputs change shape, the change should be intentional

A sketch of computing these columns follows below.
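
Under the same assumptions as the paired run above, the five columns can be computed directly from its output. The token prices, the 1.0 pass threshold on the regression set, and the scorer and similarity callables are placeholders to swap for your own:

```python
import statistics
from statistics import quantiles

# Assumed per-token prices; substitute your model's actual rates.
PRICE_IN_USD = 3.0e-6
PRICE_OUT_USD = 1.5e-5

def latency_percentiles(latencies_s: list[float]) -> dict[str, float]:
    """p50 / p95 / p99 from per-request latencies in seconds."""
    cuts = quantiles(latencies_s, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

def diff_report(paired, gold, score, similarity, regression_ids):
    """Build the diff columns from the paired run sketched above.

    paired:         {case_id: (baseline_completion, candidate_completion)}
    gold:           {case_id: reference answer}
    score:          rubric or LLM-as-judge scorer, (output, gold) -> float in [0, 1]
    similarity:     semantic similarity of two outputs -> float in [0, 1]
    regression_ids: case ids from the curated regression set
    """
    base_scores = [score(b.output, gold[cid]) for cid, (b, _) in paired.items()]
    cand_scores = [score(c.output, gold[cid]) for cid, (_, c) in paired.items()]
    candidates = [c for _, c in paired.values()]
    return {
        # Quality: mean candidate score minus mean baseline score on the gold set.
        "quality_delta": statistics.mean(cand_scores) - statistics.mean(base_scores),
        # Regressions: any failure on the curated set (here, score below 1.0 counts as failure).
        "regressions": sum(
            1 for cid in regression_ids if score(paired[cid][1].output, gold[cid]) < 1.0
        ),
        "latency": latency_percentiles([c.latency_s for c in candidates]),
        # Cost: input + output tokens times price, averaged per request.
        "cost_per_request_usd": statistics.mean(
            c.input_tokens * PRICE_IN_USD + c.output_tokens * PRICE_OUT_USD
            for c in candidates
        ),
        # Behavior shift: low similarity to the baseline means outputs changed shape.
        "behavior_shift": statistics.mean(
            similarity(b.output, c.output) for b, c in paired.values()
        ),
    }
```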

Why diffs, not single scores

A score in isolation only tells you the candidate is better than nothing. A diff against the current production version tells you whether to ship: whether the new version is better in ways that matter and not worse in ways that are easy to miss. The diff is the artifact a reviewer signs off on; scores are the input to the diff.
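
Because the diff, not the score, is what gets signed off, it helps to reduce it to an explicit gate with named blockers. A sketch using the report structure above; the thresholds are illustrative assumptions, not universal defaults:

```python
def ship_gate(report: dict, baseline_p95_s: float, baseline_cost_usd: float):
    """Turn a diff report into a reviewable ship / no-ship decision with reasons."""
    blockers = []
    if report["regressions"] > 0:
        blockers.append(f"{report['regressions']} regression(s) on the curated set")
    if report["quality_delta"] < 0:
        blockers.append(f"quality delta {report['quality_delta']:+.3f} vs production")
    if report["latency"]["p95"] > 1.2 * baseline_p95_s:
        blockers.append("p95 latency more than 20% above production")
    if report["cost_per_request_usd"] > 1.2 * baseline_cost_usd:
        blockers.append("cost per request more than 20% above production")
    if report["behavior_shift"] < 0.8:
        blockers.append("outputs drifted far from baseline; confirm the shift is intentional")
    return len(blockers) == 0, blockers
```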

LLM-as-judge calibration

Judge models drift across versions and rubrics drift across reviewers. We calibrate the judge against a small human-rated set on every eval-set change, and we report inter-rater agreement so reviewers know how much to trust the score. A judge that disagrees with humans more than 15% of the time is not a judge; it is a tiebreaker that needs review itself.
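
One way to make the calibration step concrete: compare judge labels to the human-rated set and report raw agreement, flagging the judge when disagreement crosses the 15% line. A minimal sketch, assuming both sides label the same calibration cases with comparable values:

```python
def judge_agreement(human_labels: dict, judge_labels: dict) -> float:
    """Fraction of calibration cases where the LLM judge matches the human rating."""
    shared = human_labels.keys() & judge_labels.keys()
    return sum(human_labels[cid] == judge_labels[cid] for cid in shared) / len(shared)

def calibrate_judge(human_labels: dict, judge_labels: dict,
                    max_disagreement: float = 0.15) -> dict:
    """Report agreement on the human-rated set and flag a judge that disagrees too often."""
    agreement = judge_agreement(human_labels, judge_labels)
    return {
        "agreement": agreement,
        # A judge disagreeing with humans more than 15% of the time needs review itself.
        "trusted": (1.0 - agreement) <= max_disagreement,
    }
```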
