Evaluation

Evals

Short for 'evaluations' — the test cases and harness that measure whether an AI workflow is working, before and after every change.

Operating principle

Production AI is not a prompt. It is a system of context, tools, permissions, traces, evals, and feedback loops.

What it is

Evals are the test suite for AI. Where traditional software has unit tests, integration tests, and CI, AI workflows have evals: a curated set of inputs with known-good outputs (or rubrics), plus the harness that runs candidate prompts, models, retrieval policies, or workflow shapes against the set and scores the results.
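As a rough illustration of the idea, an eval set can be as small as a list of inputs paired with known-good outputs, plus a function that runs a candidate over them and scores each result. The sketch below is only that: the names EvalCase, exact_match, run_eval, and toy_candidate are hypothetical, the candidate is a stand-in for a real model call, and exact-match scoring is assumed where a real suite might use a rubric.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One eval case: an input paired with its known-good output."""
    case_id: str
    input_text: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 when the output matches, ignoring case and outer whitespace."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(candidate: Callable[[str], str], cases: list[EvalCase]) -> dict[str, float]:
    """Run one candidate over every case and return a score per case id."""
    return {c.case_id: exact_match(candidate(c.input_text), c.expected) for c in cases}


# Hypothetical two-case eval set, for illustration only.
EVAL_SET = [
    EvalCase("capital-fr", "What is the capital of France?", "Paris"),
    EvalCase("sum", "What is 2 + 2?", "4"),
]


def toy_candidate(prompt: str) -> str:
    """Stand-in for a real prompt/model call; deliberately gets one case wrong."""
    answers = {"What is the capital of France?": "paris", "What is 2 + 2?": "5"}
    return answers.get(prompt, "")


scores = run_eval(toy_candidate, EVAL_SET)
pass_rate = sum(scores.values()) / len(scores)
print(scores, pass_rate)
```

Swapping toy_candidate for a different prompt, model, or retrieval policy and rerunning against the same EVAL_SET is what makes two variants comparable.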

Why it matters

Without evals, every AI change is a hopeful gamble. With evals, changes either pass the bar or they don't ship. The discipline is what separates AI teams that improve predictably from teams that ship regressions and discover them through customer complaints.

How it works

The eval set is built from four sources: gold examples (production successes), regressions (production failures), synthetic adversarial cases, and a human-rated calibration set. The harness runs candidates in parallel, scores each on quality, latency, cost, and safety, and produces a diff against the current baseline. Common tooling: Inspect AI (UK AISI), OpenAI Evals, Promptfoo, plus custom harnesses.
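The harness loop described above can be sketched in a few lines of standard-library Python. This is a minimal sketch under stated assumptions, not the API of any of the tools named: baseline, challenger, score_candidate, run_harness, and diff_reports are hypothetical names, the candidates are lookup tables standing in for model calls, and only quality and latency are scored (cost and safety would be extra fields in the same report).

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Candidate = Callable[[str], str]

# Hypothetical eval set: (input, known-good output) pairs.
CASES = [("2+2", "4"), ("3*3", "9"), ("10-7", "3")]


def baseline(prompt: str) -> str:
    """Stand-in for the current production variant; fails '10-7'."""
    return {"2+2": "4", "3*3": "9", "10-7": "4"}.get(prompt, "")


def challenger(prompt: str) -> str:
    """Stand-in for the proposed change; fixes '10-7' but breaks '3*3'."""
    return {"2+2": "4", "3*3": "6", "10-7": "3"}.get(prompt, "")


def score_candidate(candidate: Candidate, cases: list[tuple[str, str]]) -> dict:
    """Run every case, recording pass/fail and wall-clock latency per call."""
    passed: dict[str, bool] = {}
    latencies: list[float] = []
    for prompt, expected in cases:
        start = time.perf_counter()
        output = candidate(prompt)
        latencies.append(time.perf_counter() - start)
        passed[prompt] = output == expected
    return {
        "quality": sum(passed.values()) / len(passed),
        "latency_s": sum(latencies) / len(latencies),
        "passed": passed,
    }


def run_harness(candidates: dict[str, Candidate], cases: list[tuple[str, str]]) -> dict[str, dict]:
    """Score all candidates in parallel, one worker thread per candidate."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(score_candidate, fn, cases)
                   for name, fn in candidates.items()}
        return {name: future.result() for name, future in futures.items()}


def diff_reports(base: dict, cand: dict) -> dict:
    """Compare a candidate report against the baseline, case by case."""
    return {
        "quality_delta": cand["quality"] - base["quality"],
        "fixed": [p for p, ok in cand["passed"].items() if ok and not base["passed"][p]],
        "regressed": [p for p, ok in cand["passed"].items() if not ok and base["passed"][p]],
    }


reports = run_harness({"baseline": baseline, "challenger": challenger}, CASES)
report_diff = diff_reports(reports["baseline"], reports["challenger"])
print(report_diff)
```

In this toy run both variants have the same aggregate quality, yet the diff shows one case fixed and one regressed, which is why a per-case diff is more useful for a ship decision than a single score.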

Related resources

Evaluation Harness

The software that runs an eval set against a candidate AI variant and produces scores — quality, latency, cost, safety — for promotion decisions.

Workflow Evals

The test suite your AI workflows have to pass before any change reaches users — measuring quality, latency, cost, and safety on real production data instead of vibes.

LLM-as-Judge

Using one language model to score the output of another against a rubric — calibrated against human raters to make the scores trustworthy.

Promotion Gates

The thresholds an AI change must clear before it reaches production — quality, latency, cost, memory, safety — enforced by CI, not by hope.

Trace Grading

Scoring AI workflow traces — not just final outputs — to detect quality issues at the step level: bad retrievals, wrong tool calls, low-confidence reasoning.
