Evaluation

LLM-as-Judge

Using one language model to score the output of another against a rubric — calibrated against human raters to make the scores trustworthy.

Operating principle

Production AI is not a prompt. It is a system of context, tools, permissions, traces, evals, and feedback loops.

What it is

LLM-as-judge is the practice of using a model to evaluate model outputs. A judge model reads a candidate response and an expected response (or a rubric), and produces a score and a rationale. It's how eval sets scale beyond what humans can rate manually.
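As a rough illustration of the score-plus-rationale shape described above, here is a minimal sketch. Everything in it is an assumption made for the example: the `Verdict` type, the prompt wording, the 1-5 scale, and the `call_model` callable (a stand-in for whatever client actually sends the prompt to a judge model). It is not a specific vendor API.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    """What the judge returns for one candidate response (hypothetical shape)."""
    score: int
    rationale: str


# Hypothetical prompt; a real rubric would be far more specific.
PROMPT_TEMPLATE = """You are grading a response against a rubric.

Rubric:
{rubric}

Candidate response:
{candidate}

Reply with JSON only: {{"score": <integer 1-5>, "rationale": "<one or two sentences>"}}"""


def judge(candidate: str, rubric: str, call_model: Callable[[str], str]) -> Verdict:
    """Ask the judge model for a score and rationale, then validate the reply."""
    prompt = PROMPT_TEMPLATE.format(rubric=rubric, candidate=candidate)
    raw = call_model(prompt)  # call_model is an assumed wrapper around a model client
    data = json.loads(raw)
    score = int(data["score"])
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return Verdict(score=score, rationale=str(data["rationale"]))


def fake_model(prompt: str) -> str:
    """Canned reply standing in for a real judge model, so the sketch runs offline."""
    return json.dumps({"score": 4, "rationale": "Accurate but omits the refund deadline."})


verdict = judge(
    candidate="You can return items within 30 days.",
    rubric="Must state the return window and the refund deadline.",
    call_model=fake_model,
)
print(verdict.score, verdict.rationale)
```

Passing the model call in as a function keeps the judge logic testable without network access and makes it easy to swap judge models when calibration fails.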

Why it matters

Human rating is expensive and slow. LLM-as-judge is cheap and fast — but only useful if its scores match what a careful human would give. Calibration is what makes the difference. A judge model that disagrees with humans 20% of the time is not a judge; it's a tiebreaker that needs review itself.

How it works

We pair the judge with a small human-rated calibration set. The judge scores the same items, and we measure how often it agrees with the human labels, typically with Cohen's kappa (which corrects for agreement expected by chance) or a simpler raw agreement rate. If agreement is high enough, the judge is used on the larger eval set. If not, the rubric is revised or a different judge model is tried. Calibration is rerun on every eval-set change.
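The agreement check can be sketched in a few lines of plain Python. This is an illustrative sketch, not a prescribed implementation: the label values, the sample data, and the `KAPPA_THRESHOLD` of 0.7 are assumptions chosen for the example, and the right cutoff depends on the task.

```python
from collections import Counter


def agreement_rate(human, judge):
    """Fraction of items where the judge label equals the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human, judge):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(human)
    p_o = agreement_rate(human, judge)
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    labels = set(human_counts) | set(judge_counts)
    # Chance agreement: probability both raters pick the same label independently.
    p_e = sum(human_counts[label] * judge_counts[label] for label in labels) / (n * n)
    if p_e == 1.0:
        return 1.0  # convention used here: both raters used one identical label
    return (p_o - p_e) / (1 - p_e)


KAPPA_THRESHOLD = 0.7  # hypothetical cutoff, not a standard value


def judge_is_calibrated(human, judge, threshold=KAPPA_THRESHOLD):
    """Gate: only use the judge on the larger eval set if kappa clears the bar."""
    return cohens_kappa(human, judge) >= threshold


# Made-up calibration set: human labels vs. judge labels for the same 8 items.
human_labels = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "fail"]
judge_labels = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass"]

print(agreement_rate(human_labels, judge_labels))       # 0.75
print(cohens_kappa(human_labels, judge_labels))         # 0.5
print(judge_is_calibrated(human_labels, judge_labels))  # False
```

In this made-up set the judge matches the humans on 6 of 8 items, but half of that agreement would be expected by chance with balanced labels, so kappa is only 0.5 and the judge fails the gate. That gap is the reason to prefer kappa over raw agreement when label frequencies are skewed.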

Related resources

Evals

Short for 'evaluations' — the test cases and harness that measure whether an AI workflow is working, before and after every change.

Evaluation Harness

The software that runs an eval set against a candidate AI variant and produces scores — quality, latency, cost, safety — for promotion decisions.

Trace Grading

Scoring AI workflow traces — not just final outputs — to detect quality issues at the step level: bad retrievals, wrong tool calls, low-confidence reasoning.

Workflow Evals

The test suite your AI workflows have to pass before any change reaches users — measuring quality, latency, cost, and safety on real production data instead of vibes.
