LLM-as-Judge
Using one language model to score the output of another against a rubric — calibrated against human raters to make the scores trustworthy.
What it is
LLM-as-judge is the practice of using a model to evaluate model outputs. A judge model reads a candidate response and an expected response (or a rubric), and produces a score and a rationale. It's how eval sets scale beyond what humans can rate manually.
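As a rough illustration of that input/output contract, here is a minimal sketch. Everything in it is an assumption made for the example rather than a description of any particular product: the 1-5 score scale, the JSON reply format, the prompt wording, and the names `Verdict`, `build_judge_prompt`, `parse_verdict`, and `judge`. The `call_model` argument is a hypothetical stand-in for whatever client actually sends a prompt to the judge model.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    score: int       # assumed 1-5 scale, chosen only for this sketch
    rationale: str   # the judge's short explanation of the score


def build_judge_prompt(candidate: str, reference: str, rubric: str) -> str:
    """Assemble the text the judge model reads (hypothetical wording)."""
    return (
        "You are grading a response against a rubric.\n"
        f"Rubric: {rubric}\n"
        f"Expected response: {reference}\n"
        f"Candidate response: {candidate}\n"
        'Reply with JSON only: {"score": <integer 1-5>, "rationale": "<one sentence>"}'
    )


def parse_verdict(raw: str) -> Verdict:
    """Parse the judge's JSON reply and reject out-of-range scores."""
    data = json.loads(raw)
    score = int(data["score"])
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return Verdict(score=score, rationale=str(data["rationale"]))


def judge(candidate: str, reference: str, rubric: str,
          call_model: Callable[[str], str]) -> Verdict:
    """Run one judgment; call_model is a placeholder for a real model client."""
    return parse_verdict(call_model(build_judge_prompt(candidate, reference, rubric)))


# Usage with a canned reply standing in for a real judge model.
def fake_model(prompt: str) -> str:
    return '{"score": 4, "rationale": "Matches the reference but omits one detail."}'


verdict = judge("Paris is the capital.", "The capital of France is Paris.",
                "Answer must name the correct capital.", fake_model)
```

Passing the model call in as a function keeps the prompt and parsing logic testable without network access, which is convenient when the rubric is being revised.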
Why it matters
Human rating is expensive and slow. LLM-as-judge is cheap and fast — but only useful if its scores match what a careful human would give. Calibration is what makes the difference. A judge model that disagrees with humans 20% of the time is not a judge; it's a tiebreaker that needs review itself.
How it works
We pair the judge with a small human-rated calibration set. The judge scores the same items, and we measure how closely its scores agree with the human ratings, typically with Cohen's kappa (which corrects for agreement expected by chance) or a simpler raw agreement rate. If agreement is high enough, the judge is used on the larger eval set. If not, the rubric is revised or a different judge model is tried. Calibration is rerun on every eval-set change.
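The agreement check can be sketched in a few lines. This is a minimal example under stated assumptions, not a reference implementation: labels are treated as categorical (for instance "pass"/"fail"), the function names are made up for illustration, and the 0.6 kappa threshold is a placeholder that each team would set for itself.

```python
from collections import Counter
from typing import Hashable, Sequence


def agreement_rate(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Fraction of calibration items where judge and human gave the same label."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(human)
    p_o = agreement_rate(human, judge)  # observed agreement
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = h_counts.keys() | j_counts.keys()
    # Chance agreement: probability both raters pick the same label independently.
    p_e = sum(h_counts[label] * j_counts[label] for label in labels) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label for every item; kappa is
        # undefined here, so this sketch reports perfect agreement.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def judge_is_calibrated(human: Sequence[Hashable], judge: Sequence[Hashable],
                        min_kappa: float = 0.6) -> bool:
    """Gate: only use the judge on the larger eval set if kappa clears the bar.

    min_kappa=0.6 is an assumed placeholder, not a recommended value.
    """
    return cohens_kappa(human, judge) >= min_kappa


human_labels = ["pass", "pass", "fail", "fail"]
judge_labels = ["pass", "pass", "fail", "pass"]
print(agreement_rate(human_labels, judge_labels))       # 0.75
print(cohens_kappa(human_labels, judge_labels))         # 0.5
print(judge_is_calibrated(human_labels, judge_labels))  # False
```

In the toy data the judge agrees with humans on three of four items, yet kappa is only 0.5 because half of that agreement would be expected by chance. That gap is the reason to prefer kappa over raw agreement when labels are unevenly distributed.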
Related resources
Short for 'evaluations' — the test cases and harness that measure whether an AI workflow is working, before and after every change.
The software that runs an eval set against a candidate AI variant and produces scores — quality, latency, cost, safety — for promotion decisions.
Scoring AI workflow traces — not just final outputs — to detect quality issues at the step level: bad retrievals, wrong tool calls, low-confidence reasoning.
The test suite your AI workflows have to pass before any change reaches users — measuring quality, latency, cost, and safety on real production data instead of vibes.