Evaluation

LLM-as-Judge

Using one language model to score the output of another against a rubric — calibrated against human raters to make the scores trustworthy.

Operating principle

Production AI is not a prompt. It is a system of context, tools, permissions, traces, evals, and feedback loops.

What it is

LLM-as-judge is the practice of using a model to evaluate model outputs. A judge model reads a candidate response and an expected response (or a rubric), and produces a score and a rationale. It's how eval sets scale beyond what humans can rate manually.
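To make the input and output shape concrete, here is a minimal sketch of one judging call. Everything named in it is an assumption for illustration: the prompt wording, the 1 to 5 scale, the JSON reply format, and the `call_model` function, which stands in for whatever client sends a prompt to the judge model and returns its text. The `fake_model` stub exists only so the example runs without a network call.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    score: int       # assumed 1-5 scale; the rubric defines what each value means
    rationale: str   # the judge's short explanation for the score


def build_judge_prompt(rubric: str, candidate: str,
                       reference: Optional[str] = None) -> str:
    """Assemble the text the judge model reads (hypothetical wording)."""
    parts = ["You are grading a response against a rubric.",
             f"Rubric:\n{rubric}"]
    if reference is not None:
        parts.append(f"Expected response:\n{reference}")
    parts.append(f"Candidate response:\n{candidate}")
    parts.append('Reply with JSON only: '
                 '{"score": <integer 1-5>, "rationale": "<short explanation>"}')
    return "\n\n".join(parts)


def judge(call_model: Callable[[str], str], rubric: str, candidate: str,
          reference: Optional[str] = None) -> Verdict:
    """Run one judging call and parse the reply into a Verdict."""
    raw = call_model(build_judge_prompt(rubric, candidate, reference))
    data = json.loads(raw)  # raises ValueError if the judge did not return JSON
    score = int(data["score"])
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return Verdict(score=score, rationale=str(data["rationale"]))


def fake_model(prompt: str) -> str:
    """Stand-in for a real judge model, used only to make the sketch runnable."""
    return '{"score": 4, "rationale": "Correct, but omits one edge case."}'


verdict = judge(fake_model,
                rubric="Is the answer correct and complete?",
                candidate="Paris",
                reference="Paris")
print(verdict.score, verdict.rationale)
```

Rejecting malformed or out-of-range replies, rather than guessing at them, keeps bad judge output from silently entering the eval results.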

Why it matters

Human rating is expensive and slow. LLM-as-judge is cheap and fast, but it is only useful if its scores match what a careful human would give. Calibration is how that match is verified. A judge model that disagrees with humans 20% of the time is not a judge; it's a tiebreaker that needs review itself.

How it works

We pair the judge with a small human-rated calibration set. The judge scores the same items, and we measure how closely its scores agree with the human ones, typically with Cohen's kappa (which corrects for agreement expected by chance) or a simpler raw agreement rate. If agreement clears a threshold set in advance, the judge is used on the larger eval set. If not, the rubric is revised or a different judge model is tried. Calibration is rerun on every eval-set change.
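The agreement check can be sketched in a few lines. This is an illustration under stated assumptions, not a description of any particular pipeline: the labels are made up, the function names are hypothetical, and the 0.6 kappa threshold is an arbitrary placeholder for whatever bar a team sets in advance.

```python
from collections import Counter
from typing import Hashable, Sequence


def agreement_rate(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Fraction of items where the judge gave the same label as the human."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(human)
    p_o = agreement_rate(human, judge)
    h_counts, j_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label if each
    # labelled independently according to its own label frequencies.
    p_e = sum(h_counts[c] * j_counts[c] for c in h_counts) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is formally
        # undefined, and this sketch chooses to report it as 1.0.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def judge_is_calibrated(human, judge, min_kappa: float = 0.6) -> bool:
    """Gate: use the judge on the larger eval set only above the threshold."""
    return cohens_kappa(human, judge) >= min_kappa


# Made-up calibration labels for ten items.
human_labels = ["pass", "pass", "fail", "pass", "fail",
                "fail", "pass", "pass", "fail", "pass"]
judge_labels = ["pass", "pass", "fail", "fail", "fail",
                "fail", "pass", "pass", "pass", "pass"]

print(f"agreement rate: {agreement_rate(human_labels, judge_labels):.2f}")
print(f"Cohen's kappa:  {cohens_kappa(human_labels, judge_labels):.2f}")
print("calibrated:", judge_is_calibrated(human_labels, judge_labels))
```

In this made-up data the judge matches the humans on 8 of 10 items, yet kappa is about 0.58 because roughly half of that agreement would be expected by chance. That is why the raw agreement rate alone can overstate how trustworthy a judge is.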
