Evaluation

Evaluation Harness

The software that runs an eval set against a candidate AI variant and produces the scores (quality, latency, cost, safety) used for promotion decisions.

Operating principle

Production AI is not a prompt. It is a system of context, tools, permissions, traces, evals, and feedback loops.

What it is

An evaluation harness is the runner: it takes an eval set, a candidate (a prompt, a model, a workflow, a retrieval policy), and the scoring rubrics, then executes the eval in parallel and emits a structured report. The output is what the team uses to decide whether the candidate is better than what's in production.
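
As a rough illustration of the runner described above, here is a minimal sketch in Python. Every name in it (EvalCase, CaseResult, run_eval, exact_match) is hypothetical and not taken from any particular harness. The candidate is modeled as a plain callable, the rubric as a function returning a score between 0 and 1, and the report covers only quality and latency. A real harness would also track cost and safety.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import asdict, dataclass
from statistics import mean
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt: str
    expected: str


@dataclass(frozen=True)
class CaseResult:
    case_id: str
    quality: float    # rubric score in [0, 1]
    latency_s: float  # wall-clock time of the candidate call


def exact_match(output: str, expected: str) -> float:
    """Toy rubric (an assumption): 1.0 on an exact match, otherwise 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def run_eval(
    cases: list[EvalCase],
    candidate: Callable[[str], str],
    rubric: Callable[[str, str], float],
    max_workers: int = 4,
) -> dict:
    """Run every case against the candidate in parallel and build a report."""
    if not cases:
        raise ValueError("eval set is empty")

    def run_one(case: EvalCase) -> CaseResult:
        start = time.perf_counter()
        output = candidate(case.prompt)
        latency = time.perf_counter() - start
        return CaseResult(case.case_id, rubric(output, case.expected), latency)

    # Threads suit I/O-bound candidate calls; map() keeps results in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_one, cases))

    return {
        "n_cases": len(results),
        "mean_quality": mean(r.quality for r in results),
        "mean_latency_s": mean(r.latency_s for r in results),
        "cases": [asdict(r) for r in results],
    }


if __name__ == "__main__":
    demo_cases = [EvalCase("a", "hi", "HI"), EvalCase("b", "yo", "nope")]
    # Stand-in candidate: upper-cases the prompt instead of calling a model.
    print(run_eval(demo_cases, lambda prompt: prompt.upper(), exact_match))
```

Keeping per-case results in the report alongside the aggregates lets a reviewer trace a score change back to individual cases.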

Why it matters

Building an eval set is one half of the work. The other half is having a harness that runs it reliably, in parallel, with calibrated scoring and proper attribution. A great eval set with a flaky harness produces scores nobody trusts.

How it works

The harness reads the eval set from version control, then spawns parallel workers that call the candidate through the AI Platform gateway, so that measured cost and latency reflect production. It scores each output with a rubric or with an LLM-as-judge that has been calibrated against human ratings, and emits a structured diff against the baseline. Promotion gates consume that diff.
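
The last two steps, the diff against the baseline and the gate that consumes it, could look something like the Python sketch below. The metric names, their directions, the tolerance values, and the function names (diff_against_baseline, promotion_gate) are all assumptions chosen for illustration. The actual report schema and gate policy are not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricDiff:
    metric: str
    baseline: float
    candidate: float
    delta: float      # candidate minus baseline
    regressed: bool   # True if the candidate is worse by more than the tolerance


# Hypothetical metric set: True means a higher value is better.
HIGHER_IS_BETTER = {
    "quality": True,
    "safety": True,
    "latency_s": False,
    "cost_usd": False,
}


def diff_against_baseline(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: dict[str, float],
) -> list[MetricDiff]:
    """Compare aggregate scores metric by metric and flag regressions."""
    diffs = []
    for metric, higher_is_better in HIGHER_IS_BETTER.items():
        delta = candidate[metric] - baseline[metric]
        # Express the change as "how much worse", whatever the metric direction.
        worse_by = -delta if higher_is_better else delta
        regressed = worse_by > tolerance.get(metric, 0.0)
        diffs.append(
            MetricDiff(metric, baseline[metric], candidate[metric], delta, regressed)
        )
    return diffs


def promotion_gate(diffs: list[MetricDiff]) -> bool:
    """Allow promotion only if no metric regressed beyond its tolerance."""
    return not any(d.regressed for d in diffs)


if __name__ == "__main__":
    baseline = {"quality": 0.80, "safety": 0.99, "latency_s": 1.2, "cost_usd": 0.010}
    candidate = {"quality": 0.85, "safety": 0.99, "latency_s": 1.5, "cost_usd": 0.009}
    diffs = diff_against_baseline(baseline, candidate, {"latency_s": 0.1})
    for d in diffs:
        print(d)
    print("promote:", promotion_gate(diffs))
```

In this example the candidate improves quality and cost but is 0.3 seconds slower, which exceeds the 0.1 second tolerance, so the gate blocks promotion. Per-metric tolerances are one simple policy. A team might instead require a statistically significant quality gain or weight the metrics against each other.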

Related resources