Evaluation Harness
The software that runs an eval set against a candidate AI variant and produces scores — quality, latency, cost, safety — for promotion decisions.
Production AI is not a prompt. It is a system of context, tools, permissions, traces, evals, and feedback loops.
What it is
An evaluation harness is the runner: it takes an eval set, a candidate (a prompt, a model, a workflow, a retrieval policy), and the scoring rubrics, then executes the eval in parallel and emits a structured report. The output is what the team uses to decide whether the candidate is better than what's in production.
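As a rough illustration of what a "structured report" could contain, the sketch below rolls per-case results up into one summary record. The names CaseResult, Report, and summarize, and the specific fields, are assumptions made for this example; they do not describe any particular harness.

```python
from dataclasses import dataclass, asdict
from statistics import mean
import json


@dataclass(frozen=True)
class CaseResult:
    """Outcome of running one eval case (hypothetical fields)."""
    case_id: str
    quality: float     # rubric score in [0.0, 1.0]
    latency_ms: float  # wall-clock time for the candidate call
    cost_usd: float    # spend attributed to this case
    safe: bool         # whether the output passed the safety check


@dataclass(frozen=True)
class Report:
    """Aggregate scores for one candidate over the whole eval set."""
    candidate: str
    n_cases: int
    quality_mean: float
    latency_ms_mean: float
    cost_usd_total: float
    safety_pass_rate: float


def summarize(candidate: str, results: list[CaseResult]) -> Report:
    """Collapse per-case results into a single report."""
    if not results:
        raise ValueError("no results to summarize")
    return Report(
        candidate=candidate,
        n_cases=len(results),
        quality_mean=mean(r.quality for r in results),
        latency_ms_mean=mean(r.latency_ms for r in results),
        cost_usd_total=sum(r.cost_usd for r in results),
        safety_pass_rate=sum(r.safe for r in results) / len(results),
    )


def to_json(report: Report) -> str:
    """Serialize the report so downstream tooling can read it."""
    return json.dumps(asdict(report), sort_keys=True)
```

Keeping the report a flat, serializable record is one way to make it easy to store next to the candidate and to compare against the production baseline later.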
Why it matters
Building an eval set is one half of the work. The other half is having a harness that runs it reliably, in parallel, with calibrated scoring and proper attribution. A great eval set with a flaky harness produces scores nobody trusts.
How it works
The harness reads the eval set from version control, then spawns parallel workers that call the candidate through the AI Platform gateway, so that measured cost and latency reflect production. It applies rubric-based scoring or LLM-as-judge scoring calibrated against human raters, and emits a structured diff against the baseline. Promotion gates consume that diff.
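The flow above can be sketched in a few dozen lines. This is a minimal sketch under stated assumptions, not a reference implementation: the candidate is a plain Python callable standing in for a call through the gateway, exact-match scoring stands in for a rubric or LLM-as-judge, and the function names (run_case, run_eval, diff_against_baseline, passes_gate) are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Candidate = Callable[[str], str]  # stand-in for a gateway call (assumption)


def run_case(candidate: Candidate, case: dict) -> dict:
    """Run one case, timing the call and scoring the output."""
    start = time.perf_counter()
    output = candidate(case["input"])
    latency_ms = (time.perf_counter() - start) * 1000
    # Exact match is a placeholder for rubric or LLM-as-judge scoring.
    score = 1.0 if output.strip() == case["expected"] else 0.0
    return {"id": case["id"], "score": score, "latency_ms": latency_ms}


def run_eval(candidate: Candidate, cases: list[dict], workers: int = 4) -> list[dict]:
    """Run all cases on a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda c: run_case(candidate, c), cases))


def diff_against_baseline(baseline: list[dict], candidate: list[dict]) -> dict:
    """Compare per-case scores and report what moved."""
    base = {r["id"]: r["score"] for r in baseline}
    shared = [r for r in candidate if r["id"] in base]
    if not shared:
        raise ValueError("baseline and candidate share no cases")
    delta = sum(r["score"] - base[r["id"]] for r in shared) / len(shared)
    return {
        "mean_delta": delta,
        "regressions": [r["id"] for r in shared if r["score"] < base[r["id"]]],
        "improvements": [r["id"] for r in shared if r["score"] > base[r["id"]]],
    }


def passes_gate(diff: dict, min_delta: float = 0.0) -> bool:
    """Example gate: no per-case regressions and no drop in mean score."""
    return diff["mean_delta"] >= min_delta and not diff["regressions"]
```

A usage example with two toy cases and stub candidates:

```python
cases = [
    {"id": "a", "input": "2+2", "expected": "4"},
    {"id": "b", "input": "capital of France", "expected": "Paris"},
]
production = lambda q: "4" if q == "2+2" else "London"
proposed = lambda q: {"2+2": "4", "capital of France": "Paris"}[q]

diff = diff_against_baseline(run_eval(production, cases), run_eval(proposed, cases))
print(diff["improvements"], passes_gate(diff))
```

Threads suit this sketch because each worker mostly waits on a network call. The gate shown here is deliberately strict; a real gate would likely also check latency, cost, and safety thresholds.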
Related resources
Short for 'evaluations' — the test cases and harness that measure whether an AI workflow is working, before and after every change.
Using one language model to score the output of another against a rubric — calibrated against human raters to make the scores trustworthy.
The thresholds an AI change must clear before it reaches production — quality, latency, cost, memory, safety — enforced by CI, not by hope.
The test suite your AI workflows have to pass before any change reaches users — measuring quality, latency, cost, and safety on real production data instead of vibes.