Evaluation Harness

The software that runs an eval set against a candidate AI variant and produces scores — quality, latency, cost, safety — for promotion decisions.

Operating principle

Production AI is not a prompt. It is a system of context, tools, permissions, traces, evals, and feedback loops.

What it is

An evaluation harness is the runner: it takes an eval set, a candidate (a prompt, a model, a workflow, a retrieval policy), and the scoring rubrics, then executes the eval in parallel and emits a structured report. The output is what the team uses to decide whether the candidate is better than what's in production.
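As a rough illustration of those inputs and that output, here is a minimal sequential sketch. Every name in it (`EvalCase`, `CaseResult`, `exact_match`, `run_eval`) is hypothetical, the candidate is assumed to be a plain callable from prompt to output, and the rubric is reduced to a case-insensitive exact match. A real harness would use richer rubrics and run in parallel.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable, List


@dataclass
class EvalCase:
    """One entry in the eval set (hypothetical shape)."""
    case_id: str
    prompt: str
    expected: str


@dataclass
class CaseResult:
    """Per-case row in the structured report."""
    case_id: str
    output: str
    score: float


def exact_match(output: str, expected: str) -> float:
    """Toy rubric: 1.0 on a case-insensitive match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(
    cases: List[EvalCase],
    candidate: Callable[[str], str],
    rubric: Callable[[str, str], float],
) -> dict:
    """Run every case through the candidate and return a structured report."""
    results = []
    for case in cases:
        output = candidate(case.prompt)
        results.append(CaseResult(case.case_id, output, rubric(output, case.expected)))
    mean_score = sum(r.score for r in results) / len(results) if results else 0.0
    return {"mean_score": mean_score, "results": [asdict(r) for r in results]}


if __name__ == "__main__":
    cases = [
        EvalCase("c1", "2+2", "4"),
        EvalCase("c2", "capital of France", "Paris"),
    ]
    # Stand-in candidate: answers one prompt correctly, the other not at all.
    candidate = lambda prompt: {"2+2": "4"}.get(prompt, "unknown")
    print(json.dumps(run_eval(cases, candidate, exact_match), indent=2))
```

Keeping the report as plain, JSON-serializable data is one way to make it easy to store alongside the eval set and compare across runs.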

Why it matters

Building an eval set is one half of the work. The other half is having a harness that runs it reliably, in parallel, with calibrated scoring and proper attribution. A great eval set with a flaky harness produces scores nobody trusts.

How it works

The harness reads the eval set from version control, then spawns parallel workers that call the candidate through the AI Platform gateway, so measured cost and latency reflect production. It scores each output with a rubric or with an LLM-as-judge calibrated against human raters, and emits a structured diff against the baseline. Promotion gates consume that diff.
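The flow above can be sketched in a few lines, under heavy simplification. The function names, the case format, and the gate thresholds are all illustrative assumptions. The candidate is a local callable standing in for a call through the gateway, scoring is an exact match rather than a rubric or judge model, and the case list is assumed to be non-empty.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def score_case(candidate, case):
    """Call the candidate once, timing it and scoring by exact match."""
    start = time.perf_counter()
    output = candidate(case["input"])  # stand-in for a gateway call
    latency_s = time.perf_counter() - start
    quality = 1.0 if output == case["expected"] else 0.0
    return {"quality": quality, "latency_s": latency_s}


def run_parallel(candidate, cases, workers=4):
    """Fan the cases out to worker threads and aggregate the scores."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        rows = list(pool.map(lambda case: score_case(candidate, case), cases))
    return {
        "quality": mean(row["quality"] for row in rows),
        "latency_s": mean(row["latency_s"] for row in rows),
    }


def diff_against_baseline(candidate_report, baseline_report):
    """Per-metric delta: positive means the candidate's value is higher."""
    return {key: candidate_report[key] - baseline_report[key] for key in baseline_report}


def passes_gate(diff, max_quality_drop=0.0, max_latency_increase_s=0.5):
    """Hypothetical gate: no quality regression, bounded latency increase."""
    return (
        diff["quality"] >= -max_quality_drop
        and diff["latency_s"] <= max_latency_increase_s
    )
```

Threads suit this sketch because calls to a remote model are I/O-bound. The diff is kept separate from the gate so the same report can feed several promotion policies.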

Related resources

Evals

Short for 'evaluations' — the test cases and harness that measure whether an AI workflow is working, before and after every change.

LLM-as-Judge

Using one language model to score the output of another against a rubric — calibrated against human raters to make the scores trustworthy.

Promotion Gates

The thresholds an AI change must clear before it reaches production — quality, latency, cost, memory, safety — enforced by CI, not by hope.

Workflow Evals

The test suite your AI workflows have to pass before any change reaches users — measuring quality, latency, cost, and safety on real production data instead of vibes.
