Evaluation

Trace Grading

Scoring AI workflow traces — not just final outputs — to detect quality issues at the step level: bad retrievals, wrong tool calls, low-confidence reasoning.

Operating principle

Production AI is not a prompt. It is a system of context, tools, permissions, traces, evals, and feedback loops.

What it is

Trace grading is evaluation at the step level instead of the output level. Where a basic eval asks 'did the final answer match the gold response,' trace grading asks 'did each step of the workflow hold up': did retrieval bring back the right chunks, did the tool calls have the right arguments, did the model's intermediate reasoning make sense, and did it stop when it should have.
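The contrast between output-level and step-level scoring can be made concrete with a minimal Python sketch. Everything here is an assumption made for illustration: the `Step` and `Trace` shapes, the `chunk_ids`, `name`, and `args` fields, and the `lookup_order` tool are hypothetical, and only retrieval and tool-call steps are graded (reasoning steps would typically need a rubric or a judge model).

```python
from dataclasses import dataclass


@dataclass
class Step:
    kind: str      # hypothetical step types: "retrieval" or "tool_call"
    payload: dict  # what the workflow actually did at this step


@dataclass
class Trace:
    steps: list
    final_answer: str


def grade_output(trace: Trace, gold: str) -> bool:
    """Output-level eval: compare only the final answer to the gold response."""
    return trace.final_answer.strip() == gold.strip()


def grade_step(step: Step, expected: dict) -> float:
    """Step-level eval: score one step against what it should have done."""
    if step.kind == "retrieval":
        # Fraction of the expected chunks that retrieval actually returned.
        wanted = set(expected["chunk_ids"])
        got = set(step.payload["chunk_ids"])
        return len(wanted & got) / len(wanted) if wanted else 1.0
    if step.kind == "tool_call":
        # Right tool with the right arguments, or nothing.
        same = (step.payload["name"] == expected["name"]
                and step.payload["args"] == expected["args"])
        return 1.0 if same else 0.0
    raise ValueError(f"no grader for step kind {step.kind!r}")


def grade_trace(trace: Trace, expected_steps: list) -> list:
    """Score every step in order; expected_steps is aligned with trace.steps."""
    return [grade_step(s, e) for s, e in zip(trace.steps, expected_steps)]


# A trace whose final answer is right even though retrieval missed entirely.
trace = Trace(
    steps=[
        Step("retrieval", {"chunk_ids": ["c7", "c9"]}),
        Step("tool_call", {"name": "lookup_order", "args": {"order_id": "A-17"}}),
    ],
    final_answer="Shipped on Tuesday",
)
expected_steps = [
    {"chunk_ids": ["c1", "c2"]},
    {"name": "lookup_order", "args": {"order_id": "A-17"}},
]

output_ok = grade_output(trace, "Shipped on Tuesday")
step_scores = grade_trace(trace, expected_steps)
print(output_ok, step_scores)  # True [0.0, 1.0]
```

The output-level check passes while the retrieval step scores zero, which is the kind of gap that only shows up when steps are graded individually.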

Why it matters

Output-only evaluation hides the workflow's actual behavior. An agent can get the right answer for the wrong reasons (lucky retrieval, accidental tool match) and fail catastrophically on the next case that doesn't have the same luck. Trace grading is how the team understands what the workflow is actually doing, not just what it produces.

How it works

Production traces are sampled and scored at the step level, using rubrics or an LLM-as-judge calibrated against human raters. Low-scoring steps are surfaced as candidate regressions even when the final output passed. When a failure cluster becomes repeatable, its traces are added to the regression set.
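A rough Python sketch of that loop follows, under stated assumptions. The dict-based trace format, the `score_step` callable (a stand-in for a rubric or a calibrated judge), the 0.5 threshold, and the use of step kind as the cluster key are all hypothetical choices for illustration, not a prescribed design.

```python
import random
from collections import Counter


def sample_traces(traces, rate, seed=0):
    """Keep each trace with probability `rate` (seeded for repeatability)."""
    rng = random.Random(seed)
    return [t for t in traces if rng.random() < rate]


def surface_candidates(traces, score_step, threshold=0.5):
    """Flag every low-scoring step, whether or not the final output passed."""
    candidates = []
    for t in traces:
        for i, step in enumerate(t["steps"]):
            score = score_step(step)
            if score < threshold:
                candidates.append({
                    "trace_id": t["id"],
                    "step_index": i,
                    "kind": step["kind"],
                    "score": score,
                    "output_passed": t["output_passed"],
                })
    return candidates


def repeatable_clusters(candidates, min_count=2):
    """Crude clustering: a step kind that fails at least `min_count` times."""
    counts = Counter(c["kind"] for c in candidates)
    return {kind for kind, n in counts.items() if n >= min_count}


# Hypothetical traces with step scores already attached by some grader.
traces = [
    {"id": "t1", "output_passed": True,
     "steps": [{"kind": "retrieval", "score": 0.2},
               {"kind": "tool_call", "score": 0.9}]},
    {"id": "t2", "output_passed": True,
     "steps": [{"kind": "retrieval", "score": 0.3},
               {"kind": "tool_call", "score": 0.4}]},
    {"id": "t3", "output_passed": False,
     "steps": [{"kind": "retrieval", "score": 0.8}]},
]

sampled = sample_traces(traces, rate=1.0)  # rate=1.0 keeps every trace
candidates = surface_candidates(sampled, score_step=lambda s: s["score"])
clusters = repeatable_clusters(candidates)

# Traces that belong to a repeatable cluster become regression cases.
regression_ids = sorted({c["trace_id"] for c in candidates
                         if c["kind"] in clusters})
print(clusters, regression_ids)  # {'retrieval'} ['t1', 't2']
```

Both flagged traces passed at the output level, so an output-only eval would not have caught the weak retrieval steps. In practice the cluster key would likely be richer than step kind alone, for example a failure label assigned by the rubric.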

Related resources

Evals

Short for 'evaluations' — the test cases and harness that measure whether an AI workflow is working, before and after every change.

Workflow Evals

The test suite your AI workflows have to pass before any change reaches users — measuring quality, latency, cost, and safety on real production data instead of vibes.

LLM-as-Judge

Using one language model to score the output of another against a rubric — calibrated against human raters to make the scores trustworthy.

Agent Observability

Trace-level visibility into every model call, retrieval, tool invocation, decision, approval, and failure inside an AI workflow — the substrate every other discipline (evals, optimization, governance) reads from.

Trace Replay

Deterministically re-running an AI workflow from its stored trace — the debugging primitive that makes 'why did the agent do that' a question with an answer.
