Use case

Eval Dashboards

Operational dashboards that show the eval-set score, regression count, latency, and cost over time — the visible state of the team's quality discipline.

Overview

Evals that nobody looks at are evals that quietly stop running. A dashboard is the difference between a discipline and an artifact.

What it solves

Makes the quality of the workflow portfolio a visible, owned metric instead of private state inside the eval harness.

How we build it

Dashboards per workflow and per portfolio show score over time, regression count, p95 latency, cost per request, and recent failure clusters. Each failure drills through to its trace. Owners receive a weekly summary; declines trigger a page.

  • Score, regression count, latency, and cost over time
  • Per-workflow and portfolio views
  • Drill-through to traces
  • Weekly summary and alerting
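
As a rough illustration of the numbers behind one dashboard point, here is a minimal Python sketch. Everything in it is an assumption made for the example, not a description of the actual harness: the EvalResult record, the summarize and should_page helpers, the nearest-rank p95, the definition of a regression as a case that passed in the previous run and fails in the current one, and the 0.05 score-drop threshold for paging.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class EvalResult:
    """One eval case outcome (hypothetical record shape)."""
    case_id: str
    passed: bool
    latency_ms: float
    cost_usd: float


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def summarize(current, previous):
    """Aggregate one non-empty eval run into dashboard metrics.

    A regression is assumed to be a case that passed in the
    previous run and fails in the current one.
    """
    prev_passed = {r.case_id for r in previous if r.passed}
    regressed = [r.case_id for r in current
                 if not r.passed and r.case_id in prev_passed]
    n = len(current)
    return {
        "score": sum(r.passed for r in current) / n,
        "regression_count": len(regressed),
        "latency_p95_ms": p95([r.latency_ms for r in current]),
        "cost_per_request_usd": sum(r.cost_usd for r in current) / n,
        "regressed_cases": regressed,  # starting points for trace drill-through
    }


def should_page(score, previous_score, max_drop=0.05):
    """Page the owner when the score falls by more than max_drop (assumed threshold)."""
    return previous_score - score > max_drop
```

Running summarize once per eval run and storing the result gives the time series for each workflow view; a portfolio view could aggregate the same records across workflows. The regressed case IDs are what a drill-through link would key on.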

What changes when it is in place

Quality has a visible owner and a visible trend.