Use case
Eval Dashboards
Operational dashboards that show eval-set score, regression count, latency, and cost over time: the visible state of the team's quality discipline.
Overview
Evals that nobody looks at are evals that quietly stop running. A dashboard is the difference between a discipline and an artifact.
What it solves
Makes the quality of the team's workflows a visible, owned metric instead of private state hidden inside the eval harness.
How we build it
Each workflow gets its own dashboard, and a portfolio view rolls them up. Both show score over time, regression count, p95 latency, cost per request, and recent failure clusters, with drill-through from each failure into its trace. Owners receive a weekly summary, and score declines page the owner.
- Score, regression, latency, cost over time
- Per-workflow and portfolio views
- Drill-through to traces
- Weekly summary and alerting
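As a rough illustration of the numbers behind one dashboard point, the sketch below reduces a single eval run to score, regression count, p95 latency, and cost per request, and then decides whether a decline should page. Everything named here (`EvalResult`, `RunSummary`, `summarize`, `should_page`) is hypothetical, as are the definition of a regression (a case that passed in the previous run and fails now), the nearest-rank percentile method, and the 5-point paging threshold. It is a minimal sketch under those assumptions, not a description of a particular harness.

```python
from dataclasses import dataclass
from math import ceil
from typing import List, Set


@dataclass
class EvalResult:
    """One eval case outcome from a single run (hypothetical schema)."""
    case_id: str
    passed: bool
    latency_ms: float
    cost_usd: float


@dataclass
class RunSummary:
    """The per-run values a dashboard would plot over time."""
    score: float
    regression_count: int
    latency_p95_ms: float
    cost_per_request_usd: float


def p95(values: List[float]) -> float:
    """95th percentile using the nearest-rank method."""
    ordered = sorted(values)
    rank = ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def summarize(current: List[EvalResult], previously_passed: Set[str]) -> RunSummary:
    """Reduce one run to dashboard metrics.

    Assumption: a regression is a case that passed in the previous run
    (its id is in `previously_passed`) and fails in the current one.
    """
    if not current:
        raise ValueError("cannot summarize an empty run")
    passed = sum(1 for r in current if r.passed)
    regressions = sum(
        1 for r in current if not r.passed and r.case_id in previously_passed
    )
    return RunSummary(
        score=passed / len(current),
        regression_count=regressions,
        latency_p95_ms=p95([r.latency_ms for r in current]),
        cost_per_request_usd=sum(r.cost_usd for r in current) / len(current),
    )


def should_page(summary: RunSummary, previous_score: float, max_drop: float = 0.05) -> bool:
    """Page when the score fell by more than `max_drop` (an assumed threshold)."""
    return previous_score - summary.score > max_drop


if __name__ == "__main__":
    run = [
        EvalResult("a", True, 100.0, 0.01),
        EvalResult("b", False, 200.0, 0.02),
        EvalResult("c", True, 300.0, 0.03),
        EvalResult("d", False, 400.0, 0.02),
    ]
    summary = summarize(run, previously_passed={"a", "b", "c"})
    print(summary)
    print("page owner:", should_page(summary, previous_score=0.75))
```

Storing one `RunSummary` per run per workflow would give the time series for the per-workflow view. A portfolio view could aggregate the same records across workflows, and the failing `case_id` values are the natural keys for drill-through into traces.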
What changes when it is in place
Quality has a visible owner and a visible trend.