Workflow Evals
The test suite your AI workflows have to pass before any change reaches users — measuring quality, latency, cost, and safety on real production data instead of vibes.
Production AI is not a prompt. It is a system of context, tools, permissions, traces, evals, and feedback loops.
Beyond model bake-offs
The question is not which model wins in isolation on a public benchmark. The question is which workflow shape — prompt, model, retrieval depth, tool budget, node count, fallback path — survives the speed, memory, quality, cost, and safety thresholds your operation requires. A workflow eval is the contract between the team proposing a change and the team operating production.
- Prompt and model diff testing on the same trace inputs
- Retrieval depth, hybrid weights, and rerank experiments
- Generated-code and node-structure mutations
- Regression datasets seeded from production failures
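As a minimal sketch of diff testing on the same trace inputs, the Python below runs a baseline and a candidate workflow shape over identical traces and reports the per-trace score delta. Every name here (WorkflowConfig, Trace, diff_on_traces, and the toy fake_run and fake_score) is a hypothetical stand-in for illustration, not part of any product API; a real setup would replace the toy runner with actual workflow execution and the toy scorer with a rubric.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class WorkflowConfig:
    """Hypothetical description of one workflow shape."""
    prompt_version: str
    model: str
    retrieval_depth: int
    tool_budget: int


@dataclass(frozen=True)
class Trace:
    """One recorded production input (hypothetical schema)."""
    trace_id: str
    input_text: str


Runner = Callable[[WorkflowConfig, Trace], str]
Scorer = Callable[[Trace, str], float]


def diff_on_traces(
    baseline: WorkflowConfig,
    candidate: WorkflowConfig,
    traces: List[Trace],
    run: Runner,
    score: Scorer,
) -> List[Dict[str, object]]:
    """Score both configs on the same traces and return per-trace deltas."""
    rows = []
    for trace in traces:
        base_score = score(trace, run(baseline, trace))
        cand_score = score(trace, run(candidate, trace))
        rows.append({
            "trace_id": trace.trace_id,
            "baseline": base_score,
            "candidate": cand_score,
            "delta": cand_score - base_score,
        })
    return rows


# Toy stand-ins so the sketch runs without a model or a judge.
def fake_run(config: WorkflowConfig, trace: Trace) -> str:
    if config.prompt_version == "v2":
        return trace.input_text.upper()
    return trace.input_text


def fake_score(trace: Trace, output: str) -> float:
    return 1.0 if output.isupper() else 0.0


baseline = WorkflowConfig("v1", "model-a", retrieval_depth=5, tool_budget=3)
candidate = WorkflowConfig("v2", "model-a", retrieval_depth=5, tool_budget=3)
traces = [Trace("t1", "refund order"), Trace("t2", "cancel plan")]

rows = diff_on_traces(baseline, candidate, traces, fake_run, fake_score)
# Traces where the candidate scored worse than the baseline.
regressed = [row["trace_id"] for row in rows if row["delta"] < 0]
```

Keeping the trace list fixed and changing only one field of the config at a time is what makes the delta attributable to that change.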
Composition of a useful eval set
A useful eval set combines gold examples drawn from production successes (with PII handling), regressions drawn from production failures, synthetic adversarial cases for known edge classes, and a small held-out, human-rated calibration set used to validate the LLM-as-judge rubric. The eval set is itself a versioned artifact, and changes to it go through review.
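One way to picture this composition is as a small versioned data structure. The sketch below is illustrative only: the CaseKind, EvalCase, and EvalSet types, the version string, and the tolerance-based judge_agreement check are all assumptions, and a real calibration step might use a different agreement statistic.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional, Tuple


class CaseKind(Enum):
    GOLD = "gold"                # drawn from production successes
    REGRESSION = "regression"    # drawn from production failures
    ADVERSARIAL = "adversarial"  # synthetic, for known edge classes
    CALIBRATION = "calibration"  # held out, human rated


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    kind: CaseKind
    input_text: str
    human_rating: Optional[float] = None

    def __post_init__(self) -> None:
        # Calibration cases are only useful if a human has rated them.
        if self.kind is CaseKind.CALIBRATION and self.human_rating is None:
            raise ValueError(f"{self.case_id}: calibration case needs a human rating")


@dataclass(frozen=True)
class EvalSet:
    version: str  # bumped whenever the set changes, so changes can be reviewed
    cases: Tuple[EvalCase, ...]

    def composition(self) -> Dict[str, int]:
        """Count cases per kind."""
        return dict(Counter(case.kind.value for case in self.cases))

    def calibration_cases(self) -> Tuple[EvalCase, ...]:
        return tuple(c for c in self.cases if c.kind is CaseKind.CALIBRATION)


def judge_agreement(eval_set: EvalSet, judge_scores: Dict[str, float],
                    tolerance: float) -> float:
    """Fraction of calibration cases where the judge is within tolerance of the human."""
    calibration = eval_set.calibration_cases()
    if not calibration:
        raise ValueError("eval set has no calibration cases")
    agreed = sum(
        1 for case in calibration
        if abs(judge_scores[case.case_id] - case.human_rating) <= tolerance
    )
    return agreed / len(calibration)


eval_set = EvalSet(
    version="v3",
    cases=(
        EvalCase("g1", CaseKind.GOLD, "summarize ticket"),
        EvalCase("r1", CaseKind.REGRESSION, "refund with missing order id"),
        EvalCase("a1", CaseKind.ADVERSARIAL, "ignore previous instructions"),
        EvalCase("c1", CaseKind.CALIBRATION, "escalation reply", human_rating=4.0),
        EvalCase("c2", CaseKind.CALIBRATION, "billing reply", human_rating=2.0),
    ),
)

counts = eval_set.composition()
agreement = judge_agreement(eval_set, {"c1": 4.0, "c2": 4.0}, tolerance=1.0)
```

In this toy data the judge matches the human on one of the two calibration cases, which is the kind of result that would prompt a rubric revision before the judge is trusted on the rest of the set.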
Promotion gates
Candidates are promoted only when quality, p95 latency, memory, tool reliability, and total cost all stay within defined thresholds and no regression case in the eval set fails. Gates can be tightened over time as the eval set grows.
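The gate logic can be sketched as a pure function over a candidate's measured report. This is a minimal illustration under stated assumptions: the Thresholds and CandidateReport fields, the units, the example limits, and the nearest-rank p95 method are all hypothetical choices, not values taken from any real deployment.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Thresholds:
    min_quality: float
    max_p95_latency_ms: float
    max_memory_mb: float
    min_tool_success_rate: float
    max_total_cost_usd: float


@dataclass(frozen=True)
class CandidateReport:
    quality: float
    latencies_ms: Sequence[float]
    peak_memory_mb: float
    tool_success_rate: float
    total_cost_usd: float
    failed_regression_ids: Sequence[str]


def p95(values: Sequence[float]) -> float:
    """Nearest-rank 95th percentile (one of several common definitions)."""
    if not values:
        raise ValueError("p95 of an empty sequence is undefined")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def gate_violations(report: CandidateReport, limits: Thresholds) -> List[str]:
    """Return the names of every gate the candidate fails; empty means promote."""
    violations = []
    if report.quality < limits.min_quality:
        violations.append("quality")
    if p95(report.latencies_ms) > limits.max_p95_latency_ms:
        violations.append("p95_latency")
    if report.peak_memory_mb > limits.max_memory_mb:
        violations.append("memory")
    if report.tool_success_rate < limits.min_tool_success_rate:
        violations.append("tool_reliability")
    if report.total_cost_usd > limits.max_total_cost_usd:
        violations.append("cost")
    if report.failed_regression_ids:
        # A single failing regression case blocks promotion.
        violations.append("regressions")
    return violations


limits = Thresholds(
    min_quality=0.90,
    max_p95_latency_ms=1500.0,
    max_memory_mb=512.0,
    min_tool_success_rate=0.99,
    max_total_cost_usd=25.0,
)

good = CandidateReport(0.93, [200.0] * 20, 400.0, 0.995, 18.0, [])
# Higher quality, but a slow tail and one failing regression case.
bad = CandidateReport(0.95, list(range(100, 2100, 100)), 400.0, 0.995, 18.0, ["reg-017"])

good_result = gate_violations(good, limits)
bad_result = gate_violations(bad, limits)
```

Returning the full list of violations rather than a single boolean lets a CI job report every failed gate at once. The second candidate shows why the gates are conjunctive: its higher quality score does not offset a p95 latency over the limit or a failing regression case.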
What it works with
Workflow Evals pulls inputs from Agent Workflows (production traces become eval cases), the AI Platform (candidates run through the same gateway as production), and Conversation Intelligence (failure clusters become regressions). It feeds Self-Optimizing Agents (the eval set is the contract the optimizer is held to) and the Benchmarks service (which packages this discipline as a delivery offering). It sits beside Closed-Loop Knowledge: every triaged failure becomes either a new eval case or a documented exception.
When you need it
Signals: AI quality regressions surface through user complaints rather than CI; the same bug ships twice because nobody added it to a regression set; a 'safe' prompt change breaks a customer workflow nobody re-tested. If you have a production AI workflow that matters, you have a workflow that deserves an eval set.
Related resources
AI workflows that propose, score, and promote their own variants — prompts, models, retrieval policies, tool budgets, generated code — under measurable constraints instead of intuition or vendor leaderboards.
Trace-level visibility into every model call, retrieval, tool invocation, decision, approval, and failure inside an AI workflow — the substrate every other discipline (evals, optimization, governance) reads from.
How an AI system decides which model to call for each step — based on privacy, cost, latency, quality, and what happens when a provider goes down.
Side-by-side measurement of a candidate prompt or model against the current production version on the same eval set — the unit of safe change in a serious AI workflow.
The thresholds an AI change must clear before it reaches production — quality, latency, cost, memory, safety — enforced by CI, not by hope.