Learn

Workflow Evals

The test suite your AI workflows have to pass before any change reaches users — measuring quality, latency, cost, and safety on real production data instead of vibes.

Beyond model bake-offs

The question is not which model wins in isolation on a public benchmark. The question is which workflow shape — prompt, model, retrieval depth, tool budget, node count, fallback path — survives the speed, memory, quality, cost, and safety thresholds your operation requires. A workflow eval is the contract between the team proposing a change and the team operating production.

  • Prompt and model diff testing on the same trace inputs
  • Retrieval depth, hybrid weights, and rerank experiments
  • Generated-code and node-structure mutations
  • Regression datasets seeded from production failures

Composition of a useful eval set

Gold examples drawn from production successes (with PII handling), regressions from production failures, synthetic adversarial cases for known edge classes, and a small held-out human-rated calibration set used to validate the LLM-as-judge rubric. The eval set is itself a versioned artifact that goes through review when it changes.

Promotion gates

Candidates are promoted only when quality, p95 latency, memory, tool reliability, and total price remain inside defined thresholds — and when no regression case in the eval set fails. Gates can be tightened over time as the eval set grows.

What it works with

Pulls inputs from Agent Workflows (production traces become eval cases), the AI Platform (runs candidates through the same gateway as production), and Conversation Intelligence (failure clusters become regressions). Feeds Self-Optimizing Agents (the eval set is the contract the optimizer is honest against) and the Benchmarks service (which packages this discipline as a delivery offering). Sits beside Closed-Loop Knowledge — every triaged failure becomes either a new eval case or a documented exception.

When you need it

Signals: AI quality regressions are surfaced by user complaints rather than CI; the same bug is shipped twice because nobody added it to a regression set; a 'safe' prompt change just broke a customer workflow nobody re-tested. If you have a production AI workflow that matters, you have a workflow that deserves an eval set.