Use case

Generated-Code Tests

Test coverage for the agent-generated code that lives inside workflows (handlers, transformers, validators), so the eval set catches regressions in code, not just prompts.

Overview

When an agent generates code that runs inside a workflow, the eval set has to cover the code as well as the model. Otherwise a 'safe' prompt change can ship a broken handler.

What it solves

Closes the gap between 'the prompt is fine' and 'the handler the agent generated still does what we need'.

How we build it

Every generated code artifact lands with test cases derived from real workflow inputs. For each variant, the harness runs those tests against the generated code instead of evaluating the prompt alone. Coverage gaps are surfaced and turned into new test cases.
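
As a rough illustration of running the tests against each variant, here is a minimal sketch. It assumes each variant's generated handler can be loaded as a plain Python callable. All names in it (ArtifactCase, run_cases, evaluate_variants, the two handlers, the payload fields) are hypothetical and are not part of any real harness API.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical type: a generated handler takes a workflow payload and returns a result.
Handler = Callable[[dict[str, Any]], Any]


@dataclass(frozen=True)
class ArtifactCase:
    """One test case for a generated artifact, derived from a real workflow input."""
    name: str
    payload: dict[str, Any]
    expected: Any


def run_cases(handler: Handler, cases: list[ArtifactCase]) -> list[str]:
    """Return the names of failing cases. An exception counts as a failure."""
    failures = []
    for case in cases:
        try:
            passed = handler(case.payload) == case.expected
        except Exception:
            passed = False
        if not passed:
            failures.append(case.name)
    return failures


def evaluate_variants(
    variants: dict[str, Handler], cases: list[ArtifactCase]
) -> dict[str, list[str]]:
    """Run the same cases against the handler produced by each variant."""
    return {name: run_cases(handler, cases) for name, handler in variants.items()}


# Illustrative handlers standing in for the code two prompt variants generated.
def handler_v1(payload: dict[str, Any]) -> dict[str, Any]:
    return {
        "amount_cents": payload["amount_cents"],
        "currency": payload.get("currency", "USD"),
    }


def handler_v2(payload: dict[str, Any]) -> dict[str, Any]:
    # Regression: no default, so a payload without "currency" raises KeyError.
    return {
        "amount_cents": payload["amount_cents"],
        "currency": payload["currency"],
    }


CASES = [
    ArtifactCase(
        "explicit_currency",
        {"amount_cents": 1200, "currency": "EUR"},
        {"amount_cents": 1200, "currency": "EUR"},
    ),
    ArtifactCase(
        "default_currency",
        {"amount_cents": 500},
        {"amount_cents": 500, "currency": "USD"},
    ),
]

results = evaluate_variants({"prompt_v1": handler_v1, "prompt_v2": handler_v2}, CASES)
for variant, failures in results.items():
    print(variant, "FAIL" if failures else "PASS", failures)
```

In this sketch the second variant reads as a harmless prompt change, but the handler it produced fails on the payload with no currency field, so the variant is flagged before it ships.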

Per-artifact test cases from real inputs
Tests run in the eval harness
Coverage gap detection
Regression set seeded from production
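
One way coverage gap detection could work is sketched below. It assumes that "coverage" means every input shape seen in production is exercised by at least one test case, and that a shape is the set of field names with their value types. That definition, along with the names input_shape and find_coverage_gaps and the sample payloads, is an assumption made for illustration only.

```python
from typing import Any


def input_shape(payload: dict[str, Any]) -> tuple[tuple[str, str], ...]:
    """Coarse signature of a payload: sorted (field name, value type name) pairs."""
    return tuple(sorted((key, type(value).__name__) for key, value in payload.items()))


def find_coverage_gaps(
    test_inputs: list[dict[str, Any]],
    production_inputs: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    """Return one production payload per input shape that no test case covers."""
    covered = {input_shape(payload) for payload in test_inputs}
    seen: set[tuple[tuple[str, str], ...]] = set()
    gaps = []
    for payload in production_inputs:
        shape = input_shape(payload)
        if shape not in covered and shape not in seen:
            seen.add(shape)
            gaps.append(payload)  # candidate for a new test case
    return gaps


# Illustrative data: one existing test input and four production payloads.
test_inputs = [{"amount_cents": 1200, "currency": "EUR"}]
production_inputs = [
    {"amount_cents": 300, "currency": "USD"},    # same shape as the test input
    {"amount_cents": 500},                       # gap: currency field missing
    {"amount_cents": 700},                       # same gap shape, reported once
    {"amount_cents": "900", "currency": "USD"},  # gap: amount arrives as a string
]

gaps = find_coverage_gaps(test_inputs, production_inputs)
for payload in gaps:
    print("uncovered input shape:", payload)
```

Each returned payload is a candidate for a new test case in the regression set. A real system would likely need a finer notion of shape (nested fields, value ranges), which this sketch leaves out.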

What changes when it is in place

Generated code stops being a silent failure mode. When a change breaks a handler, the eval run fails loudly instead of the breakage shipping unnoticed.