Use case
Regression Datasets
Curated eval cases drawn from production failures — every fixed bug becomes a regression that future changes must clear.
Overview
Bug fixes that never become regression tests are bug fixes the team will ship again. Regression datasets are the discipline that makes 'never twice' a real claim.
What it solves
Prevents the team from re-shipping known failures. Compounds the eval set's value as the system matures.
How we build it
Every triaged failure becomes a candidate regression: the input that triggered it, the bad output, the corrected output. Curation deduplicates and labels; the result is part of the eval set the harness runs on every change.
- Failure-to-regression pipeline
- Deduplication and labeling
- Per-case version and provenance
- Run on every change
What changes when it is in place
Quality drift becomes detectable. Repeat failures become rare.