Use case

Regression Datasets

Curated eval cases drawn from production failures — every fixed bug becomes a regression that future changes must clear.

Overview

Bug fixes that never become regression tests are bug fixes the team will ship again. Regression datasets are the discipline that makes 'never twice' a real claim.

What it solves

Prevents the team from re-shipping known failures. Compounds the eval set's value as the system matures.

How we build it

Every triaged failure becomes a candidate regression: the input that triggered it, the bad output, the corrected output. Curation deduplicates and labels; the result is part of the eval set the harness runs on every change.

  • Failure-to-regression pipeline
  • Deduplication and labeling
  • Per-case version and provenance
  • Run on every change

What changes when it is in place

Quality drift becomes detectable. Repeat failures become rare.