Use case
Eval Case Capture
Hard or high-value conversations promoted into the eval set — gold examples to preserve and regressions to prevent.
Overview
Every interesting conversation is either an example the team wants to preserve or a failure the team wants to prevent. Capture makes both useful.
What it solves
Capture compounds the eval set's value as the system matures, and it removes the 'we'll write evals later' debt because new cases arrive as a by-product of normal triage.
How we build it
Reviewers flag conversations for capture during normal triage. Each captured case goes through a quick labeling step (gold, regression, or ambiguous) and lands in the eval set with attribution to the reviewer who flagged it. The harness then re-runs against the updated set, so the effect of each new case on the results can be measured.
- Flag during triage
- Quick labeling and curation
- Land in eval set with attribution
- Harness re-run on capture
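The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the actual implementation: the names `Label`, `EvalCase`, `EvalSet`, `capture`, and `run_harness` are hypothetical, the sample conversations are made up, and exact string matching stands in for whatever scoring the real harness uses. Ambiguous cases are stored but skipped when scoring, which is also an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable


class Label(Enum):
    GOLD = "gold"              # behaviour to preserve
    REGRESSION = "regression"  # failure to prevent
    AMBIGUOUS = "ambiguous"    # needs more curation before it can be scored


@dataclass(frozen=True)
class EvalCase:
    conversation_id: str
    prompt: str
    expected: str
    label: Label
    flagged_by: str  # attribution: the reviewer who flagged it during triage


@dataclass
class EvalSet:
    cases: list[EvalCase] = field(default_factory=list)

    def capture(self, case: EvalCase) -> bool:
        """Add a case unless that conversation was already captured."""
        if any(c.conversation_id == case.conversation_id for c in self.cases):
            return False
        self.cases.append(case)
        return True


def run_harness(eval_set: EvalSet, system: Callable[[str], str]) -> dict[str, int]:
    """Re-run every scorable case and count passes, failures, and skips."""
    results = {"passed": 0, "failed": 0, "skipped": 0}
    for case in eval_set.cases:
        if case.label is Label.AMBIGUOUS:
            results["skipped"] += 1  # assumption: ambiguous cases are not scored
        elif system(case.prompt) == case.expected:  # assumption: exact match
            results["passed"] += 1
        else:
            results["failed"] += 1
    return results


# Usage with a toy system that only knows one answer.
eval_set = EvalSet()
eval_set.capture(EvalCase("conv-1", "refund policy?", "30 days", Label.GOLD, "alice"))
eval_set.capture(EvalCase("conv-2", "cancel order", "order cancelled", Label.REGRESSION, "bob"))
eval_set.capture(EvalCase("conv-3", "hmm", "", Label.AMBIGUOUS, "alice"))


def toy_system(prompt: str) -> str:
    return {"refund policy?": "30 days"}.get(prompt, "unknown")


summary = run_harness(eval_set, toy_system)
print(summary)
```

Keeping `flagged_by` on the case itself means attribution travels with the case wherever the eval set is stored, and de-duplicating on `conversation_id` keeps a conversation flagged by two reviewers from being counted twice.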
What changes when it is in place
The eval set keeps pace with the system. A failure that has been captured once is checked on every harness run, so repeat regressions become rare.