Use case

Eval Case Capture

Hard or high-value conversations are promoted into the eval set, either as gold examples to preserve or as regressions to prevent.

Overview

Every interesting conversation is either an example the team wants to preserve or a failure the team wants to prevent. Capture makes both useful.

What it solves

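The eval set's value compounds as the system matures, because every captured case adds coverage drawn from real conversations. Capture also removes the 'we'll write evals later' debt: cases accumulate as a by-product of triage rather than as a separate project.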

How we build it

Reviewers flag conversations for capture during normal triage. Each captured case goes through a quick labeling step (gold, regression, or ambiguous) and lands in the eval set with attribution to the reviewer who flagged it. The harness then re-runs on the updated set, so the impact of each addition is measurable.

  • Flag during triage
  • Quick labeling and curation
  • Land in eval set with attribution
  • Harness re-run on capture
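
The steps above could be sketched roughly as follows. This is a minimal illustration in Python, not the actual implementation: the names EvalCase, EvalSet, Label, and run_harness are hypothetical, and exact-match scoring is an assumption chosen only to keep the example short.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable


class Label(Enum):
    GOLD = "gold"              # behaviour to preserve
    REGRESSION = "regression"  # failure to prevent
    AMBIGUOUS = "ambiguous"    # needs more curation before it can be scored


@dataclass(frozen=True)
class EvalCase:
    conversation_id: str
    prompt: str
    expected: str
    label: Label
    flagged_by: str  # attribution: the reviewer who flagged the case


@dataclass
class EvalSet:
    cases: list[EvalCase] = field(default_factory=list)

    def capture(self, case: EvalCase) -> bool:
        """Land a labeled case; return False if the conversation is already captured."""
        if any(c.conversation_id == case.conversation_id for c in self.cases):
            return False
        self.cases.append(case)
        return True


def run_harness(eval_set: EvalSet, system: Callable[[str], str]) -> dict[str, float]:
    """Re-run every scorable case and report the pass rate per label."""
    totals: dict[str, int] = {}
    passes: dict[str, int] = {}
    for case in eval_set.cases:
        if case.label is Label.AMBIGUOUS:
            continue  # held back until curation resolves the label
        key = case.label.value
        totals[key] = totals.get(key, 0) + 1
        # Assumption: exact string match stands in for a real grader.
        if system(case.prompt).strip() == case.expected:
            passes[key] = passes.get(key, 0) + 1
    return {key: passes.get(key, 0) / totals[key] for key in totals}


def toy_system(prompt: str) -> str:
    """Stand-in for the system under test (hypothetical)."""
    return "4" if "2 + 2" in prompt else "Lyon"


eval_set = EvalSet()
eval_set.capture(EvalCase("conv-001", "What is 2 + 2?", "4", Label.GOLD, "reviewer-a"))
eval_set.capture(EvalCase("conv-002", "Capital of France?", "Paris", Label.REGRESSION, "reviewer-b"))
eval_set.capture(EvalCase("conv-003", "Is this reply acceptable?", "", Label.AMBIGUOUS, "reviewer-a"))

report = run_harness(eval_set, toy_system)
```

In this sketch, ambiguous cases are stored but excluded from scoring, so they stay visible for later curation without distorting pass rates. Comparing the report before and after a capture is one simple way to make the impact of a new case visible.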

What changes when it is in place

The eval set keeps pace with the system. Failures that have been captured are caught by the harness on the next run, so repeat regressions become rare.