Research notes

Conversation Forensics

Incident detection and root-cause analysis on human↔agent conversations — replaying threads, reading the context around negative sentiment, extracting whether the user actually resolved their problem, and turning the answer into a learning artifact the system can use next time.

Operating principle

Production AI is not a prompt. It is a system of context, tools, permissions, traces, evals, and feedback loops.

Detecting incidents in human↔agent conversations

An incident is a moment in a thread where something went measurably wrong: the user rephrased the same question three times, the sentiment turned negative and stayed there, the user explicitly asked for a human, the agent's confidence dropped below a threshold, a tool call failed and the recovery did not land, or the user abandoned mid-conversation. Detection runs on the live signal — sentiment trajectory, repeat-question pattern, escalation requests, tool error events, dwell time — and on the trace metadata the runtime already emits. Detection is per-thread, not per-message, because the incident usually only makes sense in the arc.

  • Sentiment trajectory and negative-stick detection
  • Repeat-question and rephrase patterns
  • Explicit escalation signals ('let me talk to a human')
  • Tool failure and unrecovered error events
  • Mid-conversation abandonment
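The per-thread detection signals above can be sketched as a small rule-based detector. This is a minimal illustration, not the production detector: the thresholds, phrase list, and `Turn` fields are all assumptions, and real sentiment and rephrase scoring would come from models rather than word overlap.

```python
from dataclasses import dataclass

# Illustrative escalation phrases; a real system would use a classifier.
ESCALATION_PHRASES = ("talk to a human", "speak to a person", "real agent")

@dataclass
class Turn:
    role: str          # "user" or "agent"
    text: str
    sentiment: float   # -1.0 (negative) .. 1.0 (positive)
    tool_error: bool = False
    recovered: bool = True

def detect_incidents(turns: list[Turn]) -> list[str]:
    """Return the incident signals fired by one whole thread (per-thread, not per-message)."""
    signals = []
    user_turns = [t for t in turns if t.role == "user"]

    # Negative-stick: sentiment goes negative and stays there.
    tail = [t.sentiment for t in user_turns[-3:]]
    if len(tail) == 3 and all(s < 0 for s in tail):
        signals.append("negative_stick")

    # Repeat-question: consecutive near-identical user messages (Jaccard word overlap).
    texts = [set(t.text.lower().split()) for t in user_turns]
    repeats = sum(
        1 for a, b in zip(texts, texts[1:])
        if a and len(a & b) / len(a | b) > 0.6
    )
    if repeats >= 2:
        signals.append("repeat_question")

    # Explicit escalation request.
    if any(p in t.text.lower() for t in user_turns for p in ESCALATION_PHRASES):
        signals.append("escalation_request")

    # Tool failure whose recovery did not land.
    if any(t.tool_error and not t.recovered for t in turns):
        signals.append("unrecovered_tool_error")

    return signals
```

The point of the sketch is the shape: the detector consumes the whole arc of a thread and emits named signals, which is what lets downstream forensics reason about the incident rather than about a single bad message.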

Replaying the thread around the moment

Once an incident is flagged, the forensics view reconstructs the window around it: the messages before and after, the agent's intermediate reasoning, what was retrieved and from where, which tools were called with which inputs and outputs, the model parameters and route at each step. The reviewer (or the analysis pipeline) sees not just what the agent said but what the agent saw — the difference matters when diagnosing a hallucination versus a bad retrieval versus a stale knowledge source.
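A minimal sketch of that window reconstruction, assuming the runtime emits a flat event stream where each event is tagged with a kind and the message index it belongs to (both field names are illustrative):

```python
def replay_window(events: list[dict], incident_idx: int, radius: int = 2) -> dict:
    """Collect the trace events within `radius` messages of the flagged
    message, grouped by kind, so a reviewer sees what the agent saw."""
    lo, hi = incident_idx - radius, incident_idx + radius
    window = [e for e in events if lo <= e["msg_idx"] <= hi]
    view = {"messages": [], "retrievals": [], "tool_calls": [], "model_steps": []}
    for e in window:
        view[e["kind"] + "s"].append(e)   # kinds: message, retrieval, tool_call, model_step
    return view
```

Grouping by kind rather than replaying a flat transcript is what makes the diagnosis possible: an empty `retrievals` bucket next to a confident answer points at hallucination, a populated bucket with stale content points at the source.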

Outcome extraction: did the user actually solve it?

An incident that ends with the user thanking the agent is a very different artifact from one that ends with the user closing the tab. We extract a resolution outcome per thread: solved with the agent alone, solved after human escalation, solved on a follow-up thread, abandoned without resolution, or unclear. The outcome is paired with the incident signal and with what the agent (or human) actually did — so the next time a similar incident is detected, there is real prior evidence about what worked and what did not.

  • Solved with agent only — what action or answer closed it
  • Solved after escalation — what the human did differently
  • Resolved on a follow-up thread — how the user came back
  • Abandoned — and whether the user returned via a different channel
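A sketch of what the extraction produces. In practice outcome labeling is model-assisted; the rule-based stand-in below exists only to show the label set and the shape of the record that travels with the incident (thread fields are assumptions):

```python
OUTCOMES = ("solved_agent_only", "solved_after_escalation",
            "solved_follow_up", "abandoned", "unclear")

def extract_outcome(thread: dict) -> dict:
    """Map one thread to a resolution outcome plus the action that closed it."""
    if thread.get("resolved"):
        if thread.get("escalated"):
            label = "solved_after_escalation"
        elif thread.get("follow_up_of"):
            label = "solved_follow_up"
        else:
            label = "solved_agent_only"
    elif thread.get("abandoned"):
        label = "abandoned"
    else:
        label = "unclear"
    return {"thread_id": thread["id"], "outcome": label,
            "closing_action": thread.get("closing_action")}
```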

What forensics produces as a learning artifact

Every analyzed incident becomes a structured record: the failure signal, the trace excerpt, the hypothesized cause (retrieval, reasoning, tool, knowledge, policy), the resolution outcome, and — when one exists — the resolution recipe. Aggregated, these records form a named cluster of related failures with a count, a trend, a recommended owner, and a candidate remedy. The same records feed the eval set (as regression cases), the knowledge base (when the resolution was a missing answer), and the workflow itself (when a high-confidence recipe can be wired in as an automatic recovery path).

  • Structured incident record per thread
  • Cluster with count, trend, owner, and remedy
  • Regression cases for the eval set
  • Knowledge updates when the resolution was missing context
  • Recovery recipes wired back into the workflow under approval

How we cluster

Embedding similarity on the failure signal (the question, the unresolved exchange, the trace summary, the extracted incident features) seeds initial clusters; model-assisted labeling refines them; humans review and merge. The cluster identity is stable enough that new bad threads either fit existing clusters or surface as candidates for new ones — and the resolution outcome travels with the cluster, so a pattern can be flagged as 'recurring and still unsolved' versus 'recurring but the human path works' versus 'we have a recipe; let's automate it'.
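The seeding step above reduces to nearest-centroid assignment over incident embeddings: a new bad thread either clears a similarity threshold against an existing cluster or surfaces as a candidate for a new one. A toy sketch, with hand-rolled cosine similarity and a threshold that is purely illustrative (real embeddings would come from a model, and refinement/merging happens downstream):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def assign(embedding: list[float], centroids: dict[str, list[float]],
           threshold: float = 0.8):
    """Return (cluster_name, similarity) when the thread fits an existing
    cluster, or (None, best_similarity) to flag a new-cluster candidate."""
    best_name, best_sim = None, -1.0
    for name, centroid in centroids.items():
        sim = cosine(embedding, centroid)
        if sim > best_sim:
            best_name, best_sim = name, sim
    if best_sim >= threshold:
        return best_name, best_sim
    return None, best_sim
```

Stable cluster identity falls out of this: the centroids persist, so the resolution outcome attached to a cluster keeps applying to every new thread that lands in it.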

Where the loop closes

Retrieval failures become source-graph or chunking work. Reasoning failures become prompt or model changes that go through the eval gate. Tool failures become reliability work in the runtime. Knowledge failures become updates to the underlying source. Policy failures become governance changes. And when the resolution outcome is well-characterized, the recipe itself becomes a candidate workflow change — proposed, evaluated against the regression set, and promoted under approval. Each closure decrements the cluster count; new failures grow it; the net rate of closures against new failures is the visible state of the team's closed-loop discipline.
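The cause-to-workstream routing above is essentially a lookup table, plus one override for well-characterized recipes. The mapping is the document's own; the function name and the `recipe_confidence` field are illustrative stand-ins:

```python
WORK_STREAM = {
    "retrieval": "source-graph / chunking work",
    "reasoning": "prompt or model change, through the eval gate",
    "tool": "runtime reliability work",
    "knowledge": "update to the underlying source",
    "policy": "governance change",
}

def route_closure(cluster: dict) -> str:
    """Route a failure cluster to its owning work stream; a well-characterized
    recipe instead becomes a candidate workflow change."""
    if cluster.get("recipe_confidence", 0.0) >= 0.9:
        return ("workflow change: propose recipe, evaluate against the "
                "regression set, promote under approval")
    return WORK_STREAM[cluster["cause"]]
```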

Related resources