Research notes

Conversation Forensics

Incident detection and root-cause analysis on human↔agent conversations — replaying threads, reading the context around negative sentiment, extracting whether the user actually resolved their problem, and turning the answer into a learning artifact the system can use next time.

Operating principle

Production AI is not a prompt. It is a system of context, tools, permissions, traces, evals, and feedback loops.

Detecting incidents in human↔agent conversations

An incident is a moment in a thread where something went measurably wrong: the user rephrased the same question three times, the sentiment turned negative and stayed there, the user explicitly asked for a human, the agent's confidence dropped below a threshold, a tool call failed and the recovery did not land, or the user abandoned mid-conversation. Detection runs on the live signal — sentiment trajectory, repeat-question pattern, escalation requests, tool error events, dwell time — and on the trace metadata the runtime already emits. Detection is per-thread, not per-message, because the incident usually only makes sense in the arc.

  • Sentiment trajectory and negative-stick detection
  • Repeat-question and rephrase patterns
  • Explicit escalation signals ('let me talk to a human')
  • Tool failure and unrecovered error events
  • Mid-conversation abandonment
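The per-thread detection signals above can be sketched as a small rule-based detector. This is a minimal illustration, not the production detector: the thresholds, phrase list, and `Turn` fields are all assumptions, and real sentiment and rephrase scoring would come from models rather than word overlap.

```python
from dataclasses import dataclass

# Illustrative escalation phrases; a real system would use a classifier.
ESCALATION_PHRASES = ("talk to a human", "speak to a person", "real agent")

@dataclass
class Turn:
    role: str          # "user" or "agent"
    text: str
    sentiment: float   # -1.0 (negative) .. 1.0 (positive)
    tool_error: bool = False
    recovered: bool = True

def detect_incidents(turns: list[Turn]) -> list[str]:
    """Return the incident signals fired by one whole thread (per-thread, not per-message)."""
    signals = []
    user_turns = [t for t in turns if t.role == "user"]

    # Negative-stick: sentiment goes negative and stays there.
    tail = [t.sentiment for t in user_turns[-3:]]
    if len(tail) == 3 and all(s < 0 for s in tail):
        signals.append("negative_stick")

    # Repeat-question: consecutive near-identical user messages (Jaccard word overlap).
    texts = [set(t.text.lower().split()) for t in user_turns]
    repeats = sum(
        1 for a, b in zip(texts, texts[1:])
        if a and len(a & b) / len(a | b) > 0.6
    )
    if repeats >= 2:
        signals.append("repeat_question")

    # Explicit escalation request.
    if any(p in t.text.lower() for t in user_turns for p in ESCALATION_PHRASES):
        signals.append("escalation_request")

    # Tool failure whose recovery did not land.
    if any(t.tool_error and not t.recovered for t in turns):
        signals.append("unrecovered_tool_error")

    return signals
```

The point of the sketch is the shape: the detector consumes the whole arc of a thread and emits named signals, which is what lets downstream forensics reason about the incident rather than about a single bad message.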

Replaying the thread around the moment

Once an incident is flagged, the forensics view reconstructs the window around it: the messages before and after, the agent's intermediate reasoning, what was retrieved and from where, which tools were called with which inputs and outputs, the model parameters and route at each step. The reviewer (or the analysis pipeline) sees not just what the agent said but what the agent saw — the difference matters when diagnosing a hallucination versus a bad retrieval versus a stale knowledge source.
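A minimal sketch of that window reconstruction, assuming the runtime emits a flat event stream where each event is tagged with a kind and the message index it belongs to (both field names are illustrative):

```python
def replay_window(events: list[dict], incident_idx: int, radius: int = 2) -> dict:
    """Collect the trace events within `radius` messages of the flagged
    message, grouped by kind, so a reviewer sees what the agent saw."""
    lo, hi = incident_idx - radius, incident_idx + radius
    window = [e for e in events if lo <= e["msg_idx"] <= hi]
    view = {"messages": [], "retrievals": [], "tool_calls": [], "model_steps": []}
    for e in window:
        view[e["kind"] + "s"].append(e)   # kinds: message, retrieval, tool_call, model_step
    return view
```

Grouping by kind rather than replaying a flat transcript is what makes the diagnosis possible: an empty `retrievals` bucket next to a confident answer points at hallucination, a populated bucket with stale content points at the source.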

Outcome extraction: did the user actually solve it?

An incident that ends with the user thanking the agent is a very different artifact from one that ends with the user closing the tab. We extract a resolution outcome per thread: solved with the agent alone, solved after human escalation, solved on a follow-up thread, abandoned without resolution, or unclear. The outcome is paired with the incident signal and with what the agent (or human) actually did — so the next time a similar incident is detected, there is real prior evidence about what worked and what did not.

  • Solved with agent only — what action or answer closed it
  • Solved after escalation — what the human did differently
  • Resolved on a follow-up thread — how the user came back
  • Abandoned — and whether the user returned via a different channel
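A sketch of what the extraction produces. In practice outcome labeling is model-assisted; the rule-based stand-in below exists only to show the label set and the shape of the record that travels with the incident (thread fields are assumptions):

```python
OUTCOMES = ("solved_agent_only", "solved_after_escalation",
            "solved_follow_up", "abandoned", "unclear")

def extract_outcome(thread: dict) -> dict:
    """Map one thread to a resolution outcome plus the action that closed it."""
    if thread.get("resolved"):
        if thread.get("escalated"):
            label = "solved_after_escalation"
        elif thread.get("follow_up_of"):
            label = "solved_follow_up"
        else:
            label = "solved_agent_only"
    elif thread.get("abandoned"):
        label = "abandoned"
    else:
        label = "unclear"
    return {"thread_id": thread["id"], "outcome": label,
            "closing_action": thread.get("closing_action")}
```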

What forensics produces as a learning artifact

Every analyzed incident becomes a structured record: the failure signal, the trace excerpt, the hypothesized cause (retrieval, reasoning, tool, knowledge, policy), the resolution outcome, and — when one exists — the resolution recipe. Aggregated, these records form a named cluster of related failures with a count, a trend, a recommended owner, and a candidate remedy. The same records feed the eval set (as regression cases), the knowledge base (when the resolution was a missing answer), and the workflow itself (when a high-confidence recipe can be wired in as an automatic recovery path).

  • Structured incident record per thread
  • Cluster with count, trend, owner, and remedy
  • Regression cases for the eval set
  • Knowledge updates when the resolution was missing context
  • Recovery recipes wired back into the workflow under approval

How we cluster

Embedding similarity on the failure signal (the question, the unresolved exchange, the trace summary, the extracted incident features) seeds initial clusters; model-assisted labeling refines them; humans review and merge. The cluster identity is stable enough that new bad threads either fit existing clusters or surface as candidates for new ones — and the resolution outcome travels with the cluster, so a pattern can be flagged as 'recurring and still unsolved' versus 'recurring but the human path works' versus 'we have a recipe; let's automate it'.
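The seeding step above reduces to nearest-centroid assignment over incident embeddings: a new bad thread either clears a similarity threshold against an existing cluster or surfaces as a candidate for a new one. A toy sketch, with hand-rolled cosine similarity and a threshold that is purely illustrative (real embeddings would come from a model, and refinement/merging happens downstream):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def assign(embedding: list[float], centroids: dict[str, list[float]],
           threshold: float = 0.8):
    """Return (cluster_name, similarity) when the thread fits an existing
    cluster, or (None, best_similarity) to flag a new-cluster candidate."""
    best_name, best_sim = None, -1.0
    for name, centroid in centroids.items():
        sim = cosine(embedding, centroid)
        if sim > best_sim:
            best_name, best_sim = name, sim
    if best_sim >= threshold:
        return best_name, best_sim
    return None, best_sim
```

Stable cluster identity falls out of this: the centroids persist, so the resolution outcome attached to a cluster keeps applying to every new thread that lands in it.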

Where the loop closes

Retrieval failures become source-graph or chunking work. Reasoning failures become prompt or model changes that go through the eval gate. Tool failures become reliability work in the runtime. Knowledge failures become updates to the underlying source. Policy failures become governance changes. And when the resolution outcome is well-characterized, the recipe itself becomes a candidate workflow change — proposed, evaluated against the regression set, and promoted under approval. Each closure decrements the cluster count; new failures grow it; the net rate of closures against new failures is the visible state of the team's closed-loop discipline.
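The cause-to-workstream routing above is essentially a lookup table, plus one override for well-characterized recipes. The mapping is the document's own; the function name and the `recipe_confidence` field are illustrative stand-ins:

```python
WORK_STREAM = {
    "retrieval": "source-graph / chunking work",
    "reasoning": "prompt or model change, through the eval gate",
    "tool": "runtime reliability work",
    "knowledge": "update to the underlying source",
    "policy": "governance change",
}

def route_closure(cluster: dict) -> str:
    """Route a failure cluster to its owning work stream; a well-characterized
    recipe instead becomes a candidate workflow change."""
    if cluster.get("recipe_confidence", 0.0) >= 0.9:
        return ("workflow change: propose recipe, evaluate against the "
                "regression set, promote under approval")
    return WORK_STREAM[cluster["cause"]]
```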

Related resources