Agent Observability
Trace-level visibility into every model call, retrieval, tool invocation, decision, approval, and failure inside an AI workflow — the substrate every other discipline (evals, optimization, governance) reads from.
Production AI is not a prompt. It is a system of context, tools, permissions, traces, evals, and feedback loops.
Trace everything
Agent behavior is non-deterministic. Traces make runs inspectable, comparable, replayable, and useful for evaluation. We emit OpenTelemetry-compatible spans for every model call, tool invocation, retrieval step, decision branch, approval event, and external side effect — with cost, latency, token counts, and inputs and outputs (subject to PII rules).
- Model-call spans with token counts and latency (OpenTelemetry)
- Tool invocation inputs, outputs, and timing
- Retrieval evidence, scores, and citations
- Approval, rejection, and escalation events
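The span types listed above can be sketched with the standard library alone. This is a minimal illustration, not the OpenTelemetry SDK: the `Tracer` and `Span` classes, the `kind` labels, the attribute names, the model name, and the email-only `redact` rule are all assumptions made for the example.

```python
import re
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Iterator, Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    kind: str  # hypothetical labels: "workflow", "model", "tool", "retrieval"
    attributes: dict = field(default_factory=dict)
    start: float = 0.0
    end: float = 0.0

    @property
    def latency_ms(self) -> float:
        return (self.end - self.start) * 1000.0


class Tracer:
    """Collects spans in memory; a real setup would export them instead."""

    def __init__(self) -> None:
        self.trace_id = uuid.uuid4().hex  # one stable ID for the whole run
        self.spans: list = []
        self._stack: list = []  # currently open spans, innermost last

    @contextmanager
    def span(self, name: str, kind: str, **attributes: Any) -> Iterator[Span]:
        parent_id = self._stack[-1].span_id if self._stack else None
        s = Span(
            trace_id=self.trace_id,
            span_id=uuid.uuid4().hex[:16],
            parent_id=parent_id,
            name=name,
            kind=kind,
            attributes=dict(attributes),
            start=time.perf_counter(),
        )
        self._stack.append(s)
        try:
            yield s
        except Exception as exc:
            s.attributes["error"] = repr(exc)  # failures are recorded, then re-raised
            raise
        finally:
            s.end = time.perf_counter()
            self._stack.pop()
            self.spans.append(s)  # spans are stored in the order they finish


def redact(text: str) -> str:
    """Placeholder PII rule: masks email addresses only."""
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "[email]", text)


tracer = Tracer()
with tracer.span("handle_ticket", "workflow"):
    with tracer.span("search_kb", "retrieval",
                     query=redact("refund for bob@example.com")) as r:
        r.attributes["top_score"] = 0.82
    with tracer.span("draft_reply", "model", model="example-model") as m:
        m.attributes.update(input_tokens=412, output_tokens=96, cost_usd=0.0031)
```

Nesting the `with` blocks is what produces the parent-child links: each new span takes the innermost open span as its parent, and every span in the run shares the tracer's `trace_id`.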
Stacks we use
Langfuse, Arize Phoenix, Helicone, and Datadog are common destinations; which one we use depends on the rest of the stack. The data shape is what matters: spans linked parent to child across model, tool, and retrieval boundaries, attributes for cost and quality, and a stable trace ID that follows a workflow from inbound event to final outcome.
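To make that data shape concrete, here is a small sketch that rebuilds the span tree from a flat export and rolls cost up from children to parents. The record fields (`trace_id`, `span_id`, `parent_id`, `cost_usd`) and the sample values are assumptions for illustration, not the schema of any of the tools named above.

```python
from collections import defaultdict

# Hypothetical flat export of one trace; values are made up for the example.
spans = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "name": "handle_ticket", "cost_usd": 0.0},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "name": "search_kb", "cost_usd": 0.0004},
    {"trace_id": "t1", "span_id": "c", "parent_id": "a", "name": "draft_reply", "cost_usd": 0.0031},
    {"trace_id": "t1", "span_id": "d", "parent_id": "c", "name": "lookup_order", "cost_usd": 0.0002},
]


def build_children(spans):
    """Index spans by parent_id so the tree can be walked top-down."""
    children = defaultdict(list)
    for s in spans:
        children[s["parent_id"]].append(s)
    return children


def rollup_cost(span, children):
    """Cost of a span plus everything beneath it."""
    return span["cost_usd"] + sum(
        rollup_cost(child, children) for child in children[span["span_id"]]
    )


def render(span, children, depth=0):
    """Indented text view of the tree with rolled-up cost per node."""
    lines = [f"{'  ' * depth}{span['name']} (${rollup_cost(span, children):.4f})"]
    for child in children[span["span_id"]]:
        lines.extend(render(child, children, depth + 1))
    return lines


children = build_children(spans)
root = children[None][0]  # the span with no parent is the workflow root
tree = render(root, children)
print("\n".join(tree))
```

Because every span carries its parent's ID, the cost of a tool call made inside a model step is attributed to that step and to the workflow above it, which is what per-workflow cost attribution relies on.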
Why traces feed evals
An eval set without trace context tells you which prompt scored highest on a static benchmark. An eval set built from traces tells you which prompt scored highest on the queries your users actually sent last week, with the retrieval context they actually saw. The first is research; the second is operations.
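A rough sketch of the second kind of eval set: cases drawn from recent traces, each keeping the query and the retrieval context the user actually saw. The trace record fields, the `EvalCase` shape, the seven-day window, and the deduplication by query text are assumptions chosen for the example; grading the observed outputs is a separate step not shown here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class EvalCase:
    trace_id: str
    query: str
    context: tuple          # retrieval evidence captured in the trace
    observed_output: str    # what the workflow answered, not a gold label


def eval_cases_from_traces(traces, now, window=timedelta(days=7)):
    """Keep one case per distinct query seen inside the time window."""
    cutoff = now - window
    seen = set()
    cases = []
    for t in traces:
        if t["started_at"] < cutoff or t["query"] in seen:
            continue
        seen.add(t["query"])
        cases.append(
            EvalCase(t["trace_id"], t["query"], tuple(t["retrieved"]), t["output"])
        )
    return cases


def utc(day):
    return datetime(2025, 1, day, tzinfo=timezone.utc)


# Hypothetical trace summaries; in practice these would come from the trace store.
traces = [
    {"trace_id": "t1", "started_at": utc(14), "query": "How do I reset my password?",
     "retrieved": ["kb/password-reset"], "output": "Use the reset link."},
    {"trace_id": "t2", "started_at": utc(2), "query": "Can I change my plan?",
     "retrieved": ["kb/plans"], "output": "Yes, under Billing."},
    {"trace_id": "t3", "started_at": utc(13), "query": "How do I reset my password?",
     "retrieved": ["kb/password-reset"], "output": "Click 'Forgot password'."},
    {"trace_id": "t4", "started_at": utc(10), "query": "Where is my invoice?",
     "retrieved": ["kb/invoices", "kb/billing"], "output": "See Billing > Invoices."},
]

cases = eval_cases_from_traces(traces, now=utc(15))
```

Here `t2` falls outside the window and `t3` repeats an earlier query, so the resulting set contains the cases built from `t1` and `t4`.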
What it works with
Every other layer reads from this one. Workflow Evals builds eval cases from traces. Self-Optimizing Agents reads route attribution. Governance reads audit events. Closed-Loop Knowledge clusters failures. Conversation Forensics replays threads from traces. Without observability, every other discipline is guessing.
When you need it
Signals: nobody can answer 'why did the agent do that'; AI cost is a single line on the cloud bill with no per-workflow attribution; a customer complains about a response and you cannot reproduce the run. If you run a production AI workflow that anyone will ask hard questions about, observability is the work to do first.
Related resources
- The test suite your AI workflows have to pass before any change reaches users, measuring quality, latency, cost, and safety on real production data instead of vibes.
- A governed catalog of every tool an AI agent can call (your APIs, your databases, your internal systems) with typed schemas, permission scopes, audit trails, and the standard protocol (MCP) that turns 'we exposed it to the LLM' into 'we know exactly who called what when'.
- The policy layer for what an AI system is allowed to read, call, decide, and ship, encoded as configuration the runtime enforces, not as a document on a shared drive.
- Deterministically re-running an AI workflow from its stored trace: the debugging primitive that makes 'why did the agent do that' a question with an answer.
A capability in the Group e-media information AI stack. This resource connects agent observability to the data substrate, agent runtime, evals, and operations.