Operations

Agent Observability

Trace-level visibility into every model call, retrieval, tool invocation, decision, approval, and failure inside an AI workflow — the substrate every other discipline (evals, optimization, governance) reads from.

Operating principle

Production AI is not a prompt. It is a system of context, tools, permissions, traces, evals, and feedback loops.

Trace everything

Agent behavior is non-deterministic. Traces make runs inspectable, comparable, replayable, and useful for evaluation. We emit OpenTelemetry-compatible spans for every model call, tool invocation, retrieval step, decision branch, approval event, and external side effect — with cost, latency, token counts, and inputs and outputs (subject to PII rules).

Model, token, and latency spans (OpenTelemetry)
Tool invocation inputs, outputs, and timing
Retrieval evidence, scores, and citations
Approval, rejection, and escalation events
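
As a minimal sketch of what such spans carry, the stdlib-only Python below models a span recorder. It is not the OpenTelemetry SDK: the `Tracer` and `Span` classes, the `PII_KEYS` set, and the attribute names (`input_tokens`, `output_tokens`, `cost_usd`) are hypothetical and chosen only for illustration. A real deployment would emit the same fields through an OpenTelemetry-compatible exporter.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List, Optional

# Hypothetical attribute keys that must never be stored verbatim.
PII_KEYS = {"email", "phone"}


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    kind: str  # e.g. "workflow", "model", "tool", "retrieval", "approval"
    attributes: Dict[str, Any] = field(default_factory=dict)
    duration_ms: float = 0.0


class Tracer:
    """Records nested spans in memory; a stand-in for a real exporter."""

    def __init__(self) -> None:
        self.finished: List[Span] = []
        self._stack: List[Span] = []

    @contextmanager
    def span(self, name: str, kind: str, **attributes: Any) -> Iterator[Span]:
        parent = self._stack[-1] if self._stack else None
        current = Span(
            # Child spans inherit the trace ID; a root span starts a new trace.
            trace_id=parent.trace_id if parent else uuid.uuid4().hex,
            span_id=uuid.uuid4().hex[:16],
            parent_id=parent.span_id if parent else None,
            name=name,
            kind=kind,
            # Redact PII-flagged attributes before anything is stored.
            attributes={
                key: ("[REDACTED]" if key in PII_KEYS else value)
                for key, value in attributes.items()
            },
        )
        self._stack.append(current)
        start = time.perf_counter()
        try:
            yield current
        finally:
            current.duration_ms = (time.perf_counter() - start) * 1000
            self._stack.pop()
            self.finished.append(current)


tracer = Tracer()
with tracer.span("handle_ticket", "workflow", email="user@example.com"):
    with tracer.span("llm.generate", "model",
                     input_tokens=812, output_tokens=96, cost_usd=0.0031):
        pass  # the model call would happen here
```

Inner spans finish first, so `tracer.finished` holds the model span ahead of the workflow span, both under the same trace ID.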

Stacks we use

Langfuse, Arize Phoenix, Helicone, and Datadog are common destinations depending on the rest of the stack. The data shape is what matters: spans that link parent-child across model-tool-retrieval boundaries, attributes for cost and quality, and a stable trace ID that follows a workflow from event in to outcome out.
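
To illustrate why the parent-child links matter, here is a hedged sketch that rebuilds a trace tree from flat span records and rolls cost up to each node. The record fields (`span_id`, `parent_id`, `cost_usd`) and the sample values are assumptions for the example, not the schema of any particular backend.

```python
from collections import defaultdict
from typing import Dict, List, Optional, TypedDict


class SpanRecord(TypedDict):
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    cost_usd: float


# Hypothetical flat export of one trace, in arrival order.
spans: List[SpanRecord] = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "name": "workflow", "cost_usd": 0.0},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "name": "retrieval", "cost_usd": 0.001},
    {"trace_id": "t1", "span_id": "c", "parent_id": "a", "name": "model", "cost_usd": 0.004},
    {"trace_id": "t1", "span_id": "d", "parent_id": "c", "name": "tool", "cost_usd": 0.002},
]


def children_index(records: List[SpanRecord]) -> Dict[Optional[str], List[SpanRecord]]:
    """Group spans by parent ID; root spans are stored under the key None."""
    index: Dict[Optional[str], List[SpanRecord]] = defaultdict(list)
    for record in records:
        index[record["parent_id"]].append(record)
    return index


def subtree_cost(span: SpanRecord, index: Dict[Optional[str], List[SpanRecord]]) -> float:
    """Cost of a span plus the cost of everything beneath it."""
    return span["cost_usd"] + sum(subtree_cost(c, index) for c in index[span["span_id"]])


def render(span: SpanRecord, index: Dict[Optional[str], List[SpanRecord]], depth: int = 0) -> List[str]:
    """Indented outline of the trace with rolled-up cost per node."""
    lines = [f"{'  ' * depth}{span['name']} (${subtree_cost(span, index):.3f})"]
    for child in index[span["span_id"]]:
        lines.extend(render(child, index, depth + 1))
    return lines


index = children_index(spans)
root = index[None][0]
outline = render(root, index)
```

Without `parent_id`, the tool call could not be attributed to the model step that triggered it, and the rolled-up cost per step would be lost.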

Why traces feed evals

An eval set without trace context tells you which prompt scored highest on a static benchmark. An eval set built from traces tells you which prompt scored highest on the queries your users actually sent last week, with the retrieval context they actually saw. The first is research; the second is operations.
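
One way this could look in practice is sketched below: recent traces are turned into eval cases that keep the user's query, the retrieval context the model actually saw, and the answer it gave. The trace record layout (`started_at`, `query`, `spans`, `documents`, `output`) and the `EvalCase` type are hypothetical, assumed for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List


@dataclass
class EvalCase:
    trace_id: str
    query: str
    retrieved_context: List[str]
    observed_answer: str


def build_eval_cases(traces: List[Dict[str, Any]], now: datetime,
                     window_days: int = 7) -> List[EvalCase]:
    """Turn traces from the last `window_days` days into eval cases."""
    cutoff = now - timedelta(days=window_days)
    cases: List[EvalCase] = []
    for trace in traces:
        if trace["started_at"] < cutoff:
            continue  # older than the window
        # Keep exactly the documents the retrieval steps returned.
        context = [doc for span in trace["spans"] if span["kind"] == "retrieval"
                   for doc in span["documents"]]
        model_spans = [span for span in trace["spans"] if span["kind"] == "model"]
        if not model_spans:
            continue  # nothing to evaluate
        cases.append(EvalCase(trace["trace_id"], trace["query"],
                              context, model_spans[-1]["output"]))
    return cases


now = datetime(2024, 6, 10, tzinfo=timezone.utc)
traces = [
    {"trace_id": "t-old", "started_at": now - timedelta(days=30),
     "query": "How do I reset my password?",
     "spans": [{"kind": "model", "output": "Use the reset link."}]},
    {"trace_id": "t-new", "started_at": now - timedelta(days=2),
     "query": "What is the refund policy?",
     "spans": [
         {"kind": "retrieval", "documents": ["Refunds are accepted within 30 days."]},
         {"kind": "model", "output": "You can request a refund within 30 days."},
     ]},
]
cases = build_eval_cases(traces, now)
```

Only the trace from the last week becomes a case, and it carries its own retrieval context, so a candidate prompt can be scored against the same evidence.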

What it works with

Every other layer reads from this one. Workflow Evals builds eval cases from traces. Self-Optimizing Agents reads route attribution. Governance reads audit events. Closed-Loop Knowledge clusters failures. Conversation Forensics replays threads from traces. Without observability, every other discipline is guessing.

When you need it

Signals: nobody can answer 'why did the agent do that'; AI cost is a single line on the cloud bill with no per-workflow attribution; a customer complains about a response and you cannot reproduce the run. If anyone is going to ask hard questions about a production AI workflow, observability is the work to do first.
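
The cost-attribution signal is the easiest to make concrete. As a rough sketch, assuming every span carries a `workflow` attribute and a `cost_usd` attribute (both names are hypothetical), per-workflow cost is a simple aggregation over spans:

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Mapping


def cost_by_workflow(spans: Iterable[Mapping[str, Any]]) -> Dict[str, float]:
    """Sum span cost per workflow; untagged spans are surfaced, not dropped."""
    totals: Dict[str, float] = defaultdict(float)
    for span in spans:
        totals[span.get("workflow", "unattributed")] += span.get("cost_usd", 0.0)
    return dict(totals)


# Hypothetical spans exported from a tracing backend.
spans = [
    {"workflow": "support_triage", "kind": "model", "cost_usd": 0.004},
    {"workflow": "support_triage", "kind": "tool", "cost_usd": 0.001},
    {"workflow": "invoice_extraction", "kind": "model", "cost_usd": 0.012},
    {"kind": "model", "cost_usd": 0.002},  # missing workflow tag
]
totals = cost_by_workflow(spans)
```

Keeping an explicit `unattributed` bucket makes gaps in instrumentation visible instead of silently shrinking the total.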

Related resources

Workflow Evals

The test suite your AI workflows have to pass before any change reaches users — measuring quality, latency, cost, and safety on real production data instead of vibes.

MCP Tool Registry

A governed catalog of every tool an AI agent can call — your APIs, your databases, your internal systems — with typed schemas, permission scopes, audit trails, and the standard protocol (MCP) that turns 'we exposed it to the LLM' into 'we know exactly who called what when'.

Governance

The policy layer for what an AI system is allowed to read, call, decide, and ship — encoded as configuration the runtime enforces, not as a document on a shared drive.

Trace Replay

Deterministically re-running an AI workflow from its stored trace — the debugging primitive that makes 'why did the agent do that' a question with an answer.

Eval Dashboards

A capability in the Group e-media information AI stack. This resource connects the subject to data substrate, agent runtime, evals, and operations.