Agent Observability

Trace-level visibility into every model call, retrieval, tool invocation, decision, approval, and failure inside an AI workflow — the substrate every other discipline (evals, optimization, governance) reads from.

Operating principle

Production AI is not a prompt. It is a system of context, tools, permissions, traces, evals, and feedback loops.

Trace everything

Agent behavior is non-deterministic. Traces make runs inspectable, comparable, replayable, and useful for evaluation. We emit OpenTelemetry-compatible spans for every model call, tool invocation, retrieval step, decision branch, approval event, and external side effect — with cost, latency, token counts, and inputs and outputs (subject to PII rules).

  • Model, token, and latency spans (OpenTelemetry)
  • Tool invocation inputs, outputs, and timing
  • Retrieval evidence, scores, and citations
  • Approval, rejection, and escalation events
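As a rough illustration of the span shape described above, here is a minimal sketch using only the Python standard library. It is not the OpenTelemetry SDK and not our production instrumentation: the `span` helper, the `FINISHED` list (a stand-in for an exporter), the `redact` rule, the attribute keys, and the stubbed workflow steps are all hypothetical names chosen for the example.

```python
import contextvars
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List, Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    attributes: Dict[str, Any] = field(default_factory=dict)
    start: float = 0.0
    end: float = 0.0

    @property
    def latency_ms(self) -> float:
        return (self.end - self.start) * 1000.0


FINISHED: List[Span] = []  # hypothetical stand-in for a span exporter
_current = contextvars.ContextVar("current_span", default=None)


@contextmanager
def span(name: str, **attributes: Any) -> Iterator[Span]:
    """Open a span that inherits the trace ID of the enclosing span, if any."""
    parent = _current.get()
    s = Span(
        trace_id=parent.trace_id if parent else uuid.uuid4().hex,
        span_id=uuid.uuid4().hex[:16],
        parent_id=parent.span_id if parent else None,
        name=name,
        attributes=dict(attributes),
        start=time.perf_counter(),
    )
    token = _current.set(s)
    try:
        yield s
    except Exception as exc:
        s.attributes["error"] = repr(exc)  # failures are recorded, then re-raised
        raise
    finally:
        s.end = time.perf_counter()
        _current.reset(token)
        FINISHED.append(s)


def redact(text: str) -> str:
    """Placeholder PII rule (assumption): real policies are far more specific."""
    return "[redacted]" if "@" in text else text


def run_workflow(question: str) -> str:
    """Stubbed workflow: retrieval, one model call, one tool call."""
    with span("workflow", input=redact(question)) as root:
        with span("retrieval", query=redact(question)) as r:
            r.attributes["doc_ids"] = ["doc-1", "doc-2"]
            r.attributes["scores"] = [0.91, 0.78]
        with span("model_call", model="example-model") as m:
            answer = "stub answer"
            m.attributes.update(input_tokens=120, output_tokens=30, cost_usd=0.0021)
        with span("tool_call", tool="lookup_order") as t:
            t.attributes["output"] = {"status": "shipped"}
        root.attributes["output"] = answer
        return answer
```

Calling `run_workflow(...)` leaves four spans in `FINISHED`, all sharing one trace ID, with the three step spans pointing at the workflow span as their parent. In a real deployment the same structure would typically be produced by an OpenTelemetry SDK and shipped to a tracing backend rather than appended to a list.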

Stacks we use

Langfuse, Arize Phoenix, Helicone, and Datadog are common destinations; which one fits depends on the rest of the stack. The data shape is what matters: spans that link parent to child across model, tool, and retrieval boundaries, attributes for cost and quality, and a stable trace ID that follows a workflow from the triggering event to the final outcome.
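To make the data shape concrete, the sketch below takes a flat list of exported spans, rebuilds the parent-child tree for one trace, and rolls cost up per workflow. The span records, field names (`workflow`, `cost_usd`), and workflow names are invented for illustration; real backends each have their own export schema.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional, Tuple

# Hypothetical exported spans; two traces, one per workflow run.
SPANS: List[Dict[str, Any]] = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "name": "workflow",
     "workflow": "refund-triage", "cost_usd": 0.0},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "name": "retrieval",
     "cost_usd": 0.0004},
    {"trace_id": "t1", "span_id": "c", "parent_id": "a", "name": "model_call",
     "cost_usd": 0.0031},
    {"trace_id": "t1", "span_id": "d", "parent_id": "c", "name": "tool_call",
     "cost_usd": 0.0},
    {"trace_id": "t2", "span_id": "e", "parent_id": None, "name": "workflow",
     "workflow": "order-status", "cost_usd": 0.0},
    {"trace_id": "t2", "span_id": "f", "parent_id": "e", "name": "model_call",
     "cost_usd": 0.0012},
]


def children_index(
    spans: List[Dict[str, Any]],
) -> Dict[Tuple[str, Optional[str]], List[Dict[str, Any]]]:
    """Group spans by (trace_id, parent_id) so the tree can be walked top-down."""
    index: Dict[Tuple[str, Optional[str]], List[Dict[str, Any]]] = defaultdict(list)
    for s in spans:
        index[(s["trace_id"], s["parent_id"])].append(s)
    return index


def render(spans: List[Dict[str, Any]], trace_id: str) -> List[str]:
    """Return the span tree of one trace as indented lines."""
    index = children_index(spans)
    lines: List[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for s in index.get((trace_id, parent_id), []):
            lines.append("  " * depth + s["name"])
            walk(s["span_id"], depth + 1)

    walk(None, 0)
    return lines


def cost_by_workflow(spans: List[Dict[str, Any]]) -> Dict[str, float]:
    """Sum span cost per workflow, using the root span to name each trace."""
    workflow_of = {s["trace_id"]: s["workflow"] for s in spans if s["parent_id"] is None}
    totals: Dict[str, float] = defaultdict(float)
    for s in spans:
        totals[workflow_of[s["trace_id"]]] += s.get("cost_usd", 0.0)
    return dict(totals)
```

Here `render(SPANS, "t1")` shows the tool call nested under the model call that triggered it, and `cost_by_workflow(SPANS)` turns a single undifferentiated bill into a per-workflow breakdown. Both only work because every span carries the trace ID and a parent pointer.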

Why traces feed evals

An eval set without trace context tells you which prompt scored highest on a static benchmark. An eval set built from traces tells you which prompt scored highest on the queries your users actually sent last week, with the retrieval context they actually saw. The first is research; the second is operations.
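One way this can look in practice is sketched below: eval cases are cut from last week's traces, keeping the query and the retrieval context the user actually saw. The trace record fields (`feedback`, `retrieved`, `output`), the seven-day window, the "accepted" filter, and exact-match scoring are simplifying assumptions for the example, not a description of a specific product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Any, Callable, Dict, List, Tuple


@dataclass(frozen=True)
class EvalCase:
    trace_id: str
    query: str
    context: Tuple[str, ...]  # retrieval context recorded in the trace
    reference: str            # output a human accepted in production


def cases_from_traces(
    traces: List[Dict[str, Any]],
    now: datetime,
    window: timedelta = timedelta(days=7),
) -> List[EvalCase]:
    """Keep recent traces whose output was accepted, and turn them into eval cases."""
    cases: List[EvalCase] = []
    for t in traces:
        if now - t["timestamp"] > window:
            continue  # outside the window of recent traffic
        if t.get("feedback") != "accepted":
            continue  # no trustworthy reference answer for this run
        cases.append(
            EvalCase(t["trace_id"], t["query"], tuple(t["retrieved"]), t["output"])
        )
    return cases


def accuracy(
    candidate: Callable[[str, Tuple[str, ...]], str], cases: List[EvalCase]
) -> float:
    """Exact-match score of a candidate against trace-derived references (crude on purpose)."""
    if not cases:
        return 0.0
    hits = sum(candidate(c.query, c.context) == c.reference for c in cases)
    return hits / len(cases)


# Hypothetical trace records for illustration only.
NOW = datetime(2024, 1, 15, tzinfo=timezone.utc)
TRACES: List[Dict[str, Any]] = [
    {"trace_id": "t1", "timestamp": NOW - timedelta(days=2), "feedback": "accepted",
     "query": "How long do refunds take?", "retrieved": ["Refunds take 5 days."],
     "output": "Refunds take 5 days."},
    {"trace_id": "t2", "timestamp": NOW - timedelta(days=30), "feedback": "accepted",
     "query": "Old question", "retrieved": ["Old doc"], "output": "Old doc"},
    {"trace_id": "t3", "timestamp": NOW - timedelta(days=1), "feedback": "rejected",
     "query": "Where is my order?", "retrieved": [], "output": "I don't know."},
]
```

With this data, `cases_from_traces(TRACES, NOW)` keeps only `t1`: the month-old trace falls outside the window and the rejected run has no usable reference. A candidate prompt or pipeline, wrapped as a function of query and context, can then be scored with `accuracy` against real traffic rather than a static benchmark.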

What it works with

Every other layer reads from this one. Workflow Evals builds eval cases from traces. Self-Optimizing Agents reads route attribution. Governance reads audit events. Closed-Loop Knowledge clusters failures. Conversation Forensics replays threads from traces. Without observability, every other discipline is guessing.

When you need it

Signals: nobody can answer "why did the agent do that?"; AI cost shows up as a single line on the cloud bill with no per-workflow attribution; a customer complains about a response and you cannot reproduce the run. If anyone is going to ask hard questions about a production AI workflow, observability is the work to do first.