Production Agent Interfaces
The chat surface as an operating console — knowledge bases plugged in, tools connected, agents on a roster, with real-time visibility into context budget, token spend, model choice, and concrete savings opportunities. The interface that lets a team actually run an agent in production, not just demo one.
If you can't see the cost, the context, or the model choice on a turn-by-turn basis, you're not running the agent — the agent is running you. A production chat interface is an observability surface first, a chat second.
“I want to launch agents you can easily connect tools to — with built-in visibility on cost, context, and the right model for the right task.”
Why this paper exists
Most AI chat products optimize for one thing: the quality of the answer. The cost of getting that answer, the model that produced it, the context budget consumed, the tools called along the way, the knowledge sources used — all of it stays hidden behind a smooth conversational surface. That's fine for a consumer toy. It's not fine for a production operation. When the team that runs the agent can't see what the agent is doing, they can't trust it, can't optimize it, and can't stop a runaway turn before the bill arrives. This paper is the design of a chat interface that fixes that: an operating console that happens to look like a chat.
The visibility imperative
A production agent interface exposes five real-time signals on every turn: which model handled it, how much of the context window the turn consumed, what tools were called, which knowledge sources were retrieved, and the dollar cost of input + output tokens. Those signals aren't buried in admin logs or a backend dashboard — they live next to the message bubble, where the team running the agent can read them at a glance. The interface is honest about what just happened, including when it spent more than it should have.
- Context budget · % of window used + raw token counts (e.g. 40K / 200K)
- Cost · input + output, real dollars, per turn and cumulative for the session
- Model · which one ran this turn, why it was chosen
- Tools · which MCP tools were called, with arguments and outcomes
- Knowledge · which corpora / documents were touched, with citation chips
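One way to picture the five signals is as a single record attached to each message bubble. The sketch below is a minimal illustration, not a schema from this paper: the class name `TurnSignals`, its field names, and the badge format are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class TurnSignals:
    """Per-turn signals shown next to a message bubble (hypothetical schema)."""
    model: str                   # which model ran this turn
    model_reason: str            # why the router chose it
    context_used_tokens: int     # tokens occupying the window after this turn
    context_window_tokens: int   # total window size
    cost_usd: float              # input + output cost for this turn
    tools_called: list[str] = field(default_factory=list)
    knowledge_sources: list[str] = field(default_factory=list)

    def context_percent(self) -> float:
        return 100.0 * self.context_used_tokens / self.context_window_tokens

    def badge(self) -> str:
        """Compact label combining context usage, model, and turn cost."""
        used = f"{self.context_used_tokens // 1000}K"
        total = f"{self.context_window_tokens // 1000}K"
        return f"{used}/{total} · {self.model} · ${self.cost_usd:.4f}"
```

Keeping all five signals in one record means the bubble, the session header, and any export can read from the same object instead of reassembling it from logs.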
Knowledge bases plugged in
Every chat surface needs a corpus binding. The interface exposes the available knowledge bases as first-class chips next to the composer, selectable per turn, with their freshness and document count visible. The agent uses them transparently, and citations come back inline with anchors to the source documents. Users see what was read and where the answer came from. Operators see which corpora are pulling weight and which are dead.
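The operator view of "pulling weight" versus "dead" can be approximated by counting citations per bound corpus. This is a rough sketch under assumptions: `Citation`, `corpus_usage`, and `dead_corpora` are hypothetical names, and a real system would likely weigh retrievals as well as citations.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    corpus: str     # knowledge base the chunk came from
    document: str   # source document identifier
    anchor: str     # location inside the document, used for the citation chip


def corpus_usage(bound_corpora: list[str], citations: list[Citation]) -> dict[str, int]:
    """Count citations per bound corpus, including corpora with zero hits."""
    counts = Counter(c.corpus for c in citations)
    return {corpus: counts.get(corpus, 0) for corpus in bound_corpora}


def dead_corpora(bound_corpora: list[str], citations: list[Citation]) -> list[str]:
    """Corpora that are bound to the chat but were never cited."""
    usage = corpus_usage(bound_corpora, citations)
    return [corpus for corpus in bound_corpora if usage[corpus] == 0]
```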
Tools connected
A production chat surface is also a tool surface. MCP servers connect once and expose their actions to the agent — Linear, Notion, Stripe, Slack, an internal API, whatever the team needs. The interface renders tool invocations as inspectable cards: which tool, which arguments, which scope, what came back. A team member reviewing a session can replay the tool call, dispute the result, or revoke the tool's scope without touching infrastructure code. Tools are first-class, not magic.
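A tool card and a revocable scope could be modeled roughly as follows. This is an in-memory sketch, not the MCP wire format: `ToolCall`, `ToolScopes`, the status values, and the example scope string are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ToolCall:
    """What an inspectable card would show: tool, arguments, scope, outcome."""
    tool: str
    arguments: dict[str, Any]
    scope: str
    result: Any = None
    status: str = "pending"   # pending | ok | denied


class ToolScopes:
    """In-memory scope grants; a real system would persist and audit these."""

    def __init__(self, granted: set[str]):
        self.granted = set(granted)

    def revoke(self, scope: str) -> None:
        self.granted.discard(scope)

    def invoke(self, call: ToolCall, handler: Callable[..., Any]) -> ToolCall:
        # Refuse the call, and record the refusal, when the scope is not granted.
        if call.scope not in self.granted:
            call.status = "denied"
            return call
        call.result = handler(**call.arguments)
        call.status = "ok"
        return call
```

Because the card stores the arguments, replaying a call amounts to invoking the same `ToolCall` again, and a revoked scope shows up as a `denied` card rather than a silent failure.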
Agents on a roster
Switching agents inside the same conversation should feel like switching channels. The interface keeps a roster of available agents (Concierge for customer-facing work, Operations for incident triage, Sales for pre-call briefs, a custom one for your domain) and lets the user pick one or auto-routes by intent. Each agent carries its own scope, tools, and knowledge. The conversation history stays continuous; the brain handling it changes. The interface makes that transition visible (no silent persona swap).
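The "continuous history, visible handoff" idea can be sketched as a conversation object that logs every agent switch into the same history the user sees. The names `Agent`, `Conversation`, and `route`, and the intent labels, are hypothetical; intent detection itself is assumed to happen elsewhere.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    intents: set[str]   # intents this agent is responsible for


@dataclass
class Conversation:
    roster: list[Agent]
    active: Agent
    history: list[str] = field(default_factory=list)

    def route(self, intent: str) -> Agent:
        """Keep the active agent if it owns the intent; otherwise hand off visibly."""
        if intent in self.active.intents:
            return self.active
        for agent in self.roster:
            if intent in agent.intents:
                # The handoff is written into the shared history, never hidden.
                self.history.append(f"[handoff] {self.active.name} -> {agent.name}")
                self.active = agent
                break
        return self.active
```

An unknown intent leaves the active agent in place, which is one possible default; a team might prefer to fall back to a designated agent instead.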
Context overview
The context window is a budget, and the interface treats it like one. A small ring next to the composer shows the percentage used; a pop-out panel breaks down input tokens, output tokens, conversation history retained, system prompt size, and retrieved chunks. Users see when the agent is about to run out of room before it starts compressing — or before the next turn costs unexpectedly more because the window finally tipped over.
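The breakdown behind the ring and the pop-out panel is simple arithmetic over the parts of the window. The sketch below assumes hypothetical field names and an 80% warning threshold; neither comes from this paper.

```python
from dataclasses import dataclass


@dataclass
class ContextBudget:
    window_tokens: int      # total context window
    system_prompt: int      # tokens in the system prompt
    history: int            # conversation history retained
    retrieved_chunks: int   # tokens from retrieved knowledge
    input_tokens: int       # the current user input
    output_tokens: int      # the model's output for this turn

    def used(self) -> int:
        return (self.system_prompt + self.history + self.retrieved_chunks
                + self.input_tokens + self.output_tokens)

    def percent_used(self) -> float:
        return round(100.0 * self.used() / self.window_tokens, 1)

    def near_limit(self, threshold: float = 80.0) -> bool:
        """True when the ring should warn; the 80% default is an assumption."""
        return self.percent_used() >= threshold
```

For example, a 4,200-token system prompt, 20,000 tokens of history, 12,000 tokens of retrieved chunks, 2,800 input tokens, and 1,000 output tokens add up to 40,000 of a 200,000-token window, or 20%.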
Cost overview
Cost is rendered as currency, not tokens. Every turn shows its input + output cost; a session total accumulates in the header. Cross-model sessions break the cost down per model so the user can see exactly what the Sonnet escalation added vs the Haiku triage that started the conversation. For internal teams, the cost panel ties to the user's role-scoped budget; for customer-facing surfaces it stays operator-side. Nobody is surprised by month-end.
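Converting tokens to currency and splitting a session by model might look like the sketch below. The model names and per-million-token prices are placeholders chosen for easy arithmetic, not any vendor's pricing; `turn_cost` and `session_costs` are hypothetical helpers.

```python
from collections import defaultdict

# Placeholder USD prices per million tokens (assumed values, not vendor pricing).
PRICES = {
    "triage-model": {"input": 1.00, "output": 5.00},
    "escalation-model": {"input": 3.00, "output": 15.00},
}


def turn_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one turn: input and output tokens priced separately."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def session_costs(turns: list[tuple[str, int, int]]) -> tuple[float, dict[str, float]]:
    """Return (session total, per-model breakdown) for (model, in, out) turns."""
    per_model: dict[str, float] = defaultdict(float)
    for model, input_tokens, output_tokens in turns:
        per_model[model] += turn_cost(model, input_tokens, output_tokens)
    return sum(per_model.values()), dict(per_model)
```

With these placeholder prices, a triage turn of 2,000 input and 500 output tokens costs $0.0045, and an escalation turn of 10,000 input and 1,000 output tokens costs $0.045, so the header would show a session total of $0.0495.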
Model routing per turn
The interface doesn't hide which model handled a turn — and more importantly, doesn't hide which model *should* have handled it. A cheaper model that would have produced the same output gets surfaced as a suggestion. An expensive one that handled a trivial task gets flagged. The routing logic is visible, auditable, and tunable. Over time the team builds a model-routing policy in the same way they'd build a CDN cache policy: by reading what's actually happening and tightening the rules.
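A visible, tunable routing policy can start as a plain rule table plus an audit check that compares the model that ran against the model the policy would pick. Everything named here (`RouteRule`, `POLICY`, `choose_model`, `audit`, the intent labels, the model names) is a hypothetical illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RouteRule:
    intent: str
    model: str
    reason: str   # shown to the user as the explanation for the choice


# Assumed example policy; a team would tune this from observed traffic.
POLICY = [
    RouteRule("faq", "small-model", "short factual answers"),
    RouteRule("code_review", "large-model", "multi-step reasoning"),
]
DEFAULT = RouteRule("*", "large-model", "no rule matched")


def choose_model(intent: str) -> RouteRule:
    for rule in POLICY:
        if rule.intent == intent:
            return rule
    return DEFAULT


def audit(intent: str, model_used: str) -> Optional[str]:
    """Flag a turn whose model differs from what the policy would pick."""
    expected = choose_model(intent)
    if model_used != expected.model:
        return (f"{intent}: ran on {model_used}, "
                f"policy suggests {expected.model} ({expected.reason})")
    return None
```

Storing the reason alongside the rule is what makes the choice explainable: the same string can feed the popover and the audit flag.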
Savings suggestions
The most underrated feature of a production chat interface is the moment it tells you you're spending too much. Specific, concrete suggestions: 'This intent class is currently routed to Sonnet 4.6 at $0.18/turn; Haiku handles 94% of the eval set at $0.012/turn — consider re-routing'; 'Retrieval is pulling 32 chunks per turn but the model is only citing 4 — consider lowering top-k'; 'System prompt is 4.2K tokens; 1.8K of it has not changed an answer in the last 200 turns — consider compression'. The interface earns its keep by paying for itself.
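Two of the suggestions quoted above reduce to small threshold checks. The sketch below uses the figures from the text as inputs; the function names and the default thresholds (a 25% citation ratio, a 90% eval pass rate) are assumptions.

```python
from typing import Optional


def retrieval_suggestion(retrieved: int, cited: int,
                         min_ratio: float = 0.25) -> Optional[str]:
    """Suggest lowering top-k when few retrieved chunks end up cited."""
    if retrieved and cited / retrieved < min_ratio:
        return (f"Retrieval pulls {retrieved} chunks per turn but only "
                f"{cited} are cited; consider lowering top-k.")
    return None


def routing_suggestion(current_cost: float, cheaper_cost: float,
                       cheaper_pass_rate: float,
                       min_pass_rate: float = 0.90) -> Optional[str]:
    """Suggest re-routing when a cheaper model passes enough of the eval set."""
    if cheaper_cost < current_cost and cheaper_pass_rate >= min_pass_rate:
        saving = 100 * (1 - cheaper_cost / current_cost)
        return (f"A cheaper model passes {cheaper_pass_rate:.0%} of the eval set "
                f"at about {saving:.0f}% lower cost per turn; consider re-routing.")
    return None
```

With 32 chunks retrieved and 4 cited, the citation ratio is 12.5%, so the first check fires. With $0.18 versus $0.012 per turn and a 94% pass rate, the second check reports a saving of roughly 93% per turn.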
Trace exposure
Every turn leaves an inspectable trace: the stages it went through, the tools it called, the chunks it retrieved, the validation it passed. The trace lives where it happens (the chat surface) rather than in a separate observability tool. Click on a bot bubble, get the stage-by-stage execution. That trace is the same artifact the eval system reads, the optimization loop mines, and the auditor reviews. One artifact, three jobs.
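A trace that serves the chat surface, the eval system, and the auditor can be a single serializable record of stages. This is a minimal sketch; `Stage`, `TurnTrace`, and the stage names are hypothetical, and a production trace would carry timestamps and identifiers as well.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Stage:
    name: str      # e.g. "retrieve", "tool_call", "validate"
    detail: dict   # stage-specific data shown in the expanded view
    ok: bool = True


@dataclass
class TurnTrace:
    turn_id: str
    stages: list[Stage] = field(default_factory=list)

    def record(self, name: str, ok: bool = True, **detail) -> None:
        """Append one stage in execution order."""
        self.stages.append(Stage(name, detail, ok))

    def passed(self) -> bool:
        return all(stage.ok for stage in self.stages)

    def to_json(self) -> str:
        """One serialized artifact for the UI, the evals, and the audit to share."""
        return json.dumps(asdict(self), sort_keys=True)
```

Serializing once and reading the same JSON everywhere avoids the drift that appears when the UI and the observability backend each keep their own version of what happened.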
What it is not
Not a debug panel bolted onto a chatbot — the visibility is the product, not an admin add-on. Not a separate operator UI that lives somewhere else — the operator and the user can be the same person, on the same screen, in the same session. Not a wrapper around a single LLM provider — the interface is a multi-model, multi-tool, multi-agent runtime that exposes its own internals honestly.
What we are building
The build has six parts:
- An AI SDK Elements-compatible composer with built-in context and cost panels
- A roster picker tied to the agent registry
- MCP tool integration with inline trace cards
- Knowledge-base chips that surface per turn
- A model router with a tunable policy and an explainability popover
- A suggestions engine that watches the session for savings opportunities
The result is a chat surface a team can stand up in front of any agent and trust to keep it honest.
Future possibilities
Persistent cross-session learning of routing preferences ("this user always wants Sonnet for code review, Haiku for FAQ"); shared team-level cost budgets with soft alerts; per-user role-scoped tool access right in the composer; auto-generated runbooks from frequently-traced sessions; multi-agent collaboration surfaces where two agents trade off on the same turn with visible handoffs.
Conclusion
The chat interface is the production console for the agent era. Once you accept that, the design constraints flip: what looks like a chat product becomes a system surface, and what was hidden becomes the actual value. A team running a production agent doesn't need a prettier chat. They need to see what the agent costs, what the agent reads, what tools the agent touched, and where they're leaving money on the table. The interface this paper describes makes that visible by default.