Research notes

Chat Orchestration Runtime

The end-to-end architecture of modern conversational AI systems: model-agnostic, client-agnostic, plugin-driven runtimes that coordinate intent, context, retrieval, tools, reasoning, reflection, memory, and rendering — with the LLM as one interchangeable component, not the system.

Operating principle

The model is not the system. The orchestration runtime is. It owns execution, memory, safety, context assembly, and delivery; the model is the reasoning engine you swap when something better arrives.

Why this paper exists

Modern AI applications are no longer wrappers around a single LLM. The real intelligence emerges from the orchestration layer that coordinates models, tools, memory, retrieval, reasoning loops, guardrails, and rendering systems. Teams that ship reliable AI in production are not the ones with the best prompts — they are the ones who built the runtime around the prompt. This paper sets out the architecture we use when we build that runtime.

Core philosophy

The runtime stays model-agnostic, client-agnostic, and plugin-driven. Models become interchangeable reasoning engines. Clients (web, shell, API, SMS, email, voice, embedded) describe their capabilities and the runtime adapts the output shape to match. Plugins extend retrieval, tools, guardrails, and rendering without changes to the core. This is what makes a runtime survive the next model release.

  • Model-agnostic — swap providers without rewriting the system
  • Client-agnostic — same brain, many surfaces
  • Plugin-driven — capabilities extend without core changes
  • Trace-first — every stage emits structured execution data
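
A minimal sketch of what "model-agnostic" can look like in code, assuming the runtime talks to every provider through one narrow interface. The names ReasoningEngine, EchoEngine, and Runtime are hypothetical and exist only for illustration; EchoEngine stands in for a real provider client.

```python
from dataclasses import dataclass
from typing import Protocol


class ReasoningEngine(Protocol):
    """Hypothetical interface: anything that can complete a prompt."""

    def complete(self, prompt: str) -> str: ...


@dataclass
class EchoEngine:
    """Stand-in provider used only for illustration; it echoes the prompt."""

    name: str

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


@dataclass
class Runtime:
    # The engine is a replaceable field; nothing else depends on the provider.
    engine: ReasoningEngine

    def run_turn(self, user_text: str) -> str:
        prompt = user_text.strip()  # ingestion and context assembly would go here
        return self.engine.complete(prompt)


runtime = Runtime(engine=EchoEngine("provider-a"))
first = runtime.run_turn("  hello ")
runtime.engine = EchoEngine("provider-b")  # model swap, same runtime
second = runtime.run_turn("  hello ")
```

The point of the design is that the swap touches one field and no calling code.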

Execution flow — ten stages

Every turn moves through a layered execution pipeline. Input ingestion normalizes incoming content from the active client. Lightweight preprocessing detects sentiment, urgency, and language with a fast, inexpensive model. Intent and entity extraction does two things at once: it classifies the verb of the request (what the user wants done) and extracts the structured entities the request operates on, namely the primary subject plus time, geography, scope, and any other typed slots the request fills in. Context enrichment (the context-engineering stage) assembles the prompt state within a token budget by combining retrieval, compression, memory selection, semantic ranking, and freshness scoring. Retrieval-and-memory assembly merges vector, keyword, graph, and structured sources with the conversation's working memory. Tool orchestration invokes APIs, code, queries, charts, files, or external agents through governed connections. The main reasoning loop is iterative (plan, call, reflect, retry), with reasoning depth scaled to complexity and cost budget. Reflection and validation checks the output for hallucinations, citation accuracy, policy compliance, and overall quality. Output formatting renders to the client's preferred medium. Guardrails and delivery enforces the last policy checks before the response leaves the runtime.

  • 1. Input ingestion
  • 2. Lightweight preprocessing
  • 3. Intent & entities
  • 4. Context enrichment
  • 5. Retrieval and memory assembly
  • 6. Tool orchestration
  • 7. Main reasoning loop
  • 8. Reflection and validation
  • 9. Output formatting
  • 10. Guardrails and delivery
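
The staged, trace-first flow above can be sketched as a list of functions that each take and return a state dictionary, with the runner recording one trace entry per stage. This is an illustrative sketch only: it implements just the first two stages, the keyword check is a placeholder for a fast classifier model, and all function and field names are assumptions.

```python
import time
from typing import Any, Callable

State = dict[str, Any]
Stage = Callable[[State], State]


def ingest(state: State) -> State:
    # Stage 1: normalize the raw client payload.
    return {**state, "text": state["raw"].strip()}


def preprocess(state: State) -> State:
    # Stage 2: a keyword check stands in for a fast classifier model.
    return {**state, "urgent": "urgent" in state["text"].lower()}


def run_pipeline(stages: list[tuple[str, Stage]], state: State):
    """Run stages in order and emit one structured trace entry per stage."""
    trace = []
    for name, stage in stages:
        started = time.perf_counter()
        state = stage(state)
        trace.append({
            "stage": name,
            "elapsed_ms": (time.perf_counter() - started) * 1000,
            "state_keys": sorted(state),
        })
    return state, trace


final_state, trace = run_pipeline(
    [("input_ingestion", ingest), ("lightweight_preprocessing", preprocess)],
    {"raw": "  URGENT: checkout is failing  "},
)
```

Adding the remaining eight stages means appending more (name, function) pairs; the runner and the trace format stay the same.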

Intent and entities — what the request is, and what it operates on

Older runtimes treat intent classification as a single routing step. In practice, every request carries an intent plus a set of typed entities, and the runtime needs both to do anything useful. 'Give me Ring's revenue for today in Canada' has one intent (a revenue request) and three entities: { game: 'Ring' } as the primary subject, { period: 'today' } as the temporal scope, and { region: 'CA' } as the geographic scope. 'Give me Concierge's revenue for last quarter in EMEA' is the same intent with a different entity set. Collapsing intent and entities into a single routing decision merges what should stay separate. Treating them as two extraction tasks (one classifier for the verb, one structured extractor for the slots) makes both reusable across many request variants. The intent picks the workflow, tool, or specialized reasoning mode. The entities (schema.org-style, each with a type, identifying fields, and a role) drive retrieval scope, tool argument binding, permission resolution, and the slice of memory that survives context compression. Entities are plural by design: a single request can name a game, a country, a time window, a customer segment, and a comparison cohort all at once, and each one feeds the stages that follow.

  • Intent: the verb · classified into a known workflow or reasoning mode
  • Entities: typed slots · primary subject, time, geography, scope, comparators
  • Each entity carries a role — schema.org-style { type, name, role, identifiers }
  • Same intent + different entities = same code path, different binding
  • Same subject entity + different intents = a session that 'follows' that entity
  • Entities drive retrieval scope, tool arguments, permissions, and memory selection
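
The two revenue requests above can be sketched as data, to show "same intent + different entities = same code path, different binding". The class names, the WORKFLOWS table, the intent label, and the workflow name are hypothetical; the entity fields follow the { type, name, role, identifiers } shape listed above.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    type: str
    name: str
    role: str
    identifiers: dict = field(default_factory=dict)


@dataclass
class ParsedRequest:
    intent: str
    entities: list


# Hypothetical intent -> workflow table.
WORKFLOWS = {"revenue_report": "run_revenue_query"}


def route(request: ParsedRequest) -> str:
    """The intent alone picks the code path."""
    return WORKFLOWS[request.intent]


def bind_arguments(request: ParsedRequest) -> dict:
    """The entities alone fill the tool arguments."""
    return {entity.type: entity.name for entity in request.entities}


ring = ParsedRequest("revenue_report", [
    Entity("game", "Ring", "primary_subject"),
    Entity("period", "today", "temporal_scope"),
    Entity("region", "CA", "geographic_scope"),
])
# The subject type for Concierge is assumed to match Ring's.
concierge = ParsedRequest("revenue_report", [
    Entity("game", "Concierge", "primary_subject"),
    Entity("period", "last_quarter", "temporal_scope"),
    Entity("region", "EMEA", "geographic_scope"),
])
```

Both requests route to the same workflow; only the bound arguments differ.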

Context engineering as the differentiator

The stage that separates a competent runtime from a great one is context engineering. The prompt state is not a static template; it is dynamically assembled from retrieval results and selected memories, ranked by semantic relevance and freshness, under a token budget that must fit the chosen model's context window. The runtime negotiates: which memories survive, which retrieved chunks earn their tokens, which prior turns get summarized rather than kept verbatim. The output quality of every downstream stage is bounded by the quality of this one.
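
One simple way to picture the negotiation is a greedy selection under a token budget. The sketch below is an assumption, not the runtime's actual policy: the scoring weights are arbitrary, and whitespace word count stands in for a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    relevance: float  # 0..1, e.g. from semantic ranking
    freshness: float  # 0..1, newer is higher


def estimate_tokens(text: str) -> int:
    # Crude stand-in for a tokenizer: counts whitespace-separated words.
    return len(text.split())


def assemble_context(candidates, budget: int, freshness_weight: float = 0.3):
    """Keep the highest-scoring candidates that still fit the budget."""
    def score(c: Candidate) -> float:
        return (1 - freshness_weight) * c.relevance + freshness_weight * c.freshness

    chosen, used = [], 0
    for candidate in sorted(candidates, key=score, reverse=True):
        cost = estimate_tokens(candidate.text)
        if used + cost <= budget:
            chosen.append(candidate)
            used += cost
    return chosen, used


candidates = [
    Candidate("refund policy updated last week", relevance=0.9, freshness=0.8),
    Candidate("old marketing copy from two years ago", relevance=0.2, freshness=0.1),
    Candidate("customer reported login failure twice", relevance=0.8, freshness=0.9),
]
chosen, used = assemble_context(candidates, budget=10)
```

With a budget of 10 estimated tokens, the two high-scoring chunks fit and the stale marketing copy is dropped.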

Retrieval-augmented generation

Modern RAG is hybrid. Vector search alone misses exact strings; keyword search alone misses semantic siblings; graph retrieval surfaces relationships neither catches. Combine semantic, symbolic, graph, and structured retrieval with reranking and temporal relevance scoring. The runtime owns the retrieval policy — which sources for which intents, with what freshness, under whose permissions — not the application code.
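
As one illustration of combining retrievers, the sketch below merges ranked lists with reciprocal rank fusion (RRF), a common fusion method. The text does not say which fusion method the runtime uses, so RRF here is an assumption, and the document IDs are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k: int = 60):
    """Fuse ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["doc_b", "doc_a", "doc_c"]   # semantic neighbours
keyword_hits = ["doc_a", "doc_d"]           # exact-string matches
graph_hits = ["doc_a", "doc_b"]             # related via relationships

fused = reciprocal_rank_fusion([vector_hits, keyword_hits, graph_hits])
```

A document that several retrievers agree on (doc_a) rises to the top even though no single retriever's raw scores are comparable with another's.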

Tools and MCP execution

Tool orchestration is where the runtime meets the real world. APIs, code execution, document retrieval, database queries, chart generation, file manipulation, external agent calls — all routed through a registry that enforces permissions, scope, retries, timeouts, and execution tracing. The Model Context Protocol (MCP) is the open standard we use to declare tools with typed schemas; the registry is what makes those tools safe to expose.
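
A registry of this kind might look like the following sketch, which covers permission checks, retries, and tracing. Timeouts are omitted for brevity, and every name here (Tool, ToolRegistry, the crm:read scope, flaky_lookup) is hypothetical; this is not the MCP SDK.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Tool:
    name: str
    fn: Callable[..., Any]
    required_scope: str
    max_retries: int = 2


@dataclass
class ToolRegistry:
    tools: dict = field(default_factory=dict)
    trace: list = field(default_factory=list)

    def register(self, tool: Tool) -> None:
        self.tools[tool.name] = tool

    def call(self, name: str, caller_scopes: set, **args) -> Any:
        tool = self.tools[name]
        if tool.required_scope not in caller_scopes:
            self.trace.append({"tool": name, "status": "denied"})
            raise PermissionError(f"{name} requires scope {tool.required_scope}")
        for attempt in range(1, tool.max_retries + 2):
            try:
                result = tool.fn(**args)
                self.trace.append({"tool": name, "status": "ok", "attempt": attempt})
                return result
            except ConnectionError:
                # Only transient errors are retried in this sketch.
                self.trace.append({"tool": name, "status": "error", "attempt": attempt})
        raise RuntimeError(f"{name} failed after {tool.max_retries + 1} attempts")


attempts = {"count": 0}


def flaky_lookup(customer_id: str) -> dict:
    """Illustrative tool that fails once, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise ConnectionError("transient failure")
    return {"customer_id": customer_id, "open_tickets": 3}


registry = ToolRegistry()
registry.register(Tool("crm_lookup", flaky_lookup, required_scope="crm:read"))
result = registry.call("crm_lookup", {"crm:read"}, customer_id="c-1")
```

Because every call goes through the registry, the permission decision and each attempt end up in the same trace.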

Adaptive reasoning

The main reasoning loop is the cognitive engine. It supports iterative execution with planning, tool use, reflection, and retry strategies — but reasoning depth is adaptive. A simple FAQ does not need a planner; an incident triage with five tool calls does. Cost and latency constraints shape how deep the loop goes on any given turn. The runtime decides; the model executes.
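
The idea that the runtime, not the model, decides loop depth can be sketched as a small budgeting function. The per-step cost and latency figures, and the rule of thumb for how many steps a task wants, are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class TurnBudget:
    max_cost_cents: int
    max_latency_ms: int


def choose_max_steps(
    expected_tool_calls: int,
    budget: TurnBudget,
    cost_per_step_cents: int = 1,
    latency_per_step_ms: int = 2000,
) -> int:
    """Return how many plan/call/reflect iterations to allow this turn."""
    # A request with no tool calls gets a single pass; otherwise allow one
    # step per expected call plus one for planning and one for reflection.
    wanted = 1 if expected_tool_calls == 0 else expected_tool_calls + 2
    affordable = min(
        budget.max_cost_cents // cost_per_step_cents,
        budget.max_latency_ms // latency_per_step_ms,
    )
    return max(1, min(wanted, affordable))


tight = TurnBudget(max_cost_cents=10, max_latency_ms=8000)
generous = TurnBudget(max_cost_cents=100, max_latency_ms=60000)

faq_steps = choose_max_steps(0, tight)
triage_steps_tight = choose_max_steps(5, tight)
triage_steps_generous = choose_max_steps(5, generous)
```

The FAQ gets one pass, while the five-call triage gets the full loop only when the budget allows it.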

Reflection and validation

Production runtimes increasingly run a self-evaluation stage after generation: hallucination detection, citation verification, policy checks, confidence scoring, response quality grading. This is not bolted-on safety — it is the stage that turns an unreliable single-shot generation into an output the runtime can stand behind. Failed validations route to retry, escalate, or refuse with explanation.
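
The routing described above (retry, escalate, or refuse) can be sketched as a function over named checks. The checks, the confidence threshold, and the rule for choosing between escalate and refuse are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Draft:
    text: str
    citations: list
    confidence: float


def has_citation(draft: Draft) -> bool:
    return len(draft.citations) > 0


def confident_enough(draft: Draft) -> bool:
    return draft.confidence >= 0.7  # illustrative threshold


def passes_policy(draft: Draft) -> bool:
    return "password" not in draft.text.lower()  # toy policy rule


CHECKS = {
    "citation": has_citation,
    "confidence": confident_enough,
    "policy": passes_policy,
}


def validate(draft: Draft, attempt: int, max_attempts: int = 2):
    """Return (action, failed_check_names) for one generated draft."""
    failed = [name for name, check in CHECKS.items() if not check(draft)]
    if not failed:
        return "deliver", failed
    if attempt < max_attempts:
        return "retry", failed
    if "policy" in failed:
        return "refuse", failed
    return "escalate", failed


good = Draft("Revenue rose 4%.", ["report-q3"], 0.9)
uncited = Draft("Revenue rose 4%.", [], 0.9)
leaky = Draft("The admin password is hunter2.", ["wiki"], 0.9)
```

Returning the names of the failed checks alongside the action gives the runtime what it needs to explain a refusal.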

Multi-client rendering

One brain, many surfaces. The same orchestration result renders as plain text in a shell, markdown in a chat client, HTML widgets in a web app, structured JSON for an API consumer, a non-streaming summary in an email, a multimedia card with charts on a dashboard, or a voice utterance with SSML over a phone call. The runtime negotiates capabilities with each client at session start; the rendering layer adapts.
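
A minimal sketch of capability-driven rendering, assuming each client declares a small capability record at session start. The ClientCapabilities fields and the result shape are hypothetical, and only three of the surfaces named above are covered.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ClientCapabilities:
    markdown: bool = False
    json_only: bool = False


def render(result: dict, caps: ClientCapabilities) -> str:
    """Render one orchestration result for whatever the client supports."""
    if caps.json_only:
        return json.dumps(result, sort_keys=True)
    if caps.markdown:
        bullets = "\n".join(f"- {item}" for item in result["items"])
        return f"**{result['title']}**\n{bullets}"
    # Plain-text fallback for shells and other minimal clients.
    lines = [result["title"].upper()] + [f"* {item}" for item in result["items"]]
    return "\n".join(lines)


result = {"title": "Churn risk", "items": ["2 failed payments", "1 open ticket"]}

shell_output = render(result, ClientCapabilities())
chat_output = render(result, ClientCapabilities(markdown=True))
api_output = render(result, ClientCapabilities(json_only=True))
```

The orchestration result is computed once; only the last step varies by client.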

Guardrails at every layer

Guardrails are not a final filter. They operate at five enforcement points: pre-input (block disallowed prompts), retrieval (filter sources the caller is not authorized to see), tool execution (scope and permission checks), generation (policy-aware decoding), and final delivery (output classifiers). Policies vary by user role, intent class, client platform, enterprise rules, and regulatory constraints, and the runtime is the single place those policies live.
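
Keeping policies in one place can be sketched as a single table keyed by role and enforcement point, consulted by every layer. The table contents, the tag names, and the roles are invented for illustration; a real policy would also key on intent class, client platform, and regulatory context.

```python
from enum import Enum


class Point(Enum):
    PRE_INPUT = "pre_input"
    RETRIEVAL = "retrieval"
    TOOL_EXECUTION = "tool_execution"
    GENERATION = "generation"
    DELIVERY = "delivery"


# Hypothetical policy table: (role, enforcement point) -> blocked tags.
# The role "*" applies to every caller.
POLICY = {
    ("*", Point.PRE_INPUT): {"prompt_injection"},
    ("analyst", Point.RETRIEVAL): {"hr_records"},
    ("analyst", Point.TOOL_EXECUTION): {"write_access"},
    ("*", Point.DELIVERY): {"pii"},
}


def allowed(point: Point, role: str, tags: set) -> bool:
    """True if none of the content's tags are blocked for this role here."""
    blocked = POLICY.get(("*", point), set()) | POLICY.get((role, point), set())
    return tags.isdisjoint(blocked)
```

For example, `allowed(Point.RETRIEVAL, "analyst", {"hr_records"})` is false, while the same call for a role with no retrieval restriction is true.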

Memory architectures

Modern runtimes maintain layered memory: short-term conversational memory (the current turn's working set), episodic memory (prior turns in this conversation), semantic long-term memory (facts learned across sessions), user profile memory (preferences, role, history), workspace memory (the entities relevant to the current task), and execution state memory (in-flight tool calls and partial results). Each layer has its own retention, retrieval policy, and access boundary.
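
The statement that each layer has its own retention and access boundary can be sketched with one small class. The layer names come from the text; the TTL values, reader roles, and the MemoryLayer class itself are assumptions. Time is passed in explicitly so the behaviour is easy to follow.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MemoryLayer:
    name: str
    ttl_seconds: Optional[float]  # None means the layer never expires items
    readers: set                  # roles allowed to read this layer
    items: list = field(default_factory=list)

    def write(self, text: str, now: float) -> None:
        self.items.append((now, text))

    def read(self, role: str, now: float) -> list:
        """Return unexpired items, or nothing if the role has no access."""
        if role not in self.readers:
            return []
        return [
            text
            for written_at, text in self.items
            if self.ttl_seconds is None or now - written_at <= self.ttl_seconds
        ]


short_term = MemoryLayer("short_term", ttl_seconds=600, readers={"runtime"})
profile = MemoryLayer(
    "user_profile", ttl_seconds=None, readers={"runtime", "support_agent"}
)

short_term.write("user asked about churn risk", now=0)
profile.write("prefers concise answers", now=0)
```

Reading short-term memory an hour later returns nothing, while the profile entry is still available to both permitted roles.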

Conclusion

The future of conversational AI is not bigger models. It is more capable runtimes. The orchestration layer — context, retrieval, tools, reasoning, reflection, memory, rendering, guardrails — is the operating system of AI. The model is one interchangeable component inside it.
