AI Agent Orchestration

Without an explicit orchestration layer, multi-agent systems devolve into infinite loops, redundant API calls, and non-deterministic failures within minutes of hitting a real-world edge case.

TL;DR
  • Model the workflow as an explicit state machine with hard iteration limits (10–20 steps max) — agents without termination budgets will exhaust your LLM quota in minutes.
  • Centralized orchestrators simplify state and debugging but become a bottleneck; peer choreography scales better but loses global visibility.
  • Treat every agent output as untrusted: validate schema, content, and semantic correctness before passing to the next agent.
  • Blind tool chaining without idempotency guards causes duplicate transactions, unauthorized data access, and irreversible side effects.

The Problem

Consider an illustrative failure: a research agent calls a web search tool and gets ambiguous results, so it asks a summarizer agent to clarify. The summarizer re-triggers the search agent, and the two loop indefinitely, burning $40 in API costs in 90 seconds before anyone notices. This is the defining failure mode of multi-agent systems: without a coordination layer, agents misinterpret hand-off state, re-attempt completed subtasks, and wait on inputs that will never arrive. The system produces non-deterministic outputs, debugging becomes intractable (which agent diverged, and when?), and costs become impossible to control at scale.

Core System Idea

AI agent orchestration establishes a dedicated control plane that owns the lifecycle of the entire workflow: task decomposition, agent assignment, state tracking, and termination. The orchestrator maintains the canonical workflow state and is the only component that can advance, retry, or abort a step. Two topologies exist: centralized orchestration, where one orchestrator directs all agents (simpler to debug, but a single point of failure), and choreography, where agents react to events from each other (more resilient, but harder to observe). Frameworks like LangGraph, Temporal, and AWS Step Functions encode this control flow as explicit graphs, state machines, or durable workflow definitions, making transitions auditable and reproducible. The orchestrator also owns tool call authorization: it decides which agent can call which external API, with what parameters, and under what rate limits.
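The tool call authorization described above can be sketched as a small gate that the orchestrator consults before dispatching any tool call. This is a minimal sketch, not any framework's API: the `TOOL_POLICY` table, the agent and tool names, and the `ToolGate` class are all hypothetical, and the "rate limit" is simplified to a per-workflow call count.

```python
from dataclasses import dataclass, field

# Hypothetical policy table: agent name -> tool name -> max calls per workflow.
TOOL_POLICY = {
    "research_agent": {"web_search": 3},
    "crm_agent": {"crm_lookup": 5},
}

@dataclass
class ToolGate:
    """Orchestrator-side check that runs before any tool call is dispatched."""
    policy: dict
    calls: dict = field(default_factory=dict)  # (agent, tool) -> calls so far

    def authorize(self, agent: str, tool: str) -> bool:
        limit = self.policy.get(agent, {}).get(tool)
        if limit is None:
            return False  # the tool is outside this agent's scope
        used = self.calls.get((agent, tool), 0)
        if used >= limit:
            return False  # per-workflow call budget for this tool is exhausted
        self.calls[(agent, tool)] = used + 1
        return True
```

A denied call returns `False` rather than raising, so the orchestrator can decide whether to abort the workflow or route the step elsewhere.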

System Flow

flowchart TD
    A["User Goal"] --> B["Orchestrator"]
    B --> C["Task Decomposer"]
    C --> B
    B --> D["Worker Agent A"]
    D --> E["External Tool / API"]
    E --> D
    D --> B
    B --> F["Output Validator"]
    F --> B
    B --> G["Final Response"]

Orchestrator owns all state transitions — agents report back rather than chaining directly to each other.
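One way to read the flow above is as a hub-and-spoke loop in which every handler reports back to the orchestrator, and only the orchestrator picks the next state. The sketch below assumes three hypothetical handlers (`decompose`, `work`, `validate`) that stand in for real agents; the state names and the `run` function are illustrative, not taken from a specific framework.

```python
from enum import Enum
from typing import Callable, Dict

class State(Enum):
    DECOMPOSE = "decompose"
    WORK = "work"
    VALIDATE = "validate"
    DONE = "done"
    FAILED = "failed"

# Each handler receives the shared context and returns the next state.
Handler = Callable[[dict], State]

def decompose(ctx: dict) -> State:
    ctx["tasks"] = [ctx["goal"]]  # stand-in for a real task decomposer
    return State.WORK

def work(ctx: dict) -> State:
    ctx["result"] = f"answer for {ctx['tasks'][0]}"  # stand-in for a worker agent
    return State.VALIDATE

def validate(ctx: dict) -> State:
    # Send the workflow back to WORK if the worker produced nothing usable.
    return State.DONE if ctx.get("result") else State.WORK

HANDLERS: Dict[State, Handler] = {
    State.DECOMPOSE: decompose,
    State.WORK: work,
    State.VALIDATE: validate,
}

def run(goal: str, max_steps: int = 15) -> dict:
    ctx = {"goal": goal, "history": []}
    state = State.DECOMPOSE
    steps = 0
    while state not in (State.DONE, State.FAILED):
        if steps >= max_steps:
            state = State.FAILED  # step limit hit before a terminal state
            break
        ctx["history"].append(state.value)
        state = HANDLERS[state](ctx)  # only the orchestrator advances state
        steps += 1
    ctx["final_state"] = state.value
    return ctx
```

Because handlers never call each other, the `history` list is a complete record of the transitions the workflow took.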

Real-World Examples (Indicative)

Salesforce Agentforce

Coordinates a sequence of specialized agents — a research agent, a CRM lookup agent, an email draft agent, and a human approval gate — to handle sales outreach end-to-end. Each agent has a defined tool scope and output schema; the orchestrator validates outputs against expected schemas before advancing the workflow, rejecting malformed results and triggering retries rather than propagating bad data downstream.
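The validate-then-retry pattern in this example could look roughly like the following. It is a sketch under stated assumptions, not Agentforce's implementation: `EMAIL_DRAFT_SCHEMA`, `validate_output`, and `call_with_retries` are hypothetical names, the schema check covers only required fields and their types, and the agent is modeled as any zero-argument callable that returns a JSON string.

```python
import json

# Hypothetical output schema for an email-draft agent: field name -> required type.
EMAIL_DRAFT_SCHEMA = {"to": str, "subject": str, "body": str}

def validate_output(raw: str, schema: dict) -> dict:
    """Parse an agent's raw output and check required fields and their types."""
    data = json.loads(raw)  # malformed JSON raises json.JSONDecodeError, a ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for name, expected in schema.items():
        if not isinstance(data.get(name), expected):
            raise ValueError(f"field {name!r} missing or not {expected.__name__}")
    return data

def call_with_retries(agent, schema: dict, max_attempts: int = 3) -> dict:
    """Call `agent` until its output validates or the attempt limit is reached."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return validate_output(agent(), schema)
        except ValueError as exc:
            last_error = exc  # reject and retry; nothing is passed downstream
    raise RuntimeError(
        f"agent output invalid after {max_attempts} attempts: {last_error}"
    )
```

The retry count is bounded on purpose: an agent that never produces valid output fails the step instead of looping.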

OpenAI ChatGPT with tools

The model itself acts as a lightweight orchestrator, deciding which tool to invoke (web search, code interpreter, image generation) and in what order, based on the conversation state. Tool results are injected back into context; the model re-plans after each result. Iteration is bounded by the context window filling up — a soft but real termination mechanism.

Waymo autonomous driving stack

Coordinates perception (100Hz), prediction (20Hz), and planning (10Hz) agents with strict timing contracts. The orchestration layer handles priority, conflict resolution when agents disagree, and safe fallback (pull over, stop) when any agent produces low-confidence output. Missing a deadline in one agent does not cascade — each module degrades gracefully within its own safety envelope.

Anti-Patterns

Implicit state through conversation history

Relying on agents to infer workflow state from prior messages. State is lost whenever context is truncated, causing agents to restart completed steps and producing duplicate work or duplicate side effects.
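The alternative is to keep workflow state in an explicit record that survives context truncation. The sketch below is one possible shape, assuming a hypothetical `WorkflowState` record serialized to JSON and a `run_step` helper that skips steps already marked complete; a real system would persist the JSON to a durable store.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class WorkflowState:
    workflow_id: str
    completed: list = field(default_factory=list)  # step names already finished
    outputs: dict = field(default_factory=dict)    # step name -> recorded output

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "WorkflowState":
        return cls(**json.loads(raw))

def run_step(state: WorkflowState, name: str, fn):
    """Run `fn` only if `name` has not completed; otherwise return the stored output."""
    if name in state.completed:
        return state.outputs[name]
    result = fn()
    state.outputs[name] = result
    state.completed.append(name)
    return result
```

After a restart, the orchestrator reloads the state with `WorkflowState.from_json` and replays the workflow; completed steps return their recorded outputs without re-running.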

No iteration budget

Workflows without hard step limits (e.g., max 15 tool calls) will loop until the context fills or the quota runs out. A misconfigured agent in a loop at GPT-4 rates costs ~$1/minute before anyone is paged.
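A hard step limit can be as small as a counter that raises once the limit is reached. The following is a minimal sketch: the `Budget` class, the `BudgetExceeded` exception, and the 80% warning threshold are assumptions made for illustration, and `run_looping_agent` simulates a misconfigured agent that would otherwise never stop.

```python
class BudgetExceeded(Exception):
    """Raised when a workflow or tool tries to spend past its hard limit."""

class Budget:
    def __init__(self, limit: int, warn_ratio: float = 0.8):
        self.limit = limit
        self.used = 0
        self.warn_at = int(limit * warn_ratio)  # assumed early-warning threshold
        self.warnings = []  # stand-in for a logging or alerting hook

    def spend(self, label: str) -> None:
        if self.used >= self.limit:
            raise BudgetExceeded(f"{label}: limit of {self.limit} reached")
        self.used += 1
        if self.used == self.warn_at:
            self.warnings.append(f"{label}: {self.used}/{self.limit} used")

def run_looping_agent(budget: Budget) -> int:
    """A deliberately non-terminating agent loop, stopped only by the budget."""
    turns = 0
    try:
        while True:
            budget.spend("agent_turn")
            turns += 1
    except BudgetExceeded:
        return turns
```

The same class can be instantiated once per workflow and once per tool, so a single noisy tool cannot consume the whole workflow's allowance.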

Ambiguous success criteria

Telling an agent to "improve the code until it's good" guarantees it never terminates. Every sub-task must have a binary, measurable exit condition.
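A binary exit condition can be expressed as a predicate that the orchestrator evaluates after every revision. In the sketch below, "good" is replaced by a measurable check (the draft parses as Python), and the loop is also bounded by a round limit. The names `refine_until` and `parses` are hypothetical, and `revise` stands in for a call to a code-writing agent.

```python
from typing import Callable, Tuple

def parses(src: str) -> bool:
    """Measurable exit condition: the source compiles as Python (it is not executed)."""
    try:
        compile(src, "<draft>", "exec")
        return True
    except SyntaxError:
        return False

def refine_until(
    check: Callable[[str], bool],
    revise: Callable[[str], str],
    draft: str,
    max_rounds: int = 5,
) -> Tuple[str, bool]:
    """Revise `draft` until `check` passes or `max_rounds` revisions have been made.

    Returns (final_draft, succeeded).
    """
    for _ in range(max_rounds):
        if check(draft):
            return draft, True
        draft = revise(draft)
    return draft, check(draft)
```

The function always returns, and the boolean tells the orchestrator whether to advance or to move the workflow to an explicit failure state.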

Blind tool chaining without idempotency

Allowing agents to call write APIs (send email, charge card, update DB) without orchestrator-level deduplication. Network retries cause duplicate transactions; there's no safe undo.
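Orchestrator-level deduplication is commonly built around an idempotency key derived from the workflow, the step, and the payload. The sketch below assumes a hypothetical `WriteGateway` with an in-memory result cache; a production version would need a durable store and would ideally pass the same key to the downstream API if that API supports idempotency keys.

```python
import hashlib
import json

class WriteGateway:
    """Orchestrator-side wrapper: each logical write executes at most once."""

    def __init__(self):
        self._results = {}  # idempotency key -> result of the first execution

    @staticmethod
    def key_for(workflow_id: str, step: str, payload: dict) -> str:
        # Sort keys so that logically equal payloads hash to the same key.
        canonical = json.dumps(payload, sort_keys=True)
        raw = f"{workflow_id}:{step}:{canonical}".encode()
        return hashlib.sha256(raw).hexdigest()

    def execute(self, workflow_id: str, step: str, payload: dict, write_fn):
        key = self.key_for(workflow_id, step, payload)
        if key in self._results:
            return self._results[key]  # retry: reuse the result, no second write
        result = write_fn(payload)
        self._results[key] = result
        return result
```

A retried call with the same workflow, step, and payload returns the first result instead of sending a second email or charging a card twice.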

Orchestrator as dumb router

An orchestrator that only forwards messages without tracking state or validating outputs gives you all the operational complexity of a distributed system with none of the coordination benefits.

Design Tradeoffs

| Dimension | Centralized Orchestrator | Peer Choreography |
| --- | --- | --- |
| State visibility | Single source of truth | Distributed, must correlate across agents |
| Failure blast radius | Orchestrator failure stops all work | One agent failure is isolated |
| Throughput ceiling | Bottlenecked at orchestrator | Scales horizontally |
| Debug complexity | Linear trace through one log | Requires cross-agent correlation IDs |
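Cross-agent correlation usually comes down to stamping every agent turn with the same workflow-level ID. The sketch below assumes a hypothetical `TurnTrace` record and a `traced_turn` wrapper; the agent is modeled as a callable returning `(output, tool_calls, token_count)`, which is an assumed contract rather than any library's interface, and `sink` is a plain list standing in for a log pipeline.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass

@dataclass
class TurnTrace:
    correlation_id: str  # shared by every turn in one workflow run
    agent: str
    input: str
    output: str
    tool_calls: list
    latency_ms: float
    token_count: int

def new_correlation_id() -> str:
    return uuid.uuid4().hex

def traced_turn(correlation_id, agent_name, agent_fn, prompt, sink):
    """Run one agent turn and append a JSON trace record to `sink`."""
    start = time.perf_counter()
    output, tool_calls, token_count = agent_fn(prompt)
    latency_ms = (time.perf_counter() - start) * 1000
    record = TurnTrace(
        correlation_id, agent_name, prompt, output,
        tool_calls, latency_ms, token_count,
    )
    sink.append(json.dumps(asdict(record)))
    return output
```

Filtering the log by `correlation_id` then reconstructs a single workflow run even when its turns were executed by different agents.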

Best Practices

  • Encode the workflow as an explicit state machine (LangGraph, Temporal, AWS Step Functions): named states, defined transitions, explicit terminal states for both success and failure.
  • Set hard iteration budgets per workflow (e.g., max 15 agent turns) and per tool (e.g., max 3 web search calls). Log and alert when budgets are approaching, not just when exhausted.
  • Treat all agent outputs as untrusted external input: validate JSON schema, field types, and semantic plausibility before passing to the next agent.
  • Make all tool calls idempotent or guard them with a deduplication key. The orchestrator should be the only component that initiates write operations.
  • Capture a structured trace for every agent turn: input, output, tool calls made, latency, token count. Without this, post-incident debugging is a guessing game.
  • Apply circuit breakers to external tool dependencies: if a tool fails 3× in 60 seconds, the orchestrator should fail the workflow fast rather than keep retrying.
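The circuit breaker rule (3 failures in 60 seconds) can be sketched with a sliding window of failure timestamps. This is a simplified illustration: the `CircuitBreaker` and `CircuitOpen` names are hypothetical, the clock is injectable so the behavior can be tested without waiting, and there is no half-open probing state as found in fuller implementations.

```python
import time
from collections import deque

class CircuitOpen(Exception):
    """Raised instead of calling a tool that has failed too often recently."""

class CircuitBreaker:
    def __init__(self, max_failures: int = 3, window_s: float = 60.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.window_s = window_s
        self.clock = clock
        self.failures = deque()  # timestamps of recent failures, oldest first

    def _recent_failures(self) -> int:
        cutoff = self.clock() - self.window_s
        while self.failures and self.failures[0] < cutoff:
            self.failures.popleft()  # drop failures that fell out of the window
        return len(self.failures)

    def call(self, fn, *args, **kwargs):
        if self._recent_failures() >= self.max_failures:
            raise CircuitOpen("tool failed too often; failing fast")
        try:
            return fn(*args, **kwargs)
        except Exception:
            self.failures.append(self.clock())
            raise
```

The orchestrator would catch `CircuitOpen` and move the workflow to its failure state instead of scheduling another retry.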

When to Use / Avoid

| Use When | Avoid When |
| --- | --- |
| Task requires sequential steps where each step's output feeds the next | Single-agent task fits in one LLM call |
| Multiple specialized agents must collaborate (researcher + writer + validator) | Latency budget is under 500 ms (orchestration adds overhead) |
| Workflow needs auditable history and retry-on-failure at each step | All steps are stateless and independent (use parallel fan-out instead) |
| External tools with side effects (APIs, DBs) must be called with guardrails | Workflow graph is fully static (hardcode it, don't orchestrate dynamically) |
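For the stateless, independent case, parallel fan-out needs no orchestrator at all. A minimal sketch using Python's `asyncio` is shown below; `summarize` is a hypothetical stand-in for an LLM or API call, and the concurrency limit of 4 is an arbitrary assumption.

```python
import asyncio

async def summarize(doc: str) -> str:
    await asyncio.sleep(0)  # stand-in for an LLM or API call
    return doc[:10]

async def fan_out(docs, limit: int = 4):
    """Process independent inputs concurrently, at most `limit` at a time."""
    sem = asyncio.Semaphore(limit)

    async def bounded(doc: str) -> str:
        async with sem:
            return await summarize(doc)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(d) for d in docs))
```

Calling `asyncio.run(fan_out(docs))` returns one result per input; no shared workflow state is needed because no step depends on another.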