Multi-Agent Coordination

Multi-agent systems fail at the handoff boundaries — not inside individual agents — because implicit state assumptions between agents break the moment one agent is updated independently.

TL;DR
  • Every handoff message must be a versioned, self-contained payload with a correlation ID; agents that rely on shared mutable state instead create invisible coupling.
  • Synchronous agent chains multiply latency: three agents at 200ms each = 600ms minimum end-to-end, plus network overhead and retry time on failure.
  • Trust no upstream agent's output — validate and sanitize at every ingress boundary, even from internal services. Malformed data from one agent cascades into all downstream agents.
  • Choreography scales better than centralized orchestration but loses global workflow visibility; choose based on whether debugging or throughput is the harder constraint.

The Problem

Stripe's payment flow involves fraud detection, authorization, ledger update, and notification agents running in sequence. The fraud agent was updated to return a new risk_score field — but the authorization agent was reading risk_level, the old field name. The authorization agent failed to detect high-risk transactions for 4 hours before the discrepancy surfaced in an alert. This is the canonical multi-agent failure: implicit contracts between agents break silently when one agent evolves independently. At scale, teams own individual agents across different services and release cycles — without explicit, versioned interfaces, any deployment can break the chain.

Core System Idea

Multi-agent coordination requires explicit, versioned contracts between agents, not implicit shared knowledge. Agents communicate via durable message queues (Kafka, SQS, RabbitMQ) or event streams, exchanging self-contained, versioned payloads. Each payload includes a correlation ID for distributed tracing, a schema version for backward compatibility, and all the state the receiving agent needs, so no agent depends on a shared mutable store.

There are two coordination topologies. In orchestration, a central controller owns the workflow state and directs each agent; this is simpler to debug but makes the controller a single point of failure. In choreography, agents react to events from each other; this is more resilient but requires robust event infrastructure and more sophisticated per-agent logic.

Schema registries (Confluent Schema Registry, AWS Glue Schema Registry) check compatibility when a new schema version is registered, so a breaking change is rejected before deployment instead of surfacing at runtime. Temporal and AWS Step Functions are widely used orchestration frameworks for durable, auditable multi-agent workflows.
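A self-contained, versioned handoff payload can be sketched as a small immutable envelope. This is a minimal illustration, not a prescribed wire format: the `HandoffMessage` class, its field names, and the `new_request` / `next_hop` helpers are all hypothetical, and a production system would more likely define the schema in Protobuf or Avro.

```python
import json
import uuid
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class HandoffMessage:
    """Self-contained payload passed between agents (hypothetical shape)."""
    schema_version: str   # receivers use this to pick a compatible parser
    correlation_id: str   # same value on every hop of one request
    producer: str         # name of the emitting agent
    payload: dict         # all state the receiver needs, no external lookups

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @staticmethod
    def from_json(raw: str) -> "HandoffMessage":
        return HandoffMessage(**json.loads(raw))


def new_request(producer: str, payload: dict) -> HandoffMessage:
    """Create the first message of a workflow and mint its correlation ID."""
    return HandoffMessage("1.0", str(uuid.uuid4()), producer, payload)


def next_hop(prev: HandoffMessage, producer: str, payload: dict) -> HandoffMessage:
    """Build the next message, carrying the correlation ID forward unchanged."""
    return HandoffMessage(prev.schema_version, prev.correlation_id, producer, payload)
```

Because the envelope is frozen and carries everything the receiver needs, an agent can be replayed from the message alone, without reading state that another agent may have changed in the meantime.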

System Flow

flowchart TD
    A["Request Ingress"] --> B["Orchestrator"]
    B --> C["Agent A: Validate"]
    C --> H["Shared State Store"]
    C --> D["Agent B: Process"]
    D --> H
    D --> E["Agent C: Persist"]
    E --> F["Response"]
    C --> G["Dead Letter Queue"]
    D --> G

The orchestrator sequences the agents. Agents A and B each write a versioned payload to the state store, and their failures route to a dead letter queue (DLQ) instead of propagating forward.
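The flow above can be approximated with a small in-process orchestrator. This is a sketch under simplifying assumptions: the steps are plain functions, the dead letter queue is a Python list, and `run_workflow`, `validate`, and `process` are hypothetical names standing in for real agents and a real broker.

```python
from typing import Callable, List, Optional, Tuple

Step = Callable[[dict], dict]


def run_workflow(
    message: dict,
    steps: List[Tuple[str, Step]],
    dead_letters: List[dict],
) -> Optional[dict]:
    """Run steps in order; on failure, park the message in the DLQ and stop."""
    current = message
    for name, step in steps:
        try:
            current = step(current)
        except Exception as exc:
            # Record which step failed and the exact input it received,
            # so the message can be inspected and replayed later.
            dead_letters.append(
                {"failed_step": name, "error": str(exc), "message": current}
            )
            return None
    return current


def validate(msg: dict) -> dict:
    if msg.get("amount", 0) <= 0:
        raise ValueError("amount must be positive")
    return {**msg, "validated": True}


def process(msg: dict) -> dict:
    return {**msg, "processed": True}


STEPS = [("validate", validate), ("process", process)]
```

A failing step stops the workflow at that point, so downstream agents never see the bad message.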

Real-World Examples (Indicative)

Stripe's payment processing

Fraud detection, card authorization, payment gateway interaction, and ledger update agents each own a phase of the transaction. A canonical PaymentTransaction object (versioned Protobuf schema) is passed through each step. Idempotency keys at each write step mean that if the ledger agent is retried after a timeout, it doesn't double-credit the account. Schema changes go through a compatibility check before deployment — no agent can emit a breaking schema change without a versioned migration path.

Netflix content pipeline

When a title is uploaded, specialized agents handle transcoding (producing 15+ quality variants), DRM packaging, CDN distribution, and metadata indexing. These run as choreographed, event-driven workflows: the transcoder emits a TranscodingComplete event, which the DRM packager consumes. Agents scale independently; the transcoding fleet can run 10,000 concurrent jobs without the DRM fleet needing to match that capacity.
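The choreography pattern described here can be illustrated with an in-memory event bus. This is only a sketch: `EventBus` stands in for a durable event stream, the `DrmPackaged` event name and the handler functions are hypothetical, and nothing here reflects how any real pipeline is implemented.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[dict], None]


class EventBus:
    """In-memory stand-in for a durable event stream (illustration only)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, event: dict) -> None:
        # Each agent reacts on its own; no central controller decides the order.
        for handler in list(self._subscribers[event_type]):
            handler(event)


bus = EventBus()
packaged: List[str] = []
indexed: List[str] = []


def drm_packager(event: dict) -> None:
    packaged.append(event["title_id"])
    bus.publish("DrmPackaged", {"title_id": event["title_id"]})  # hypothetical event


def metadata_indexer(event: dict) -> None:
    indexed.append(event["title_id"])


bus.subscribe("TranscodingComplete", drm_packager)
bus.subscribe("DrmPackaged", metadata_indexer)
```

Publishing a single `TranscodingComplete` event triggers the packager, which in turn triggers the indexer. The workflow exists only as the chain of subscriptions, which is why global visibility is harder to obtain than with an orchestrator.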

Uber's dispatch system

Matching, pricing, and ETA agents coordinate for every ride request. Each runs on its own service with strict latency budgets: matching must respond in <500ms, pricing in <200ms, ETA in <300ms. The coordination layer uses timeouts and cached fallbacks — if the ETA agent is slow, dispatch proceeds with a cached estimate rather than blocking the entire flow. Agents are designed to degrade independently, not fail together.
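A timeout with a cached fallback might look like the following sketch, which uses `asyncio.wait_for` from the Python standard library. The `slow_eta_agent` function, the `CACHED_ETA_SECONDS` table, and the returned values are hypothetical; only the 300ms default budget is taken from the text above.

```python
import asyncio

# Hypothetical last-known-good estimates, keyed by zone.
CACHED_ETA_SECONDS = {"downtown": 420}


async def slow_eta_agent(zone: str, delay: float) -> int:
    """Stand-in for a remote ETA agent; `delay` simulates its response time."""
    await asyncio.sleep(delay)
    return 300


async def eta_with_fallback(zone: str, delay: float, budget: float = 0.3) -> tuple:
    """Return (eta_seconds, source); use the cache if the budget is exceeded."""
    try:
        eta = await asyncio.wait_for(slow_eta_agent(zone, delay), timeout=budget)
        return eta, "live"
    except asyncio.TimeoutError:
        # Degrade instead of blocking the whole dispatch flow.
        return CACHED_ETA_SECONDS[zone], "cached"
```

Returning the source alongside the value lets the caller log or surface how often it is serving degraded results.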

Anti-Patterns

Implicit state through shared mutable stores

When agents read from a shared Redis key that another agent updates directly, the coupling between them is invisible. If Agent A changes the key schema, Agent B breaks with no warning at deployment time, only at runtime.

Non-idempotent handoffs

A payment agent that doesn't deduplicate on retry charges the card twice. Every cross-agent write must carry an idempotency key derived from the input, not generated fresh on each call.
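One way to derive an idempotency key from the input is to hash a canonical serialization of the request. The sketch below assumes the payload carries a unique transaction ID (otherwise two legitimately distinct but identical requests would collide); `idempotency_key` and the toy `Ledger` class are hypothetical names for illustration.

```python
import hashlib
import json


def idempotency_key(operation: str, payload: dict) -> str:
    """Derive a stable key from the input so retries map to the same key."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{operation}:{canonical}".encode()).hexdigest()


class Ledger:
    """Toy ledger that applies each keyed write at most once."""

    def __init__(self) -> None:
        self.balance = 0
        self._applied: dict = {}  # key -> balance returned for that write

    def credit(self, payload: dict) -> int:
        key = idempotency_key("credit", payload)
        if key in self._applied:
            # Retry of a write we already applied: return the earlier result.
            return self._applied[key]
        self.balance += payload["amount"]
        self._applied[key] = self.balance
        return self.balance
```

A key generated fresh on each call (for example, a new UUID per attempt) would defeat the deduplication, because every retry would look like a new write.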

Synchronous blocking chains

Calling Agent B synchronously inside Agent A's request handler couples their availability. If Agent B is slow or down, Agent A's response time degrades or fails. Use async handoffs via queues; Agent A publishes and returns immediately.
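The cost of a synchronous chain can be made concrete with a little arithmetic. The sketch below assumes agents are called strictly in sequence and fail independently; the function names and the 99.9% availability figure are illustrative, not taken from any real system.

```python
import math


def chain_latency_ms(latencies_ms: list) -> float:
    """Minimum end-to-end latency of a sequential chain, ignoring network overhead."""
    return sum(latencies_ms)


def chain_availability(availabilities: list) -> float:
    """Availability of a chain in which every agent must succeed (independent failures)."""
    return math.prod(availabilities)


latency = chain_latency_ms([200, 200, 200])           # three agents at 200ms each
availability = chain_availability([0.999, 0.999, 0.999])
```

Under these assumptions, three agents at 200ms each give a 600ms floor, and three agents at 99.9% availability give a chain that is available only about 99.7% of the time.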

Undefined trust at ingress boundaries

An agent that accepts any input from an upstream agent without schema validation will propagate malformed data — or be exploitable via prompt injection if the payload contains LLM inputs. Validate at every boundary, every time.
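Ingress validation can be as simple as checking required fields, types, and ranges before any business logic runs. The following is a hand-rolled sketch: the field set, the supported versions, and the 0 to 1 range for `risk_score` are assumptions for illustration, and a real system would more likely validate against a registered schema.

```python
class InvalidHandoff(ValueError):
    """Raised when an upstream message fails ingress validation."""


# Hypothetical contract for messages arriving at an authorization agent.
REQUIRED_FIELDS = {"schema_version": str, "correlation_id": str, "risk_score": float}
SUPPORTED_VERSIONS = {"1.0", "1.1"}


def validate_ingress(message: dict) -> dict:
    """Return a sanitized copy of the message, or raise InvalidHandoff."""
    for name, expected in REQUIRED_FIELDS.items():
        if name not in message:
            raise InvalidHandoff(f"missing field: {name}")
        if not isinstance(message[name], expected):
            raise InvalidHandoff(f"{name} must be {expected.__name__}")
    if message["schema_version"] not in SUPPORTED_VERSIONS:
        raise InvalidHandoff(f"unsupported schema_version: {message['schema_version']}")
    if not 0.0 <= message["risk_score"] <= 1.0:
        raise InvalidHandoff("risk_score out of range")
    # Drop fields the contract does not name so they cannot leak downstream.
    return {name: message[name] for name in REQUIRED_FIELDS}
```

With a check like this, a message that still carries the old `risk_level` field is rejected loudly at the boundary instead of being processed with a missing score.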

No distributed tracing across agents

Without a shared correlation ID threaded through every agent's logs and spans, post-incident debugging requires correlating timestamps across 5 different log streams. Attach a trace ID at ingress and propagate it through every handoff message.
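Threading a correlation ID through log lines can be sketched with Python's standard `contextvars` and `logging` modules. The logger name, the log format, and the `handle_ingress` function are hypothetical; a production setup would typically rely on a tracing library rather than a hand-written filter.

```python
import contextvars
import logging
import uuid

# Holds the correlation ID for the request currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Stamp every log record with the current correlation ID."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


def build_logger(name: str) -> logging.Logger:
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(name)s cid=%(correlation_id)s %(message)s")
    )
    handler.addFilter(CorrelationFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


log = build_logger("authorization-agent")  # hypothetical agent name


def handle_ingress(incoming: dict) -> dict:
    """Adopt the upstream correlation ID, or mint one at the system entry point."""
    cid = incoming.get("correlation_id") or str(uuid.uuid4())
    correlation_id.set(cid)
    log.info("received handoff")
    # The outgoing message carries the same ID to the next agent.
    return {"correlation_id": cid, "body": incoming.get("body")}
```

Every agent that follows the same convention produces log lines that can be joined on a single ID instead of on timestamps.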

Design Tradeoffs

| Dimension | Centralized Orchestration | Choreography |
| --- | --- | --- |
| State visibility | Single source of truth | Must correlate across event logs |
| Single point of failure | Yes (orchestrator) | No |
| Debug complexity | One trace through one log | Distributed correlation required |
| Horizontal scale | Bounded by orchestrator capacity | Scales naturally per-agent |
| Best for | Complex conditional workflows | High-throughput event-driven pipelines |

Best Practices

  • Version every inter-agent message schema and enforce backward compatibility with a schema registry. Reject breaking changes at CI time, not at runtime.
  • Assign a correlation ID at the system entry point and propagate it through every handoff message, log line, and span. This is the single most valuable debugging investment for multi-agent systems.
  • Make every agent's ingress handler idempotent: re-processing the same message must produce the same result. Test this explicitly; don't assume it.
  • Use async handoffs via durable queues (Kafka, SQS) rather than synchronous HTTP calls between agents. Queues decouple availability, absorb traffic spikes, and provide replay on failure.
  • Set explicit timeouts on every cross-agent call and define the degraded behavior when the timeout fires (cached response, skip step, escalate to DLQ) rather than blocking indefinitely.
  • Monitor inter-agent queue depth and handoff latency as primary SLIs. A growing queue depth is the earliest signal that a downstream agent is struggling, before it manifests as end-to-end timeout.
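Rejecting breaking changes at CI time can be illustrated with a toy compatibility check. This is a deliberate simplification of what a schema registry does: schemas are modeled as plain `{field_name: type_name}` dictionaries, only removals and type changes count as breaking, and `breaking_changes` is a hypothetical helper rather than any registry's API.

```python
def breaking_changes(old_fields: dict, new_fields: dict) -> list:
    """List changes that would break readers still expecting the old schema.

    Schemas are modeled as {field_name: type_name}. Adding a field is treated
    as safe; removing or retyping an existing field is treated as breaking.
    """
    problems = []
    for name, type_name in old_fields.items():
        if name not in new_fields:
            problems.append(f"removed field: {name}")
        elif new_fields[name] != type_name:
            problems.append(
                f"changed type of {name}: {type_name} -> {new_fields[name]}"
            )
    return problems


OLD = {"txn_id": "string", "risk_level": "string"}
RENAMED = {"txn_id": "string", "risk_score": "double"}                         # rename = removal
ADDITIVE = {"txn_id": "string", "risk_level": "string", "risk_score": "double"}
```

Run as a CI step, a check like this flags the rename of `risk_level` to `risk_score` before deployment, while the additive variant, which keeps the old field alongside the new one, passes.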

When to Use / Avoid

| Use When | Avoid When |
| --- | --- |
| Workflow has distinct phases owned by different teams | Two services with a simple request-response contract |
| Agents need to scale independently (transcoding vs. DRM) | Strict sub-100ms latency (coordination adds overhead) |
| Partial failures must be recoverable without restarting the whole flow | The entire pipeline can live in one service without operational cost |
| Audit trail of each agent's input and output is required | Teams lack the operational maturity to run distributed tracing across services |