Multi-Agent Coordination
Multi-agent systems fail at the handoff boundaries — not inside individual agents — because implicit state assumptions between agents break the moment one agent is updated independently.
- Every handoff message must be a versioned, self-contained payload with a correlation ID; agents that rely on shared mutable state instead create invisible coupling.
- Synchronous agent chains add up latency: three agents at 200 ms each take at least 600 ms end-to-end, before network overhead and any retry time on failure.
- Trust no upstream agent's output — validate and sanitize at every ingress boundary, even from internal services. Malformed data from one agent cascades into all downstream agents.
- Choreography scales better than centralized orchestration but loses global workflow visibility; choose based on whether debugging or throughput is the harder constraint.
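The latency arithmetic for synchronous chains can be made concrete with a small calculation. This is a minimal sketch: the function name `chain_latency_ms`, the fixed per-hop overhead, and the retry model are illustrative assumptions, not measurements from any real system.

```python
def chain_latency_ms(agent_ms, network_overhead_ms=0.0, retries=0):
    """Estimate end-to-end latency for a synchronous agent chain.

    agent_ms: per-agent processing times in milliseconds.
    network_overhead_ms: assumed fixed cost per hop (hypothetical figure).
    retries: assumed number of extra attempts made by every agent.
    """
    attempts = 1 + retries
    return sum((ms + network_overhead_ms) * attempts for ms in agent_ms)


# Three agents at 200 ms each, no overhead, no retries.
baseline = chain_latency_ms([200, 200, 200])

# Same chain with an assumed 10 ms per hop and one retry per agent.
worst_case = chain_latency_ms([200, 200, 200], network_overhead_ms=10, retries=1)
```

Because every term is additive, each agent added to a synchronous chain raises the floor for the whole request.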
The Problem
Stripe's payment flow involves fraud detection, authorization, ledger update, and notification agents running in sequence. The fraud agent was updated to return a new risk_score field — but the authorization agent was reading risk_level, the old field name. The authorization agent failed to detect high-risk transactions for 4 hours before the discrepancy surfaced in an alert. This is the canonical multi-agent failure: implicit contracts between agents break silently when one agent evolves independently. At scale, teams own individual agents across different services and release cycles — without explicit, versioned interfaces, any deployment can break the chain.
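One plausible way a field rename like this goes unnoticed is a consumer that falls back to a default when the expected field is missing. The sketch below is hypothetical: the function names, the "low" default, and the sample payload are assumptions made for illustration, not the actual implementation of any payment system.

```python
def authorize_lenient(fraud_result: dict) -> bool:
    # Anti-pattern: a missing field silently falls back to a "safe" default,
    # so a renamed field upstream turns every transaction into low risk.
    return fraud_result.get("risk_level", "low") != "high"


def authorize_strict(fraud_result: dict) -> bool:
    # Fail loudly when the field the contract promises is absent.
    if "risk_level" not in fraud_result:
        raise KeyError("fraud result is missing required field 'risk_level'")
    return fraud_result["risk_level"] != "high"


# Payload shape after the fraud agent's rename (values are made up).
renamed = {"transaction_id": "txn-1", "risk_score": 0.97}

lenient_decision = authorize_lenient(renamed)  # approves despite the high score
```

The strict variant turns the same deployment into an immediate, visible error instead of hours of silent approvals.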
Core System Idea
Multi-agent coordination requires explicit, versioned contracts between agents, not implicit shared knowledge. Agents communicate via durable message queues (Kafka, SQS, RabbitMQ) or event streams, exchanging self-contained, versioned payloads. Each payload includes a correlation ID for distributed tracing, a schema version for backward compatibility, and all the state the receiving agent needs, rather than leaving that state in a shared mutable store.
There are two coordination topologies. In orchestration, a central controller owns the workflow state and directs each agent; this is simpler to debug but introduces a single point of failure. In choreography, agents react to each other's events; this is more resilient but requires robust event infrastructure and more sophisticated per-agent logic.
Schema registries (Confluent Schema Registry, AWS Glue Schema Registry) check contract compatibility when a schema is registered or a message is published, so a breaking change is rejected before any consumer reads it. Temporal and AWS Step Functions are widely used orchestration frameworks for durable, auditable multi-agent workflows.
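A self-contained, versioned payload can be sketched as a small envelope type. The class name `HandoffMessage`, its field names, and the `forward` helper are assumptions for illustration; a production system would more likely define the envelope in a registry-managed schema format such as Protobuf or Avro.

```python
import json
import uuid
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class HandoffMessage:
    """Hypothetical envelope exchanged between agents."""
    schema_version: str   # lets consumers reject or adapt to unknown versions
    correlation_id: str   # threaded through every hop for tracing
    payload: dict         # all state the receiving agent needs

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "HandoffMessage":
        data = json.loads(raw)
        return cls(data["schema_version"], data["correlation_id"], data["payload"])

    def forward(self, new_payload: dict) -> "HandoffMessage":
        # The next hop gets a new payload but keeps the same correlation ID.
        return HandoffMessage(self.schema_version, self.correlation_id, new_payload)


first = HandoffMessage("1.0", str(uuid.uuid4()),
                       {"transaction_id": "txn-1", "amount_cents": 1250})
second = first.forward({"transaction_id": "txn-1", "risk_score": 0.12})
round_tripped = HandoffMessage.from_json(first.to_json())
```

Keeping the correlation ID in the envelope rather than in the payload means every agent can propagate it without understanding the business fields.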
System Flow
The orchestrator sequences the agents; each agent writes a versioned payload to the orchestrator-owned workflow state; failures are routed to a dead-letter queue (DLQ) rather than propagated forward.
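The flow above can be sketched as a loop that runs each agent in order and diverts failures to a dead-letter queue. This is an in-memory illustration under stated assumptions: `run_workflow`, the three agent functions, and the `deque` standing in for a DLQ are all hypothetical, and a durable engine such as Temporal or Step Functions would persist state between steps.

```python
from collections import deque


def run_workflow(message, steps, dead_letter_queue):
    """Run (name, agent) steps in order; stop and dead-letter on first failure."""
    state = dict(message)
    for name, agent in steps:
        try:
            state = agent(state)
        except Exception as exc:
            # Park the failed message with context instead of passing it on.
            dead_letter_queue.append(
                {"failed_step": name, "error": str(exc), "message": state}
            )
            return None
    return state


def fraud_check(state):
    return {**state, "risk_score": 0.1}


def authorize(state):
    if state["amount_cents"] <= 0:
        raise ValueError("amount must be positive")
    return {**state, "authorized": True}


def update_ledger(state):
    return {**state, "ledger_updated": True}


STEPS = [("fraud", fraud_check), ("authorize", authorize), ("ledger", update_ledger)]
dlq = deque()

ok = run_workflow({"schema_version": "1.0", "correlation_id": "c-1",
                   "amount_cents": 500}, STEPS, dlq)
bad = run_workflow({"schema_version": "1.0", "correlation_id": "c-2",
                    "amount_cents": 0}, STEPS, dlq)
```

The failed message never reaches the ledger step, and the DLQ entry records which step failed and what the state was at that point.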
Real-World Examples (Indicative)
Fraud detection, card authorization, payment gateway interaction, and ledger update agents each own a phase of the transaction. A canonical PaymentTransaction object (versioned Protobuf schema) is passed through each step. Idempotency keys at each write step mean that if the ledger agent is retried after a timeout, it doesn't double-credit the account. Schema changes go through a compatibility check before deployment — no agent can emit a breaking schema change without a versioned migration path.
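The retry-safe ledger write can be sketched as follows. The `Ledger` class is an in-memory stand-in invented for illustration; a real ledger would record the idempotency key and the balance change in the same database transaction.

```python
class Ledger:
    """In-memory stand-in for a ledger store (hypothetical)."""

    def __init__(self):
        self.balances = {}
        self._applied = {}  # idempotency key -> balance after that write

    def credit(self, idempotency_key, account, amount_cents):
        # A key seen before means this is a retry: return the earlier result
        # without applying the credit a second time.
        if idempotency_key in self._applied:
            return self._applied[idempotency_key]
        new_balance = self.balances.get(account, 0) + amount_cents
        self.balances[account] = new_balance
        self._applied[idempotency_key] = new_balance
        return new_balance


ledger = Ledger()
first_attempt = ledger.credit("txn-1:ledger", "acct-42", 1250)
retry_after_timeout = ledger.credit("txn-1:ledger", "acct-42", 1250)
```

The retry returns the same balance as the first attempt, so the caller cannot tell the difference and the account is credited once.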
When a title is uploaded, specialized agents handle transcoding (producing 15+ quality variants), DRM packaging, CDN distribution, and metadata indexing. These run as choreographed, event-driven workflows: the transcoder emits a TranscodingComplete event, which the DRM packager consumes. Agents scale independently; the transcoding fleet can run 10,000 concurrent jobs without the DRM fleet needing to match that capacity.
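Choreography of this kind can be sketched with a tiny publish/subscribe bus. Only the `TranscodingComplete` event name comes from the text; the `EventBus` class, the other event names, and the handlers are assumptions, and the synchronous in-process bus stands in for a durable broker such as Kafka.

```python
from collections import defaultdict


class EventBus:
    """Minimal in-process pub/sub (hypothetical stand-in for a broker)."""

    def __init__(self):
        self._subscribers = defaultdict(list)
        self.log = []  # every published event, in order

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, event):
        self.log.append((event_type, event))
        for handler in list(self._subscribers[event_type]):
            handler(event)


bus = EventBus()


def drm_packager(event):
    # Reacts to the transcoder's event; knows nothing about who comes next.
    bus.publish("DrmPackagingComplete", {"title_id": event["title_id"]})


def cdn_distributor(event):
    bus.publish("DistributionComplete", {"title_id": event["title_id"]})


bus.subscribe("TranscodingComplete", drm_packager)
bus.subscribe("DrmPackagingComplete", cdn_distributor)

bus.publish("TranscodingComplete", {"title_id": "title-1", "variants": 15})
published_types = [event_type for event_type, _ in bus.log]
```

No component holds the whole workflow: the sequence emerges from subscriptions, which is why reconstructing it afterwards requires correlating the event log.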
Matching, pricing, and ETA agents coordinate for every ride request. Each runs on its own service with strict latency budgets: matching must respond in <500ms, pricing in <200ms, ETA in <300ms. The coordination layer uses timeouts and cached fallbacks — if the ETA agent is slow, dispatch proceeds with a cached estimate rather than blocking the entire flow. Agents are designed to degrade independently, not fail together.
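The timeout-with-cached-fallback behavior can be sketched with `asyncio.wait_for`. The helper `with_fallback`, the two simulated ETA agents, and the cached value are hypothetical, and the budget is shortened to 50 ms so the example runs quickly.

```python
import asyncio


async def with_fallback(call, timeout_s, fallback):
    """Await call(); return the fallback if it exceeds its latency budget."""
    try:
        return await asyncio.wait_for(call(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return fallback


async def slow_eta_agent():
    await asyncio.sleep(0.5)  # simulates an agent blowing through its budget
    return {"eta_minutes": 4, "source": "live"}


async def fast_eta_agent():
    return {"eta_minutes": 4, "source": "live"}


CACHED_ETA = {"eta_minutes": 6, "source": "cache"}  # assumed stale-but-usable value

slow_result = asyncio.run(with_fallback(slow_eta_agent, 0.05, CACHED_ETA))
fast_result = asyncio.run(with_fallback(fast_eta_agent, 0.05, CACHED_ETA))
```

Tagging the result with its source lets downstream agents and dashboards see how often the flow is running on cached estimates.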
Anti-Patterns
When agents read from a shared Redis key that another agent updates directly, they become invisibly coupled. If Agent A changes the key schema, Agent B breaks at runtime, with no warning at deployment time.
A payment agent that doesn't deduplicate on retry charges the card twice. Every cross-agent write must carry an idempotency key derived from the input, not generated fresh on each call.
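Deriving the key from the input can be sketched by hashing a canonical serialization of the request. The function name `idempotency_key` and the sample fields are assumptions; which fields belong in the key is a design decision, and the input should include something like a transaction ID so that two genuinely separate charges do not collide.

```python
import hashlib
import json


def idempotency_key(operation: str, payload: dict) -> str:
    """Deterministic key: the same operation and input always hash the same."""
    # Sorting keys and fixing separators makes the serialization canonical,
    # so field order in the incoming dict does not change the key.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{operation}:{canonical}".encode("utf-8")).hexdigest()


charge = {"transaction_id": "txn-1", "card_token": "tok_abc", "amount_cents": 1250}

key_first_attempt = idempotency_key("charge", charge)
# A retry may rebuild the dict in a different field order; the key is unchanged.
key_retry = idempotency_key("charge", dict(reversed(list(charge.items()))))
# A different input produces a different key.
key_other = idempotency_key("charge", {**charge, "amount_cents": 1300})
```

A key generated fresh on each call (for example a new UUID per attempt) would make every retry look like a new request, which is exactly the double-charge scenario.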
Calling Agent B synchronously inside Agent A's request handler couples their availability. If Agent B is slow or down, Agent A's response time degrades or fails. Use async handoffs via queues; Agent A publishes and returns immediately.
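The publish-and-return handoff can be sketched with a standard-library queue and a worker thread. This is an in-process illustration only: `handoff_queue`, `agent_a_handle`, and `agent_b_worker` are hypothetical names, and a real deployment would use a durable broker so messages survive a restart.

```python
import queue
import threading

handoff_queue = queue.Queue()
processed = []  # correlation IDs Agent B has handled


def agent_b_worker():
    # Agent B consumes at its own pace, independent of Agent A's requests.
    while True:
        message = handoff_queue.get()
        try:
            if message is None:  # sentinel: shut the worker down
                return
            processed.append(message["correlation_id"])
        finally:
            handoff_queue.task_done()


def agent_a_handle(request):
    # Publish and return immediately; Agent A never waits on Agent B.
    handoff_queue.put({"correlation_id": request["correlation_id"],
                       "payload": request})
    return {"status": "accepted", "correlation_id": request["correlation_id"]}


worker = threading.Thread(target=agent_b_worker, daemon=True)
worker.start()

response = agent_a_handle({"correlation_id": "c-1", "amount_cents": 500})

handoff_queue.put(None)
handoff_queue.join()  # wait here only so the example can inspect the result
```

Agent A's response says "accepted", not "done": callers that need the final outcome must receive it through a later event or a status lookup.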
An agent that accepts any input from an upstream agent without schema validation will propagate malformed data — or be exploitable via prompt injection if the payload contains LLM inputs. Validate at every boundary, every time.
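Validation at an ingress boundary can be sketched as a function that checks presence, type, version, and range before any business logic runs. The field set, supported versions, and `InvalidHandoff` exception are assumptions for illustration. Structural checks like these do not by themselves neutralize prompt injection in free-text fields, which needs separate handling.

```python
SUPPORTED_VERSIONS = {"1.0", "1.1"}  # hypothetical versions this agent accepts
REQUIRED_FIELDS = {"schema_version": str, "correlation_id": str, "risk_score": float}


class InvalidHandoff(ValueError):
    """Raised when an upstream message violates the expected contract."""


def validate_ingress(message: dict) -> dict:
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in message:
            raise InvalidHandoff(f"missing field: {name}")
        if not isinstance(message[name], expected_type):
            raise InvalidHandoff(f"wrong type for field: {name}")
    if message["schema_version"] not in SUPPORTED_VERSIONS:
        raise InvalidHandoff(f"unsupported schema version: {message['schema_version']}")
    if not 0.0 <= message["risk_score"] <= 1.0:
        raise InvalidHandoff("risk_score out of range")
    # Pass on only the known fields so unexpected keys stop at this boundary.
    return {name: message[name] for name in REQUIRED_FIELDS}


clean = validate_ingress({
    "schema_version": "1.0",
    "correlation_id": "c-1",
    "risk_score": 0.42,
    "unexpected": "dropped at the boundary",
})
```

Rejected messages are good candidates for a dead-letter queue, since they indicate a contract violation upstream rather than a transient fault.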
Without a shared correlation ID threaded through every agent's logs and spans, post-incident debugging requires correlating timestamps across 5 different log streams. Attach a trace ID at ingress and propagate it through every handoff message.
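Attaching a trace ID at ingress and carrying it into every agent's log lines can be sketched with the standard `logging` module. The logger names, the `ingress` helper, and the log format are assumptions; a real system would typically use a tracing library and propagate the ID in message headers.

```python
import io
import logging
import uuid


def make_agent_logger(agent_name, stream):
    """Logger whose format requires a trace_id on every record."""
    logger = logging.getLogger(f"agents.{agent_name}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(name)s trace=%(trace_id)s %(message)s"))
    logger.handlers = [handler]
    return logger


def ingress(request):
    # Reuse an incoming trace ID if present; otherwise mint one exactly once.
    return {**request, "trace_id": request.get("trace_id") or str(uuid.uuid4())}


log_stream = io.StringIO()  # stands in for separate per-service log streams
fraud_log = make_agent_logger("fraud", log_stream)
auth_log = make_agent_logger("authorization", log_stream)

message = ingress({"transaction_id": "txn-1"})
fraud_log.info("scored transaction", extra={"trace_id": message["trace_id"]})
auth_log.info("authorized transaction", extra={"trace_id": message["trace_id"]})

# Post-incident: one search by trace ID finds every hop.
matching_lines = [line for line in log_stream.getvalue().splitlines()
                  if message["trace_id"] in line]
```

With the ID in every line, an investigation becomes a single filter on one value instead of timestamp matching across streams.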
Design Tradeoffs
| Dimension | Centralized Orchestration | Choreography |
|---|---|---|
| State visibility | Single source of truth | Must correlate across event logs |
| Single point of failure | Yes (orchestrator) | No |
| Debug complexity | One trace through one log | Distributed correlation required |
| Horizontal scale | Bounded by orchestrator capacity | Scales naturally per-agent |
| Best for | Complex conditional workflows | High-throughput event-driven pipelines |
Best Practices
When to Use / Avoid
| Use When | Avoid When |
|---|---|
| Workflow has distinct phases owned by different teams | Two services with a simple request-response contract |
| Agents need to scale independently (transcoding vs. DRM) | Strict sub-100ms latency — coordination adds overhead |
| Partial failures must be recoverable without restarting the whole flow | The entire pipeline can live in one service without operational cost |
| Audit trail of each agent's input and output is required | Teams lack the operational maturity to run distributed tracing across services |