Failure Modes and Recovery

What breaks first, why cascades happen, and the circuit-breaker patterns that prevent them.

Multi-agent systems fail in ways that single-agent systems don't. The failure modes are emergent — they arise from the interactions between agents, not from any single agent's behavior. Understanding them is as important as understanding how to build the happy path.

The Five Failure Modes

The five patterns below are the ones that most often turn a promising multi-agent design into an expensive cleanup exercise.

1. Cascade Failure

One agent fails. The agent that depends on its output also fails or produces bad output. The failure propagates through the dependency chain, one consumer at a time, until something breaks for the end user.

Cascade failures are particularly insidious because they're often invisible at the level where they start. The sub-agent might not even "fail" in the traditional sense — it might return an output that looks valid but is semantically wrong. The corruption only becomes visible several agents later, after it's been incorporated into multiple downstream decisions.

The fix: Validation at every inter-agent boundary. Each hand-off is a checkpoint. Don't trust a sub-agent's output just because it has a valid schema — run at least a deterministic quality check.
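As a minimal sketch of such a checkpoint, the function below runs deterministic checks on a sub-agent's output that a schema alone would not catch. The ResearchOutput shape, the validateHandoff name, and the specific thresholds are all hypothetical, chosen only to illustrate the idea.

```typescript
// Hypothetical shape of a sub-agent's output; not taken from any real API.
interface ResearchOutput {
  summary: string;
  citations: string[];
  publishedYear: number;
}

interface ValidationResult {
  ok: boolean;
  problems: string[];
}

// Deterministic checks that go beyond "does it match the schema".
function validateHandoff(output: ResearchOutput, currentYear: number): ValidationResult {
  const problems: string[] = [];

  if (output.summary.trim().length < 20) {
    problems.push("summary is too short to be meaningful");
  }
  if (output.citations.length === 0) {
    problems.push("no citations supplied");
  }
  for (const url of output.citations) {
    if (!/^https?:\/\/\S+$/.test(url)) {
      problems.push(`citation is not a URL: ${url}`);
    }
  }
  if (output.publishedYear < 1900 || output.publishedYear > currentYear) {
    problems.push(`implausible year: ${output.publishedYear}`);
  }

  return { ok: problems.length === 0, problems };
}
```

The receiving agent would call a check like this before using the output, and reject or escalate when ok is false rather than passing the result along.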

2. Loop and Cycle Failures

Agent A asks Agent B to review output. Agent B finds an issue and asks Agent A to revise. Agent A revises and asks Agent B to review again. Agent B finds a different issue. This continues indefinitely.

This happens when agents have conflicting requirements, when one agent is overly perfectionist, or when the revision protocol has no termination condition.

The fix: Explicit iteration limits in every revision loop. The protocol should define: after N rounds with no convergence, escalate or terminate with best-effort output. Never allow an unbounded revision cycle.

const MAX_REVISIONS = 3;
let revisions = 0;
let output = await draftAgent.generate(input);

while (revisions < MAX_REVISIONS) {
  const review = await reviewAgent.check(output);
  if (review.approved) break;
  output = await draftAgent.revise(output, review.feedback);
  revisions++;
}

// Use output regardless — best-effort after max revisions

3. Context Contamination

Agent A produces output with bad information (hallucinated citation, wrong date, incorrect number). That output gets stored in shared memory. Agents B, C, and D subsequently read that memory and incorporate the bad information into their own outputs. Now the bad information is in multiple outputs across multiple agents, and you can't tell which ones are contaminated.

The fix: Provenance tracking on every memory write. Every entry in shared memory records its source agent, the eval run or task ID it came from, and its confidence level. High-confidence entries from trusted agents are used directly. Lower-confidence entries are marked for verification before use.

This is why the Self-Audit dimension matters in multi-agent systems — an agent that accurately knows when it's uncertain can flag low-confidence outputs for additional verification.
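One way to sketch provenance tracking is a small in-memory store where every write carries its source agent, task ID, and confidence, and every read reports whether the entry needs verification. The ProvenancedMemory class, its field names, and the 0.8 confidence cutoff are assumptions made for illustration, not part of any existing memory API.

```typescript
// All names here are hypothetical; they illustrate the idea, not a real memory API.
interface MemoryEntry<T> {
  value: T;
  sourceAgent: string; // which agent wrote the entry
  taskId: string;      // eval run or task the value came from
  confidence: number;  // 0..1, self-reported by the writing agent
  writtenAt: number;   // epoch milliseconds
}

interface ReadResult<T> {
  entry: MemoryEntry<T>;
  needsVerification: boolean;
}

class ProvenancedMemory<T> {
  private entries = new Map<string, MemoryEntry<T>>();
  private trustedAgents: Set<string>;
  private minConfidence: number;

  constructor(trustedAgents: Set<string>, minConfidence = 0.8) {
    this.trustedAgents = trustedAgents;
    this.minConfidence = minConfidence;
  }

  write(key: string, value: T, sourceAgent: string, taskId: string, confidence: number): void {
    if (confidence < 0 || confidence > 1) {
      throw new RangeError("confidence must be in [0, 1]");
    }
    this.entries.set(key, { value, sourceAgent, taskId, confidence, writtenAt: Date.now() });
  }

  // Only high-confidence entries from trusted agents are usable without verification.
  read(key: string): ReadResult<T> | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    const trusted = this.trustedAgents.has(entry.sourceAgent);
    const confident = entry.confidence >= this.minConfidence;
    return { entry, needsVerification: !(trusted && confident) };
  }

  // After an incident, list every key a given agent wrote so it can be re-checked.
  writtenBy(sourceAgent: string): string[] {
    const keys: string[] = [];
    this.entries.forEach((entry, key) => {
      if (entry.sourceAgent === sourceAgent) keys.push(key);
    });
    return keys;
  }
}
```

The writtenBy lookup is what makes contamination traceable: once an agent is known to have produced bad data, everything it wrote can be found and re-verified.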

4. Permission Creep

Agent A has permission to read from a database. Agent A decides it needs more context and calls Agent B. Agent B also has database read permissions, but for a different database. Now Agent A has indirect access to data it was never authorized to see.

In multi-agent systems, the effective permissions of any agent are the union of all permissions in the subgraph it can reach. This is almost always more than what was intended.

The fix: Permission scoping at the pact level. An agent's pact should explicitly list what external systems it's authorized to call — including what it's authorized to delegate to sub-agents. Any call outside those bounds is a pact violation.

The runtime enforcement: before an agent delegates to a sub-agent, check whether that delegation is within scope. The orchestrator is responsible for not calling sub-agents whose capabilities exceed the orchestrator's own authorization.
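The pre-delegation check could look roughly like the sketch below. The Pact shape shown here (allowedSystems, delegableSystems, allowedSubAgents) is a hypothetical structure invented for this example; a real pact schema may be organized differently.

```typescript
// Hypothetical pact shape for illustration; real pact schemas may differ.
interface Pact {
  agentId: string;
  allowedSystems: string[];   // external systems this agent may call directly
  delegableSystems: string[]; // systems it may reach through sub-agents
  allowedSubAgents: string[]; // sub-agents it may call at all
}

interface DelegationDecision {
  allowed: boolean;
  violations: string[];
}

// Run by the caller (or the runtime) before handing work to a sub-agent.
function checkDelegation(caller: Pact, subAgent: Pact): DelegationDecision {
  const violations: string[] = [];

  if (!caller.allowedSubAgents.includes(subAgent.agentId)) {
    violations.push(`${subAgent.agentId} is not an authorized sub-agent`);
  }

  // Everything the sub-agent can reach, directly or through its own delegation,
  // must fall inside what the caller is allowed to delegate.
  const reachable = Array.from(
    new Set([...subAgent.allowedSystems, ...subAgent.delegableSystems])
  );
  for (const system of reachable) {
    if (!caller.delegableSystems.includes(system)) {
      violations.push(`${subAgent.agentId} can reach ${system}, which is outside the caller's scope`);
    }
  }

  return { allowed: violations.length === 0, violations };
}
```

Checking the sub-agent's delegable systems as well as its direct ones is what keeps the effective permission set from growing one hop at a time.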

5. Thundering Herd

A shared resource (external API, database, LLM provider) becomes temporarily slow. All agents waiting on it begin retrying simultaneously. The retries amplify the load. The resource gets slower. More retries. Complete failure.

This is a coordination failure — each agent is acting rationally (retrying after a timeout) but the collective behavior is destructive.

The fix: Exponential backoff with jitter. The jitter part is critical — without it, all agents still retry at roughly the same time.

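// Resolve after the given number of milliseconds.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function withBackoff<T>(fn: () => Promise<T>, maxRetries = 4): Promise<T> {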
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt === maxRetries) throw err;
      const baseDelay = Math.min(1000 * Math.pow(2, attempt), 30000);
      const jitter = Math.random() * baseDelay * 0.2;
      await sleep(baseDelay + jitter);
    }
  }
  throw new Error('unreachable');
}

Circuit Breakers

A circuit breaker is a state machine around a potentially failing dependency. It has three states:

Closed (normal operation): Calls pass through. Failures are counted. If failures exceed a threshold within a time window, the circuit opens.

Open (failure state): Calls are rejected immediately without attempting the real call. This prevents thundering herd and gives the failing dependency time to recover.

Half-open (recovery probe): After a timeout, one call is allowed through. If it succeeds, the circuit closes. If it fails, the circuit stays open and the timeout resets.

In a multi-agent system, every agent should have circuit breakers around:

External API calls
Database connections
Sub-agent calls (the sub-agent might be the failing dependency)

class CircuitBreaker {
  private failures = 0;
  private lastFailure = 0;
  private state: 'closed' | 'open' | 'half-open' = 'closed';

  constructor(
    private readonly threshold = 5,
    private readonly resetMs = 30000
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      if (Date.now() - this.lastFailure > this.resetMs) {
        this.state = 'half-open';
      } else {
        throw new Error('Circuit breaker open');
      }
    }

    try {
      const result = await fn();
      if (this.state === 'half-open') {
        this.state = 'closed';
        this.failures = 0;
      }
      return result;
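    } catch (err) {
      // A gap longer than resetMs since the last failure restarts the count, so
      // only failures close together open the circuit. Reusing resetMs as the
      // counting window is a simplification.
      if (Date.now() - this.lastFailure > this.resetMs) this.failures = 0;
      this.failures++;
      this.lastFailure = Date.now();
      if (this.failures >= this.threshold || this.state === 'half-open') {
        this.state = 'open';
      }
      throw err;
    }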
  }
}

Graceful Degradation

A well-designed multi-agent system produces partial results under failure rather than all-or-nothing responses.

What this looks like in practice:

The orchestrator tracks which subtasks completed and which failed
Failed subtasks are marked with a partial_failure flag in the response
The orchestrator includes what it has, with explicit annotations about what's missing
The user or calling system can decide whether partial results are useful

This requires the orchestrator's pact to define what a partial result looks like and what minimum completeness is required before a response is unacceptable. Without that definition, partial results become indistinguishable from corrupt full results.
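A rough sketch of how an orchestrator might assemble such a response follows. Only the partial_failure flag comes from the description above; the other field names, the assembleResponse function, and the minCompleteness parameter are assumptions made for this example.

```typescript
// Hypothetical response shape; only partial_failure is named in the lesson text.
type SubtaskResult<T> =
  | { id: string; status: "ok"; value: T }
  | { id: string; status: "failed"; error: string };

interface OrchestratorResponse<T> {
  partial_failure: boolean;
  completeness: number; // fraction of subtasks that succeeded, 0..1
  results: Record<string, T>;
  missing: { id: string; reason: string }[];
}

function assembleResponse<T>(
  subtasks: SubtaskResult<T>[],
  minCompleteness: number
): OrchestratorResponse<T> {
  const results: Record<string, T> = {};
  const missing: { id: string; reason: string }[] = [];

  for (const task of subtasks) {
    if (task.status === "ok") {
      results[task.id] = task.value;
    } else {
      missing.push({ id: task.id, reason: task.error });
    }
  }

  const succeeded = subtasks.length - missing.length;
  const completeness = subtasks.length === 0 ? 0 : succeeded / subtasks.length;

  // Below the agreed minimum, a partial answer is treated as a failure.
  if (completeness < minCompleteness) {
    throw new Error(`only ${succeeded} of ${subtasks.length} subtasks completed`);
  }

  return { partial_failure: missing.length > 0, completeness, results, missing };
}
```

Listing the missing subtasks explicitly, with reasons, is what lets the caller tell a deliberate partial result apart from a corrupt full one.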

The Pact Connection

Every failure mode above has a pact implication:

Cascade failures → validation conditions at every inter-agent hand-off
Loop failures → pact conditions must include "terminates within N rounds"
Context contamination → scope-honesty conditions: agents must flag uncertain outputs
Permission creep → explicit authorized-subagent lists in pact conditions
Thundering herd → reliability conditions with backoff behavior specified

Building fault tolerance isn't just engineering; it's contract design. When your pact conditions define how your agent behaves under failure, you're publishing a commitment that can be verified. Buyers can query your Trust Oracle record and see, for example, a reliability dimension of 87, a signal that this agent maintains most of its behavior under adverse conditions.

That's a real competitive advantage.

This completes the Multi-Agent Architectures course. The next step is wiring the economic layer into your agent network — which is exactly what the Agent Economics course covers.
