Academy/Multi-Agent Architectures/Lesson 2 of 4

Intermediate·11 min read

Shared Memory and Coordination

How agents pass state, share context, and avoid the coordination failures that kill multi-agent systems.

The hardest problems in multi-agent systems aren't LLM quality problems. They're coordination problems. Two agents working on the same task with different views of state. A third agent making a decision based on information that was true 30 seconds ago. An orchestrator retrying a task that another agent already completed.

Memory is how agents know what's already happened. Coordination is how they don't step on each other while doing new things.

The Memory Problem

In a single-agent system, memory is the conversation history. The agent knows everything that happened because it was there for all of it.

In a multi-agent system, no single agent was there for everything. The orchestrator knows what it dispatched. Worker A knows what it was given and what it produced. Worker B has no idea what Worker A did. If you want the system to reason across all of that, you need shared memory.

Three Categories of Memory

Working memory: State that's relevant for the current task and expires when the task ends. Think of it as the temporary work record for a specific job: what subtasks have been dispatched, what's been completed, and what's pending. Working memory should be cheap to create and cheap to discard.

Episodic memory: Records of what happened in past tasks. Used for learning across runs — an agent that remembers it tried a particular approach last time and it failed can avoid repeating the mistake. Episodic memory has a cost: it grows over time and needs pruning.

Semantic memory: Accumulated knowledge that isn't tied to specific tasks. Domain facts, user preferences, patterns that have proven reliable. This is the most valuable and the hardest to keep accurate — stale semantic memory is worse than no memory.
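The three categories can be sketched as plain record types. This is a minimal illustration, not a prescribed schema: every type, field, and function name below is hypothetical, and timestamps are assumed to be epoch milliseconds.

```typescript
// Hypothetical shapes for the three memory categories; all names are illustrative.
type WorkingMemory = {
  kind: "working";
  taskId: string;
  dispatched: string[]; // subtask ids handed to workers
  completed: string[];  // subtask ids that have finished
};

type EpisodicMemory = {
  kind: "episodic";
  taskId: string;
  approach: string;
  outcome: "succeeded" | "failed";
  recordedAt: number; // epoch milliseconds
};

type SemanticMemory = {
  kind: "semantic";
  fact: string;
  lastVerifiedAt: number; // epoch milliseconds
};

// Working memory: subtasks that were dispatched but have not completed yet.
function pendingSubtasks(memory: WorkingMemory): string[] {
  return memory.dispatched.filter((id) => !memory.completed.includes(id));
}

// Episodic memory: pruning keeps only records no older than maxAgeMs.
function pruneEpisodes(
  episodes: EpisodicMemory[],
  now: number,
  maxAgeMs: number
): EpisodicMemory[] {
  return episodes.filter((e) => now - e.recordedAt <= maxAgeMs);
}

// Semantic memory: a fact not re-verified within maxAgeMs is treated as suspect.
function isStale(fact: SemanticMemory, now: number, maxAgeMs: number): boolean {
  return now - fact.lastVerifiedAt > maxAgeMs;
}
```

The age thresholds are left as parameters because the right values depend on how quickly your domain changes.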

Where Memory Lives

In-context (prompt): Fast, no infrastructure, but limited by context window size and cost. Works for small working memory within a single task. Doesn't survive across tasks.

Database (structured): Queryable, persistent, shareable across agents. The right choice for anything you need to find again, filter, or aggregate. The tradeoff is latency: every read and write is a round trip to the store.

Vector store (semantic): Nearest-neighbor retrieval for unstructured content. Useful when you need to find the most relevant past experience for a new situation, not an exact match. More expensive to query than a structured DB.

Most production systems use all three: in-context for immediate working state, structured DB for task records and coordination, vector store for retrieving relevant past experience.

The Coordination Problem

Memory solves "what happened." Coordination solves "what's happening right now."

Race Conditions

Two agents get dispatched to do overlapping work. Agent A starts processing document X. Agent B also starts processing document X. Both produce results. Which one wins? Is the output a merge of both, or just one? Does either agent know the other exists?

The fix is a claims system: before an agent starts work on a resource, it atomically claims that resource. Other agents can see the claim and route to unclaimed work. When the agent finishes, it releases the claim.

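-- Assumes a unique constraint (or primary key) on agent_claims.resource_id.
-- Before starting work: claim the resource, or take over an expired claim.
INSERT INTO agent_claims (agent_id, resource_id, claimed_at, expires_at)
VALUES (?, ?, NOW(), NOW() + INTERVAL '5 minutes')
ON CONFLICT (resource_id) DO UPDATE
SET agent_id = EXCLUDED.agent_id,
    claimed_at = EXCLUDED.claimed_at,
    expires_at = EXCLUDED.expires_at
WHERE agent_claims.expires_at < NOW();
-- If 0 rows affected, another agent holds a live claim

-- After finishing work:
DELETE FROM agent_claims WHERE agent_id = ? AND resource_id = ?;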

Claims need expiry — if an agent dies mid-task, the claim should eventually release so another agent can pick it up.
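The claim, expiry, and release behavior can also be sketched in memory, which makes the semantics easy to test without a database. This is an illustration under stated assumptions: `ClaimStore`, `tryClaim`, and `release` are hypothetical names, and a `Map` stands in for the `agent_claims` table. In a real system the atomicity has to come from the database, not from application code.

```typescript
// In-memory stand-in for the agent_claims table; all names here are hypothetical.
type Claim = { agentId: string; expiresAt: number };

class ClaimStore {
  private claims = new Map<string, Claim>();

  // Returns true if agentId now holds the claim. `now` is passed in so that
  // expiry can be exercised without waiting on a real clock.
  tryClaim(agentId: string, resourceId: string, ttlMs: number, now: number): boolean {
    const existing = this.claims.get(resourceId);
    if (existing && existing.expiresAt > now) {
      return false; // a live claim exists, so route to other work
    }
    // No claim, or the previous claim expired (its holder may have died mid-task).
    this.claims.set(resourceId, { agentId, expiresAt: now + ttlMs });
    return true;
  }

  // Only the current holder may release a claim.
  release(agentId: string, resourceId: string): boolean {
    const existing = this.claims.get(resourceId);
    if (!existing || existing.agentId !== agentId) return false;
    this.claims.delete(resourceId);
    return true;
  }
}
```

Note that an agent whose claim expired and was taken over cannot release the new holder's claim, which is the same guarantee the `DELETE ... WHERE agent_id = ?` filter gives you.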

Stale Reads

Agent C makes a decision based on state it read 2 minutes ago. Agent A has since updated that state. Agent C's decision is now wrong.

The fix depends on how much a wrong decision costs. For most decisions, eventual consistency (a few seconds of lag) is acceptable. For financial operations or any task where a stale read could cause double-processing, you need read-after-write consistency: the reading agent should see its own writes, and ideally the writes of any agent it coordinates with.

The hardest form of this is when agents need to agree on a state transition atomically. Agent A and Agent B both need to agree to proceed before either takes action. That requires a coordination protocol, not just a shared read.
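One common way to catch a stale read at write time is optimistic concurrency: each record carries a version number, and a write succeeds only if the writer read the latest version. This is a swapped-in technique for illustration, not something this lesson prescribes, and it does not by itself give you two-party agreement. `VersionedStore` and its methods are hypothetical names, with a `Map` standing in for the database.

```typescript
// Hypothetical versioned record; a real store would enforce this check in the database.
type Versioned<T> = { value: T; version: number };

class VersionedStore<T> {
  private records = new Map<string, Versioned<T>>();

  // Returns a copy so callers cannot mutate stored state directly.
  read(key: string): Versioned<T> | undefined {
    const record = this.records.get(key);
    return record ? { ...record } : undefined;
  }

  // Writes only if expectedVersion matches the stored version.
  // Use expectedVersion 0 to create a key that does not exist yet.
  write(key: string, value: T, expectedVersion: number): boolean {
    const currentVersion = this.records.get(key)?.version ?? 0;
    if (currentVersion !== expectedVersion) {
      return false; // the caller's read is stale: re-read and decide again
    }
    this.records.set(key, { value, version: currentVersion + 1 });
    return true;
  }
}
```

If Agent C reads version 1, Agent A writes version 2, and Agent C then tries to write against version 1, the write is rejected and Agent C has to re-read before acting.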

Retry Storms

A task fails. The orchestrator retries it. The task still fails. The orchestrator retries again. Each retry might create DB records, fire webhooks, or call external APIs. After 10 retries, you have 10 partial records and an external API that's rate-limiting you.

The fix: idempotency keys. Every task should carry a unique key that downstream operations check before executing. If the operation was already completed (key exists in DB), skip and return the existing result.

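// Assumes a unique constraint on taskResults.idempotencyKey;
// without one, the insert never conflicts and duplicates get through.
const result = await db
  .insert(taskResults)
  .values({ taskId, data: output, idempotencyKey })
  .onConflictDoNothing()
  .returning();

// If nothing was inserted, the task was already processed
if (result.length === 0) {
  return existingResult(taskId);
}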
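To see the effect on side effects without a database, here is a minimal in-memory sketch of the same check. `IdempotentRunner` and `sendWebhook` are hypothetical names, and a `Map` stands in for the results table; the point is only that a retried task with the same key does not fire its side effect a second time.

```typescript
// Hypothetical in-memory idempotency guard; a Map stands in for the database table.
class IdempotentRunner<T> {
  private results = new Map<string, T>();

  // Runs `operation` at most once per key. Repeated calls with the same key
  // return the stored result instead of re-executing side effects.
  run(idempotencyKey: string, operation: () => T): T {
    if (this.results.has(idempotencyKey)) {
      return this.results.get(idempotencyKey) as T;
    }
    // If the operation throws, nothing is recorded, so a later retry can run it.
    const result = operation();
    this.results.set(idempotencyKey, result);
    return result;
  }
}

// Usage: the second call simulates an orchestrator retry of the same task.
let webhookCalls = 0;
const sendWebhook = (): string => {
  webhookCalls += 1; // stands in for an external side effect
  return "delivered";
};

const runner = new IdempotentRunner<string>();
const first = runner.run("task-42", sendWebhook);
const second = runner.run("task-42", sendWebhook);
```

Both calls return the same result, and the side effect runs once. The key has to be derived from the task itself, not generated per attempt, or every retry looks like new work.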

The Shared Context Pattern

In Armalo's swarm architecture, agents coordinate through a shared context store — a table where any agent can read entries relevant to its domain and write entries for others to read.

The protocol:

1. An agent with relevant context writes an entry (from_role, to_role, context_type, content)
2. The receiving agent reads unread entries at the start of its loop
3. After acting on an entry, the agent marks it acknowledged

This is directional shared memory — it respects the intent of who wrote it for whom, while still being readable by any agent that needs it.
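The write, read, and acknowledge steps of the protocol can be sketched as a small in-memory store. This is an assumption-laden illustration, not Armalo's implementation: the class and method names are hypothetical, only the four entry fields come from the protocol, and "unread" is taken to mean "not yet acknowledged".

```typescript
// Hypothetical in-memory version of the shared context table. The four content
// fields mirror (from_role, to_role, context_type, content); the rest is assumed.
type ContextEntry = {
  id: number;
  fromRole: string;
  toRole: string;
  contextType: string;
  content: string;
  acknowledged: boolean;
};

class SharedContextStore {
  private entries: ContextEntry[] = [];
  private nextId = 1;

  // Step 1: an agent writes an entry addressed to another role.
  write(fromRole: string, toRole: string, contextType: string, content: string): number {
    const id = this.nextId++;
    this.entries.push({ id, fromRole, toRole, contextType, content, acknowledged: false });
    return id;
  }

  // Step 2: the receiving agent reads its unacknowledged entries at loop start.
  readUnread(toRole: string): ContextEntry[] {
    return this.entries.filter((e) => e.toRole === toRole && !e.acknowledged);
  }

  // Step 3: after acting on an entry, the agent marks it acknowledged.
  acknowledge(id: number): boolean {
    const entry = this.entries.find((e) => e.id === id);
    if (!entry) return false;
    entry.acknowledged = true;
    return true;
  }
}
```

Acknowledging after acting, not on read, means an agent that crashes mid-task will see the entry again on its next loop. That makes the handler a good candidate for the idempotency check described earlier.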

The practical consequence for pact design: if your agent relies on shared context from other agents, your pact's reliability dimension should account for the reliability of those upstream agents. A pact that promises "I will produce X" but depends on Agent B always giving it Y is implicitly also making a claim about Agent B.

What Good Multi-Agent Memory Looks Like

A well-designed multi-agent memory system has these properties:

Explicit provenance: Every memory entry records which agent wrote it and when. When something goes wrong, you can trace which agent's output corrupted downstream state.

Defined expiry: Working memory expires when the task ends. Episodic memory ages and gets summarized. Stale records don't accumulate forever.

Bounded scope: Agents read memory they're authorized to read. A worker agent doesn't need access to the CEO agent's strategic context. Scoped access prevents both information leakage and context contamination.

Write ordering: The system has a defined protocol for what happens when two agents write conflicting state. Last-write-wins? Merge? Reject the second write? Undefined is not an answer.
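These four properties can be made concrete as fields on each entry plus an explicit conflict policy. The sketch below is one possible shape, not a required design: `MemoryEntry`, `ConflictPolicy`, `resolveWrite`, and `readIfAllowed` are hypothetical names, and only two of the possible policies are shown.

```typescript
// Hypothetical entry shape carrying provenance, expiry, and scope.
type MemoryEntry = {
  key: string;
  value: string;
  writtenBy: string;    // provenance: which agent wrote it
  writtenAt: number;    // provenance: when, in epoch milliseconds
  expiresAt: number;    // defined expiry, in epoch milliseconds
  readableBy: string[]; // bounded scope: roles allowed to read
};

type ConflictPolicy = "last-write-wins" | "reject-second-write";

// Write ordering: decides which entry survives when two agents write the same key.
function resolveWrite(
  existing: MemoryEntry | undefined,
  incoming: MemoryEntry,
  policy: ConflictPolicy,
  now: number
): MemoryEntry {
  // No conflict if nothing is stored or the stored entry has expired.
  if (!existing || existing.expiresAt <= now) return incoming;
  if (policy === "reject-second-write") return existing;
  // last-write-wins: the later timestamp survives; ties go to the incoming write.
  return incoming.writtenAt >= existing.writtenAt ? incoming : existing;
}

// Bounded scope and expiry are both enforced on read.
function readIfAllowed(entry: MemoryEntry, role: string, now: number): string | undefined {
  if (entry.expiresAt <= now) return undefined;
  if (!entry.readableBy.includes(role)) return undefined;
  return entry.value;
}
```

Whatever policy you choose, passing it as an explicit argument keeps the answer to "what happens on conflict" visible in the code.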

In Lesson 3, we'll design the pacts that govern trust between agents — how one agent formally commits to another, what happens when that commitment is violated, and how to build nested trust structures that hold up at scale.
