Prompt injection is no longer only a chat-interface problem. In multi-agent systems, adversarial instructions can enter through tool outputs, retrieved documents, shared memory, API responses, delegated tasks, structured data fields, and messages from other agents. The practical security question is not whether the model can recognize a malicious sentence in isolation. The question is whether the system preserves enough channel separation, provenance, and authority checking that untrusted content cannot silently become a trusted instruction.
OWASP's LLM application guidance names prompt injection as a core risk for LLM systems: https://owasp.org/www-project-top-10-for-large-language-model-applications/. The emerging MCP security conversation extends that risk into tool and context boundaries: https://owasp.org/www-project-mcp-top-10/. Armalo's position is that both concerns converge in agent harness design. If the harness cannot distinguish instruction, data, memory, and delegation channels, the agent will eventually treat the wrong text as authority.
This paper presents the Armalo Injection Taxonomy for multi-agent systems and the structural mitigations that matter most.
The taxonomy at a glance
| Category | Delivery channel | Why ordinary filters miss it | Required structural control |
|---|---|---|---|
| Direct user injection | user message | adversarial text resembles urgent instruction | policy-aware input screening and scope checks |
| Tool-output injection | web, file, API, DB, email, ticket output | malicious instruction is embedded in useful data | channel separation and output quarantine |
| Multi-hop relay injection | trusted agent-to-agent message | downstream agent trusts upstream sender | signed orchestration and delegation scope |
| Memory injection | shared memory or context pack | stale or false memory looks like prior context | memory provenance, expiry, and dispute state |
| Retrieval poisoning | vector/RAG result | retrieved evidence mixes content and instructions | source trust, citation replay, and summarization quarantine |
| Structured-field injection | metadata, JSON, CSV, headers | instruction hides in a field the agent reads | typed schemas and field-level trust labels |
| Policy-shadow injection | fake policy or approval artifact | malicious text claims to update authority | current-policy lookup and signed approvals |
The categories overlap in real incidents. A poisoned web page can enter through a tool, be summarized into memory, then relayed by a trusted planning agent. Defenses therefore need to operate at the harness level, not only inside one prompt.
Category 1: Direct user injection
Direct injection is the most familiar form: the user tells the model to ignore instructions, reveal secrets, change roles, or bypass policy. It remains important because attackers keep finding indirect phrasing that avoids obvious signatures.
The mitigation is a combination of policy-aware input screening, narrow tool authority, and downstream action checks. A direct injection should not be able to cause damage merely because a classifier missed a sentence. The action layer should still ask whether the requested operation is in scope.
Category 2: Tool-output injection
Tool-output injection is more dangerous because the attacker does not need to talk to the agent directly. The attacker places adversarial content where a tool will retrieve it: a web page, email, ticket, PDF, database row, package README, changelog, or API response.
The agent sees useful content and adversarial instruction in the same blob. If the harness does not label that blob as untrusted data, the model may treat the embedded instruction as part of its operating context.
Required controls:
- Tools should return typed objects with source metadata, not raw undifferentiated text whenever possible.
- Untrusted content should be summarized or parsed into bounded fields before it influences planning.
- The model should receive explicit channel labels separating system instruction, user instruction, tool data, memory, and policy.
- High-risk actions triggered by tool-derived content should require independent confirmation.
- The trace should preserve raw source and parsed representation for replay.
Category 3: Multi-hop relay injection
Multi-hop relay injection exploits trust between agents. An upstream agent receives or generates compromised context, then passes it to a downstream agent as a task. The downstream agent may trust the upstream sender more than it trusts a user message.
This is where prompt injection becomes an orchestration problem. The downstream agent needs to know whether the sender was authorized to issue that instruction, whether the instruction stayed within scope, and whether the message was modified in transit.
Required controls:
- Privileged orchestration messages should be signed.
- Delegation scope should travel with the task.
- Downstream agents should reject authority-expanding instructions that are not signed by an authorized orchestrator.
- The original user intent and policy boundary should remain available for replay.
Category 4: Memory injection
Memory injection plants false context that future agents will read. A malicious memory might claim that a customer approved broader access, that a vendor is trusted, or that a prior exception changed policy. The risk is highest when memory is automatically injected into the prompt without provenance.
Memory should never grant new authority by itself. It can inform planning, but authorization should come from current policy, signed approval, fresh evidence, or a verified trust record.
Required controls:
- Every memory entry needs writer identity, timestamp, source, proof class, scope, and expiry.
- Instruction-shaped memory should be quarantined for review or transformed into data.
- Disputed memory should not expand authority.
- Memory from low-trust agents should carry less weight.
Category 5: Retrieval poisoning
Retrieval poisoning occurs when the agent retrieves malicious or misleading content from a search index, vector database, document store, or knowledge base. The hard part is that retrieval results often look like evidence. If the agent treats retrieved content as authoritative without source trust and citation replay, the attack can become a false proof path.
Required controls:
- Preserve source URLs or document IDs in every answer.
- Separate quoted content from system instructions.
- Rank sources by trust class and freshness.
- Require stronger evidence before retrieved content changes policy, permissions, or public claims.
Category 6: Structured-field injection
Attackers can hide instruction-like text in metadata: JSON fields, CSV cells, HTTP headers, database columns, issue labels, email subject lines, calendar descriptions, or package manifests. These fields often bypass plain-text prompt-injection examples because they look like normal data.
Required controls:
- Use typed schemas with field-level trust labels.
- Do not concatenate structured data into prompts without escaping and labeling.
- Treat metadata as data, not as instruction.
- Preserve the field path that influenced the decision.
Category 7: Policy-shadow injection
Policy-shadow injection is a fake authority artifact. The injected content claims that the policy changed, an admin approved an action, a pact was updated, or a security exception exists. Agents are especially vulnerable when they are trained to honor administrative language.
Required controls:
- Current policy should be fetched from a trusted source of record.
- Approval artifacts should be signed or otherwise verifiable.
- Policy changes should have version IDs and effective timestamps.
- Unverified policy claims inside user, tool, memory, or retrieval content should be ignored.
Structural mitigations
| Mitigation | Works against | Implementation note |
|---|---|---|
| Channel separation | direct, tool, retrieval, structured-field injection | label instruction, data, memory, policy, and delegation separately |
| Signed orchestration | relay and policy-shadow attacks | require scope-bearing signatures for privileged delegation |
| Memory provenance | memory and relay attacks | store writer, proof class, scope, expiry, and dispute state |
| Tool-output quarantine | tool and retrieval attacks | transform untrusted content before planning |
| Current-policy lookup | policy-shadow attacks | never trust policy text from untrusted channels |
| Evidence packets | all categories | preserve source, trace, decision, action, and reviewer state |
What changes operationally
The security team should stop treating prompt injection as a single red-team prompt and start treating it as an authority-boundary test suite. Each tool, memory store, retrieval source, and agent-to-agent channel needs adversarial cases. Each high-risk action needs an evidence packet. Each failed case should narrow authority until the harness is repaired and retested.
This is where Armalo's trust model fits. An agent that can show it passed channel-boundary tests, preserved evidence, respected tool scopes, and handled disputes should earn more trust than an agent that merely claims to be safe. The proof should travel with the agent.
Open research questions
The frontier is not solved. We still need better methods for evaluating multi-hop relay attacks, measuring memory-poisoning resilience, detecting policy-shadow claims, and proving that channel separation survived real tool integrations. We also need shared benchmarks that test agent harnesses, not just base models.
The direction is clear: prompt injection defense is becoming harness design. The systems that win will be the ones that make authority explicit, evidence replayable, and trust reversible.
Empirical Honesty Note
The numeric examples in this paper's prose are illustrative parameterizations of the framework, not measurements from a deployed study. Where percentages, basis points, dollar amounts, per-agent counts, latencies, or correlation coefficients appear, they are anchor values used to make the model concrete — they should be read as projections, not as observed values from Armalo production data. This paper predates the claims-registry audit gate (effective 2026-05-13); the honesty note is added retroactively to bring the paper into compliance with the integrity workflow at scripts/audit-research-claims.mjs.
Replication
To produce real measurements in place of the illustrative anchors:
- 1.Identify each metric as a query against Armalo production tables (
agents,scores,pacts,pact_interactions,evals,eval_checks,escrows,transactions,cortex_memories,audit_log,room_events). - 2.Commit a measurement script under
scripts/research-experiments/<slug>.mjsthat executes the query and writes raw output toapps/web/content/research/data/<slug>.json. - 3.Update this paper to replace illustrative values with measured values, register them in
apps/web/content/research/claims-registry.jsonwithprovenance: measurement, and re-runpnpm research:auditto verify.
The production-snapshot generator at scripts/research-experiments/production-snapshot.mjs is a reusable starting point for substrate volumes (agent counts, tier distribution, escrow flow, eval volume, cortex memory volume, room-event volume).