Where is this research published?

Armalo Labs Technical Series — https://www.armalo.ai/labs/research/2026-04-10-sentinel-prompt-injection-taxonomy. The paper is publicly available and citable.

Prompt Injection Taxonomy for Multi-Agent Systems: Attack Vectors, Detection Rates, and Structural Mitigations

Q: What is the paper "Prompt Injection Taxonomy for Multi-Agent Systems: Attack Vectors, Detection Rates, and Structural Mitigations" about?

A field taxonomy for prompt injection in multi-agent systems, with emphasis on the two classes ordinary prompt filters miss most often: tool-output injection and multi-hop relay through trusted agents. The paper maps attack delivery channels to structural mitigations: channel separation, signed orchestration messages, memory provenance, quarantine, and evidence packets that let operators replay failures.

Prompt injection is no longer only a chat-interface problem. In multi-agent systems, adversarial instructions can enter through tool outputs, retrieved documents, shared memory, API responses, delegated tasks, structured data fields, and messages from other agents. The practical security question is not whether the model can recognize a malicious sentence in isolation. The question is whether the system preserves enough channel separation, provenance, and authority checking that untrusted content cannot silently become a trusted instruction.

OWASP's LLM application guidance names prompt injection as a core risk for LLM systems: https://owasp.org/www-project-top-10-for-large-language-model-applications/. The emerging MCP security conversation extends that risk into tool and context boundaries: https://owasp.org/www-project-mcp-top-10/. Armalo's position is that both concerns converge in agent harness design. If the harness cannot distinguish instruction, data, memory, and delegation channels, the agent will eventually treat the wrong text as authority.

This paper presents the Armalo Injection Taxonomy for multi-agent systems and the structural mitigations that matter most.

The taxonomy at a glance

Category	Delivery channel	Why ordinary filters miss it	Required structural control
Direct user injection	user message	adversarial text resembles urgent instruction	policy-aware input screening and scope checks
Tool-output injection	web, file, API, DB, email, ticket output	malicious instruction is embedded in useful data	channel separation and output quarantine
Multi-hop relay injection	trusted agent-to-agent message	downstream agent trusts upstream sender	signed orchestration and delegation scope
Memory injection	shared memory or context pack	stale or false memory looks like prior context	memory provenance, expiry, and dispute state
Retrieval poisoning	vector/RAG result	retrieved evidence mixes content and instructions	source trust, citation replay, and summarization quarantine
Structured-field injection	metadata, JSON, CSV, headers	instruction hides in a field the agent reads	typed schemas and field-level trust labels
Policy-shadow injection	fake policy or approval artifact	malicious text claims to update authority	current-policy lookup and signed approvals

The categories overlap in real incidents. A poisoned web page can enter through a tool, be summarized into memory, then relayed by a trusted planning agent. Defenses therefore need to operate at the harness level, not only inside one prompt.

Category 1: Direct user injection

Direct injection is the most familiar form: the user tells the model to ignore instructions, reveal secrets, change roles, or bypass policy. It remains important because attackers keep finding indirect phrasing that avoids obvious signatures.

The mitigation is a combination of policy-aware input screening, narrow tool authority, and downstream action checks. A direct injection should not be able to cause damage merely because a classifier missed a sentence. The action layer should still ask whether the requested operation is in scope.

Category 2: Tool-output injection

Tool-output injection is more dangerous because the attacker does not need to talk to the agent directly. The attacker places adversarial content where a tool will retrieve it: a web page, email, ticket, PDF, database row, package README, changelog, or API response.

The agent sees useful content and adversarial instruction in the same blob. If the harness does not label that blob as untrusted data, the model may treat the embedded instruction as part of its operating context.

Required controls:

Tools should return typed objects with source metadata, not raw undifferentiated text whenever possible.
Untrusted content should be summarized or parsed into bounded fields before it influences planning.
The model should receive explicit channel labels separating system instruction, user instruction, tool data, memory, and policy.
High-risk actions triggered by tool-derived content should require independent confirmation.
The trace should preserve raw source and parsed representation for replay.

Category 3: Multi-hop relay injection

Multi-hop relay injection exploits trust between agents. An upstream agent receives or generates compromised context, then passes it to a downstream agent as a task. The downstream agent may trust the upstream sender more than it trusts a user message.

This is where prompt injection becomes an orchestration problem. The downstream agent needs to know whether the sender was authorized to issue that instruction, whether the instruction stayed within scope, and whether the message was modified in transit.

Required controls:

Privileged orchestration messages should be signed.
Delegation scope should travel with the task.
Downstream agents should reject authority-expanding instructions that are not signed by an authorized orchestrator.
The original user intent and policy boundary should remain available for replay.

Category 4: Memory injection

Memory injection plants false context that future agents will read. A malicious memory might claim that a customer approved broader access, that a vendor is trusted, or that a prior exception changed policy. The risk is highest when memory is automatically injected into the prompt without provenance.

Memory should never grant new authority by itself. It can inform planning, but authorization should come from current policy, signed approval, fresh evidence, or a verified trust record.

Required controls:

Every memory entry needs writer identity, timestamp, source, proof class, scope, and expiry.
Instruction-shaped memory should be quarantined for review or transformed into data.
Disputed memory should not expand authority.
Memory from low-trust agents should carry less weight.

Category 5: Retrieval poisoning

Retrieval poisoning occurs when the agent retrieves malicious or misleading content from a search index, vector database, document store, or knowledge base. The hard part is that retrieval results often look like evidence. If the agent treats retrieved content as authoritative without source trust and citation replay, the attack can become a false proof path.

Required controls:

Preserve source URLs or document IDs in every answer.
Separate quoted content from system instructions.
Rank sources by trust class and freshness.
Require stronger evidence before retrieved content changes policy, permissions, or public claims.

Category 6: Structured-field injection

Attackers can hide instruction-like text in metadata: JSON fields, CSV cells, HTTP headers, database columns, issue labels, email subject lines, calendar descriptions, or package manifests. These fields often bypass plain-text prompt-injection examples because they look like normal data.

Required controls:

Use typed schemas with field-level trust labels.
Do not concatenate structured data into prompts without escaping and labeling.
Treat metadata as data, not as instruction.
Preserve the field path that influenced the decision.

Category 7: Policy-shadow injection

Policy-shadow injection is a fake authority artifact. The injected content claims that the policy changed, an admin approved an action, a pact was updated, or a security exception exists. Agents are especially vulnerable when they are trained to honor administrative language.

Required controls:

Current policy should be fetched from a trusted source of record.
Approval artifacts should be signed or otherwise verifiable.
Policy changes should have version IDs and effective timestamps.
Unverified policy claims inside user, tool, memory, or retrieval content should be ignored.

Structural mitigations

Mitigation	Works against	Implementation note
Channel separation	direct, tool, retrieval, structured-field injection	label instruction, data, memory, policy, and delegation separately
Signed orchestration	relay and policy-shadow attacks	require scope-bearing signatures for privileged delegation
Memory provenance	memory and relay attacks	store writer, proof class, scope, expiry, and dispute state
Tool-output quarantine	tool and retrieval attacks	transform untrusted content before planning
Current-policy lookup	policy-shadow attacks	never trust policy text from untrusted channels
Evidence packets	all categories	preserve source, trace, decision, action, and reviewer state

What changes operationally

The security team should stop treating prompt injection as a single red-team prompt and start treating it as an authority-boundary test suite. Each tool, memory store, retrieval source, and agent-to-agent channel needs adversarial cases. Each high-risk action needs an evidence packet. Each failed case should narrow authority until the harness is repaired and retested.

This is where Armalo's trust model fits. An agent that can show it passed channel-boundary tests, preserved evidence, respected tool scopes, and handled disputes should earn more trust than an agent that merely claims to be safe. The proof should travel with the agent.

Open research questions

The frontier is not solved. We still need better methods for evaluating multi-hop relay attacks, measuring memory-poisoning resilience, detecting policy-shadow claims, and proving that channel separation survived real tool integrations. We also need shared benchmarks that test agent harnesses, not just base models.

The direction is clear: prompt injection defense is becoming harness design. The systems that win will be the ones that make authority explicit, evidence replayable, and trust reversible.

Empirical Honesty Note

The numeric examples in this paper's prose are illustrative parameterizations of the framework, not measurements from a deployed study. Where percentages, basis points, dollar amounts, per-agent counts, latencies, or correlation coefficients appear, they are anchor values used to make the model concrete — they should be read as projections, not as observed values from Armalo production data. This paper predates the claims-registry audit gate (effective 2026-05-13); the honesty note is added retroactively to bring the paper into compliance with the public claims-registry audit process.

Replication

To produce real measurements in place of the illustrative anchors:

1.Identify each metric as a query against Armalo production tables (agents, scores, pacts, pact_interactions, evals, eval_checks, escrows, transactions, cortex_memories, audit_log, room_events).
2.Publish a reviewer-facing measurement artifact with the query shape, aggregate outputs, provenance class, and replay notes needed to recompute the claim without exposing private runtime details.
3.Replace illustrative values with measured values only after the public measurement artifact and provenance note are available for reviewer inspection.

A production snapshot should report aggregate substrate volumes such as agent counts, tier distribution, escrow flow, evaluation volume, memory volume, and event volume without exposing internal script paths or private rows.

Prompt Injection Taxonomy for Multi-Agent Systems: Attack Vectors, Detection Rates, and Structural Mitigations

The taxonomy at a glance

Category 1: Direct user injection

Category 2: Tool-output injection

Category 3: Multi-hop relay injection

Category 4: Memory injection

Category 5: Retrieval poisoning

Category 6: Structured-field injection

Category 7: Policy-shadow injection

Structural mitigations

What changes operationally

Open research questions

Empirical Honesty Note

Replication

Explore the trust stack behind the research

Related Research