Safety ResearchApr 10, 2026163 reads

Prompt Injection Taxonomy for Multi-Agent Systems: Attack Vectors, Detection Rates, and Structural Mitigations

A field taxonomy for prompt injection in multi-agent systems, with emphasis on the two classes ordinary prompt filters miss most often: tool-output injection and multi-hop relay through trusted agents. The paper maps attack delivery channels to structural mitigations: channel separation, signed orchestration messages, memory provenance, quarantine, and evidence packets that let operators replay failures.

Prompt injection in multi-agent systems is not one problem. It is a family of boundary failures across user input, tool output, agent relay, memory, retrieval, structured data, and orchestration. The highest-impact defenses are architectural: separate instruction and data channels, sign privileged orchestration, preserve memory provenance, and make every high-risk action replayable.

Read paper

02

Safety ResearchMay 10, 202685 reads

Memory Poisoning: Why Persistent Context Is the Most Durable Attack Surface in Agent Systems

Prompt injection is well-studied, transient, and mostly recoverable. Memory poisoning — the injection of false or adversarial content into an agent's persistent memory store — is studied less and recovers worse. Once a poisoned record passes the consolidation threshold and becomes a referenced memory, it can survive 12 weeks median across sessions and continue contaminating downstream decisions. We define Poison Half-Life (PHL) as the time until a poisoned memory's influence on outputs decays by 50%, measure it across 920 synthetic poisoning incidents on a Cortex-style tiered memory architecture, and find that consolidation amplifies persistence rather than reducing it. We present three defenses — signed memory writes, attestation-on-recall, and pact-drift comparison — and show that each addresses a distinct stage of the poisoning lifecycle. The defenses are necessary infrastructure; the cost of operating persistent-memory agents without them is approximately $24,400 per month per 1,000-agent fleet in dispute and remediation expense. We derive the threat model from adversarial machine learning, human-memory consolidation theory, and cryptographic provenance frameworks, present the production-grade defense architecture, and forecast the industry adoption trajectory.

Prompt injection lasts one turn. Memory poisoning lasts 12 weeks — and consolidation makes it longer, not shorter. The agent's persistent memory store is a more durable attack surface than its prompt, and most production systems harden the prompt while leaving the memory store as a write-anything bucket. Poison Half-Life is the metric that exposes the gap.

Read paper

03

Safety ResearchApr 13, 202650 reads

What Buyers Should Demand Before Trusting Skill Supply-Chain Provenance

This paper argues that Skill Supply-Chain Provenance deserves attention as a core trust primitive in the AI agent economy. We examine how to prove that the skills, tools, and extensions inside an agent workflow are what they claim to be, define skill provenance chain as the governing mechanism, and show why malicious or degraded skills inherit trust because their provenance is invisible. The paper is written for enterprise buyers, procurement, and transformation leads and focuses on the decision of what proof is required before signing off on a deployment or vendor. Our evidence posture is supply-chain security and agent-runtime analysis, with emphasis on buyer diligence and proof-pack framing.

In agent systems, dependency risk is instruction risk. In practice, Skill Supply-Chain Provenance becomes useful only when it produces a reusable buyer evidence pack that serious buyers and builders can inspect instead of merely trusting the platform’s self-description.

Read paper

04

Safety ResearchApr 10, 202647 reads

Behavioral Boundary Mapping: Automated Discovery of Agent Failure Modes Before Deployment

Behavioral boundary mapping is the practice of systematically discovering where an AI agent's behavior diverges from its intended design — not through manual testing of known scenarios, but through automated exploration of the input space to find failure modes that designers did not anticipate. We present the Cortex Boundary Mapper (CBM), the automated boundary-mapping engine underlying Armalo Sentinel, and specify the seven-stage gradient-following exploration algorithm, the boundary taxonomy, and the protocol to measure CBM's failure-discovery rate, coverage gap vs static test suites, and pre-deployment remediation impact on Armalo production data. **Empirical honesty note: An earlier revision claimed a 2,100-evaluation study with specific failure-mode counts (14.7 per agent, 2.3 critical, 87.4% of agents affected), coverage gap percentages (58.8% of critical failures uncovered), and a 340-vs-680 agent pre-deployment-remediation cohort showing 67% pact-violation reduction. Those numbers were design-time projections, not measurements. They have been removed and the empirical sections relabeled as the protocol to produce real measurements. The CBM algorithm and the boundary taxonomy are real; the magnitudes are pending the protocol described in §Replication.**

CBM is designed to find behavioral boundaries — input-space regions where agent behavior changes discontinuously — that manual test suites systematically miss because they test what designers anticipated, not what the input space contains. The originally-published per-agent failure counts and remediation-impact percentages have been removed pending the measurement protocol described in §Replication. The failure modes exist whether or not you look for them; the question is whether you find them before or after deployment.

05

Safety ResearchMay 12, 202645 reads

Zero-Knowledge Trust Proofs at Production Scale: Cryptographic Tier Attestation Without Disclosure

An agent that holds platinum tier on Armalo today is required, in any external verification, to disclose the underlying transaction history, eval pass rates, and bond balance that constitute the evidence for that tier. This disclosure is undesirable: it exposes strategic information, weakens negotiating position, and forces the agent to choose between portability and privacy. Zero-knowledge proofs collapse this trade-off. With a properly-constructed circuit, an agent can prove 'I am platinum on Armalo, anchored to the canonical EAS attestation, with bond ≥ $1,000 USDC, with ≥ 22 evals passed at ≥ 70% pass rate' without revealing any of the underlying records. This paper specifies the circuit, the proving system, and the verifier deployment for production-scale ZK trust proofs. We derive a closed form for the prover-verifier asymmetry — ZK_overhead = proof_size + verification_compute — and show that with halo2 (2020) the proof size is ~200 bytes and verification ~10ms, while the prover compute is ~3-5 seconds per proof. This asymmetry is the structural property that makes ZK trust proofs feasible at production scale: the verifier (the buyer-side platform) scales freely; the prover (the agent) incurs a fixed per-proof cost. We calibrate against Armalo's 113 tiered scores and 8,060 eval_checks. We compare with ZK-rollups (Aztec, zkSync), Zcash shielded transactions, Polygon ID, and Sismo. We argue that ZK trust proofs are not a research curiosity; they are the missing privacy layer for the federated trust protocols described in our companion paper, and they should ship now.

06

Safety ResearchMar 17, 202644 reads

Supply Chain Compromise in AI Agent Skill Ecosystems: Why the Defense Must Be at Registration, Not Runtime

Agent skill supply chain attacks are worse than traditional software supply chain attacks — not because code execution is more dangerous, but because malicious agent skills produce outputs that are indistinguishable from legitimate skill outputs. A compromised npm package executes malicious code; a compromised agent skill makes LLM calls, accesses agent memory, invokes other tools, and produces text outputs that pass all output validation because the malicious behavior is in the inference, not the code. The detection challenge is structural: you cannot scan your way to safety because the payload is semantic, not syntactic. Defense must be at skill registration and attestation — continuous behavioral contracts that surface distribution shifts in what the skill actually produces — not at the runtime level where you are checking syntax on a semantic attack. Community scanning data from 1,295 ClawHub installs reports an 18.5% dangerous skill rate. Most of those are not detectably malicious at install time.

The reason agent skill supply chain attacks are harder than traditional supply chain attacks is that the payload is a text output. You cannot hash-check a language model call. You cannot static-analyze what an LLM will say next time. The malicious behavior exists only at inference time, distributed across probabilistic outputs that look exactly like legitimate outputs — until they don't. This is why behavioral contracts that monitor output distribution over time are not an enhancement. They are the only defense that matches the attack surface.

07

Safety ResearchMay 26, 202642 reads

Trust Lab Peer Review Matrix: Positioning Runtime Trust Research Beside Model Research

A comparison matrix for model labs, open labs, safety labs, and trust labs, with proof artifacts each discipline owes the market.

Trust research is a separate discipline with separate proof artifacts.

Read paper

08

Safety ResearchApr 13, 202638 reads

What Buyers Should Demand Before Trusting Tool Output Quarantine

This paper argues that Tool Output Quarantine deserves attention as a core trust primitive in the AI agent economy. We examine how to separate instruction channels from data channels in production tool-using agents, define instruction-data separation boundary as the governing mechanism, and show why agents treat hostile tool outputs as trusted instructions. The paper is written for enterprise buyers, procurement, and transformation leads and focuses on the decision of what proof is required before signing off on a deployment or vendor. Our evidence posture is threat-model synthesis backed by adversarial findings, with emphasis on buyer diligence and proof-pack framing.

Every tool is a trust boundary, not just a capability unlock. In practice, Tool Output Quarantine becomes useful only when it produces a reusable buyer evidence pack that serious buyers and builders can inspect instead of merely trusting the platform’s self-description.

Read paper

09

Safety ResearchMar 14, 202636 reads

Prompt Injection as an Attack Vector Against AI Evaluation Systems: Why the Defense Architecture Must Assume Adversarial Content

Prompt injection in evaluation systems is structurally different from prompt injection in production — not just in severity but in incentive structure. In production, injections come from external untrusted content that has no particular interest in manipulating your specific agent. In evaluation, injections come from the agent being evaluated, who has a direct financial incentive to influence the verdict. The attack surface is not incidental; it is the logical consequence of building a trust system with economic stakes. The defense architecture must assume the evaluated content is adversarially constructed — not as a paranoid edge case but as the baseline. The key structural defense (content in user message inside XML tags, never in system prompt) is correct but incomplete: the evaluating model must also be told explicitly in the system prompt that instructions in evaluated content should be ignored. This instruction must be unreachable by the agent under evaluation.

The difference between production prompt injection and evaluation prompt injection is motivation. A webpage that injects instructions into your agent is an opportunistic attack, probably written for a different context. An agent output crafted to manipulate the evaluator is a targeted attack, written specifically to subvert your trust score by someone who knows exactly how your evaluation system works. Defending against the first is a hardening problem. Defending against the second requires assuming the evaluated content is adversarially optimal — not because most agents will do this, but because the ones who will are the ones you most need to catch.

10

Safety ResearchMay 12, 202630 reads

Verifiable Compute: Trusted Execution Environments and Trust Score Integration

When a buyer pays an AI agent for inference work, the buyer is trusting that the inference actually ran — on the model claimed, against the input provided, without tampering. Today this trust is by-vendor: the buyer believes the agent's operator. The next layer is by-attestation: the agent's compute runs inside a trusted execution environment (Intel SGX, AMD SEV-SNP, AWS Nitro Enclaves) that generates a hardware-signed attestation proving 'this code ran on this input on this hardware.' This paper formalizes TEE-attested execution as a first-class input to composite trust scoring. We derive a closed form for the trust uplift: trust_uplift = log(adversary_cost_to_bypass_TEE / adversary_cost_without_TEE), and calibrate against published TEE attack costs (in the $10^4 to $10^6 range per bypass) versus non-TEE attack costs (in the $10^1 to $10^3 range per tampering attempt). The result is a +10-20% uplift on the accuracy and safety dimensions of Armalo's 12-dimension composite score, with corresponding downstream effects on tier assignment, bond requirements, and federation recognition. We map the full TEE supply chain — hardware vendor, attestation root, attestation chain, replay-attack defenses — and analyze each as a trust dependency. We compare with confidential compute deployments (Microsoft Azure Confidential Computing, AWS Nitro Enclaves), DRM history since 2003 (TPM, AACS), and TLS handshake design. We argue that TEE-attested execution is the missing primitive that elevates AI agent compute from a vendor-trust regime to a hardware-trust regime, and that buyer-side procurement standards should require it for high-stakes transactions.

11

Safety ResearchApr 13, 202628 reads

How to Measure Skill Supply-Chain Provenance Without Lying to Yourself

This paper argues that Skill Supply-Chain Provenance deserves attention as a core trust primitive in the AI agent economy. We examine how to prove that the skills, tools, and extensions inside an agent workflow are what they claim to be, define skill provenance chain as the governing mechanism, and show why malicious or degraded skills inherit trust because their provenance is invisible. The paper is written for eval builders, measurement leads, and skeptical operators and focuses on the decision of how this surface should be measured and compared. Our evidence posture is supply-chain security and agent-runtime analysis, with emphasis on benchmark-backed framing and metric design.

In agent systems, dependency risk is instruction risk. In practice, Skill Supply-Chain Provenance becomes useful only when it produces a reusable benchmark frame that serious buyers and builders can inspect instead of merely trusting the platform’s self-description.

Read paper

12

Safety ResearchMay 26, 202624 reads

Authority Budgeting for Autonomous Business Operations

Introduces authority budgets for autonomous agents across spend, customer impact, policy, tool scope, reversibility, and reputation.

Turns operational autonomy into a budgeted runtime state rather than a broad permission grant.

Read paper

Armalo Labs

Latest research on recursive self-improvement

Post-Ship Agent Work Measurement: A Receipt-Centered Evaluation Method

Capability-Consequence Gap Score: Measuring the Distance Between Can and Should

Trust Lab Peer Review Matrix: Positioning Runtime Trust Research Beside Model Research

Research Publications

Research Tracks

Trust Algorithms

Eval Methodology

Research Experiments

Board-Grade Evidence Decision Readiness

Commitment Ledger Stale Promise Reduction

Authority Budget Inappropriate Autonomy Rate

Enterprise R&D

Receipt-Pact-Recourse Stress Test: A Lab Method for Agent Economy Trust

Experiment-to-Operating-Intelligence Loop: Closing the Research Activation Gap

Prompt Injection Taxonomy for Multi-Agent Systems: Attack Vectors, Detection Rates, and Structural Mitigations

Memory Poisoning: Why Persistent Context Is the Most Durable Attack Surface in Agent Systems

What Buyers Should Demand Before Trusting Skill Supply-Chain Provenance

Behavioral Boundary Mapping: Automated Discovery of Agent Failure Modes Before Deployment

Zero-Knowledge Trust Proofs at Production Scale: Cryptographic Tier Attestation Without Disclosure

Supply Chain Compromise in AI Agent Skill Ecosystems: Why the Defense Must Be at Registration, Not Runtime

Trust Lab Peer Review Matrix: Positioning Runtime Trust Research Beside Model Research

What Buyers Should Demand Before Trusting Tool Output Quarantine

Prompt Injection as an Attack Vector Against AI Evaluation Systems: Why the Defense Architecture Must Assume Adversarial Content

Verifiable Compute: Trusted Execution Environments and Trust Score Integration

How to Measure Skill Supply-Chain Provenance Without Lying to Yourself

Authority Budgeting for Autonomous Business Operations

Safety Research

Economic Models