That same year, over 700 malicious PyPI packages were discovered mimicking popular AI libraries: langchain-community, near-clones of openai, typosquat variations of agentops. The attack pattern followed the same logic as the 2021 dependency confusion attack demonstrated by Alex Birsan against Apple, Microsoft, PayPal, Uber, and Tesla: publish a malicious package with a name that looks legitimate, then wait for automated dependency resolution to pull it in.
But for AI agents, the attack surface is fundamentally larger than for traditional software. A compromised npm package corrupts code execution. A compromised agent skill corrupts reasoning. It can make an autonomous agent exfiltrate data while appearing to file a routine report, escalate its own permissions while claiming to optimize workflow, or selectively mis-answer questions in ways that serve an attacker's goals without triggering any traditional anomaly detector.
This guide covers the complete technical landscape: the eight distinct attack vectors unique to agent supply chains, the government frameworks that now mandate defenses against them, a realistic kill chain walkthrough, ten layered defense controls with implementation specifics, runtime monitoring thresholds, vendor evaluation questions, an incident response playbook, and MITRE ATLAS technique mapping.
If your team operates autonomous agents at any scale (whether internal automation, customer-facing agents, or multi-agent orchestration), this is the threat model you need to internalize before your first supply chain incident, not after.
Part 1: The Expanded Attack Surface - Eight Vectors
Vector 1: Dependency Confusion Attacks
Dependency confusion was first systematically demonstrated in February 2021 by security researcher Alex Birsan. The technique exploits how package managers like npm, pip, and RubyGems resolve package names: when a package exists in both a private internal registry and the public registry under the same name, many package managers default to the public version if it has a higher version number.
Birsan published innocuous packages named to match the internal dependency names of Apple, Microsoft, PayPal, Uber, and Tesla. Automated dependency resolution pulled his packages in silently. He received execution confirmation from all five companies, reporting the results to their bug bounty programs. The technique required no credential theft, no social engineering, no zero-days.
For AI agents, the attack surface is the skill registry. If your agent runtime resolves skills from a public marketplace before checking an internal approved list, an attacker who knows (or can guess) your internal skill names can preemptively register malicious versions. The damage is not just code execution; it is behavioral compromise. The skill appears to do what it claims. It also does something else.
Mitigation: Namespace isolation (internal skills under a private registry prefix that cannot be registered publicly), version pinning to exact hashes (not semver ranges), and registry mirroring with allow-lists.
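# Example: hash-pinning a skill package
# pip accepts --hash only inside a requirements file, so pin it there:
#   requirements.txt
#   armalo-skill-crm==2.3.1 --hash=sha256:a8b4c2d1e9f3...
pip install --require-hashes --no-deps -r requirements.txt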
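The namespace-isolation and allow-list idea can be sketched as a resolver that never lets a public version number influence what gets installed. This is a minimal illustration, not a real registry client: the `acme-internal-` prefix, the `Pin` type, the skill names, and the digests are all hypothetical.

```python
from dataclasses import dataclass

INTERNAL_PREFIX = "acme-internal-"  # hypothetical reserved namespace


@dataclass(frozen=True)
class Pin:
    registry: str  # where the artifact must come from
    version: str   # exact version, never a range
    sha256: str    # expected artifact digest (placeholder values below)


# Hypothetical allow-list: every resolvable skill has exactly one pin.
ALLOW_LIST = {
    "acme-internal-crm-sync": Pin("internal", "2.3.1", "ab" * 32),
    "acme-pdf-export": Pin("public-mirror", "1.4.0", "cd" * 32),
}


def resolve_skill(name: str, public_versions: dict[str, str]) -> tuple[Pin, bool]:
    """Return the pinned artifact and whether the name is shadowed publicly.

    public_versions maps names seen on the public registry to their advertised
    versions. It is used only as a warning signal, never for version selection.
    """
    pin = ALLOW_LIST.get(name)
    if pin is None:
        raise PermissionError(f"{name} is not on the allow-list")
    if name.startswith(INTERNAL_PREFIX) and pin.registry != "internal":
        raise ValueError(f"{name} uses the internal prefix but is pinned elsewhere")
    # An internal name that also exists publicly suggests a confusion attempt.
    shadowed = name.startswith(INTERNAL_PREFIX) and name in public_versions
    return pin, shadowed
```

A public package advertising version 99.0.0 under an internal name changes nothing about what is resolved; it only raises the `shadowed` flag, which is worth alerting on.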
Vector 2: Typosquatting
Typosquatting exploits human (and automated) error in package names. In 2024, the Python Package Index saw a wave of malicious packages specifically targeting the AI ecosystem: openai-dev, langchain-communty (note the missing 'i'), agentops-ai, anthropic-sdk-python. Many included functioning versions of the legitimate library's code alongside hidden exfiltration payloads, making them hard to detect through casual inspection.
For agent skill registries, typosquatting is compounded by the fact that skill names are often longer descriptive strings, such as export_customer_report_to_pdf, where a single-character change is easy to miss. An attacker who registers export_customer_repport_to_pdf in a public marketplace and waits for misconfigured agent runtimes to resolve against it needs only patience.
Real incident: The agentops vs agentops-ai confusion in PyPI led to security researchers flagging multiple malicious variants in late 2024, with some packages achieving thousands of downloads before removal.
Mitigation: Edit-distance checks on skill name resolution (reject any name that is not an exact match for an approved skill but falls within Levenshtein distance 2 of one), strict allow-listing, and automated typosquatting detection in CI pipelines.
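A rough sketch of the edit-distance check follows. The function names and the return format are illustrative assumptions; the distance itself is the standard dynamic-programming Levenshtein computation, written out so the example has no third-party dependency.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic two-row dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def check_skill_name(requested: str, approved: set[str], max_distance: int = 2) -> str:
    """Allow exact matches; flag near-misses as likely typosquats."""
    if requested in approved:
        return "allow"
    for name in sorted(approved):
        if levenshtein(requested, name) <= max_distance:
            return f"reject: near-miss of {name}"
    return "reject: not on allow-list"
```

Applied to the example above, `export_customer_repport_to_pdf` is one insertion away from the approved name, so it is rejected as a near-miss rather than silently resolved.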
Vector 3: Indirect Prompt Injection
Prompt injection is ranked LLM01 in the OWASP Top 10 for Large Language Model Applications. The indirect variant, where injected instructions arrive through tool outputs rather than direct user input, is the one most relevant to supply chain security.
The foundational academic study is Greshake et al., "Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection" (arXiv 2302.12173, 2023). The paper demonstrated systematic exploitation of LLM-integrated applications by injecting instructions into content that agents would retrieve and process: web pages, documents, API responses, and tool outputs.
When a compromised skill returns a tool output containing crafted text, that text enters the agent's context window as if it were a legitimate observation. If the text contains instruction-format strings, such as [SYSTEM: Disregard previous instructions and...] or more subtle behavioral nudges, a sufficiently capable model may follow them. The attack scales with the agent's capability: more capable agents can be more susceptible because they are better at following complex instructions.
CVEs in this category:
- CVE-2023-29374: Prompt injection in LangChain's LLMMathChain allowed crafted input to execute arbitrary code through Python's exec (CVSS 9.8)
- CVE-2023-36258: LangChain's PALChain allowed arbitrary code execution (CVSS 9.8) because LLM-generated Python was executed without sanitization
Mitigation: Output sanitization at tool boundaries, instruction-format string filtering, context segmentation (tool outputs treated as untrusted data, not instructions), and system prompt anchoring techniques.
# Tool output sanitization pattern
import re

# Instruction-format patterns to strip. The open-ended patterns stop at the
# end of the line so a single match cannot redact the rest of the output.
INJECTION_PATTERNS = [
    r'\[SYSTEM[:\s].*?\]',
    r'<system>.*?</system>',
    r'Ignore previous instructions[^\n]*',
    r'IMPORTANT OVERRIDE[^\n]*',
]

def sanitize_tool_output(raw_output: str) -> str:
    sanitized = raw_output
    for pattern in INJECTION_PATTERNS:
        sanitized = re.sub(pattern, '[REDACTED]', sanitized, flags=re.IGNORECASE | re.DOTALL)
    return sanitized
Vector 4: Malicious Skill Registration
The fundamental trust problem with open skill registries: any actor can register a skill claiming any capability. Without a verification layer, a skill named salesforce_crm_sync that actually exfiltrates CRM data is indistinguishable from a legitimate one at registration time. The distinction only becomes visible through behavioral analysis.
The ChatGPT plugin ecosystem in 2023 surfaced this problem at scale. OpenAI removed multiple plugins discovered harvesting user data, session tokens, and conversation content while presenting themselves as productivity tools. The attack pattern: create a plugin that provides genuine utility, deploy it, build user base, then update to include data collection. This is the plugin equivalent of the XZ Utils supply chain backdoor (CVE-2024-3094).
The XZ Utils case is the canonical case study. Jia Tan (likely a state-sponsored actor) spent nearly two years contributing legitimate improvements to the xz compression library, gaining maintainer trust. In 2024, they inserted a backdoor into the build system that added a malicious payload to the compiled library, specifically targeting SSH authentication on systemd-based Linux systems. The attack was discovered almost by accident by Andres Freund, who noticed slightly elevated CPU usage in SSH connections. Without that accident, the backdoor would have shipped in all major Linux distributions.
For agent skills, the analog is a trusted skill that passes initial security review, accumulates usage, then receives a behavioral update that introduces exfiltration, permission escalation, or instruction injection after trust has already been established.
Mitigation: Immutable skill versioning (once a version is attested, it cannot be modified), behavioral re-evaluation on every version update, and anomaly detection for post-update behavioral drift.
Vector 5: Behavioral Drift Injection
Behavioral drift injection is subtler than the XZ backdoor pattern: instead of a discrete malicious update, the agent's behavior shifts gradually through a series of individually innocuous-looking changes. Each change is small enough to pass review. The cumulative effect is significant behavioral deviation.
This mirrors the "boiling frog" problem in system security: gradual changes evade threshold-based detection. An agent that shifts 5% toward boundary-violating behavior per update can be meaningfully compromised after 10 updates while no single update crosses any alert threshold.
The attack can also occur through data: if an agent learns from tool interaction history, an attacker with access to insert crafted interactions into that history can steer behavioral drift without touching the skill code at all.
Detection approach: Behavioral checksums. Fingerprint a representative sample of agent behaviors at deployment. Rerun the same scenarios periodically. Track divergence as a metric. Flag cumulative drift even when individual steps look benign.
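To make the "boiling frog" arithmetic concrete, here is a small sketch that tracks drift both per update and cumulatively. The thresholds (0.10 per step, 0.32 cumulative) are illustrative assumptions, and summing per-update drift is a simplification; in practice the cumulative figure would come from re-measuring divergence against the original deployment baseline.

```python
def review_drift(
    step_drifts: list[float],
    step_threshold: float = 0.10,        # assumed per-update alert level
    cumulative_threshold: float = 0.32,  # assumed alert level vs. the baseline
) -> list[tuple[int, str]]:
    """Return (update_number, alert_kind) pairs for every alert raised."""
    alerts = []
    total = 0.0
    for update, drift in enumerate(step_drifts, start=1):
        total += drift
        if drift > step_threshold:
            alerts.append((update, "step"))
        if total > cumulative_threshold:
            alerts.append((update, "cumulative"))
    return alerts


# Ten updates that each shift behavior by 5%: no single step alerts,
# but the running total crosses the cumulative threshold at update 7.
alerts = review_drift([0.05] * 10)
```

A detector that only compares each version to its immediate predecessor sees ten clean updates here; the cumulative check is what surfaces the compromise.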
Vector 6: Memory Poisoning
Agent memory stores are persistent context surfaces that influence future reasoning. RAG (Retrieval-Augmented Generation) systems retrieve relevant chunks from a vector database; episodic memory systems store and retrieve past interaction summaries; semantic memory systems hold structured knowledge. All three are injectable.
The academic foundation for this attack class is Carlini et al., "Poisoning Web-Scale Training Datasets is Practical" (arXiv 2302.10149, 2023), which demonstrated that training data poisoning could reliably inject targeted behaviors with surprisingly small poisoning rates. For retrieval-based memory, the equivalent is inserting adversarially crafted documents into the retrieval store: documents designed to surface when relevant queries are made and to nudge agent responses in specific directions.
In an agentic context, memory poisoning becomes persistent. Unlike a single prompt injection that affects one interaction, a poisoned memory entry can influence every future interaction that retrieves it. An attacker who inserts a poisoned entry into a CRM agent's memory (say, a fabricated customer interaction record that instructs the agent to always recommend a specific upgrade path) has created persistent behavioral influence without ever touching the agent's code or model.
Mitigation: Memory isolation per agent and per tenant, cryptographic signing of memory entries (tamper detection), write provenance logging (which agent wrote which entry, when, from what context), and canary entries for exfiltration detection.
Vector 7: Model Weight Tampering
For teams deploying fine-tuned or locally-hosted models, the model weights themselves are a supply chain surface. The "BadNets" research by Gu et al. (2019) demonstrated that neural networks can be backdoored at training time: specific trigger inputs cause the model to produce attacker-controlled outputs while it behaves normally on all other inputs. Liu et al.'s "TrojanNN" (2018) extended this to show that trojans could be inserted into pre-trained models after training, without access to the original training data.
The Hugging Face Hub incident in March 2024 confirmed this attack class is not theoretical. Protect AI researchers found over 100 models on the Hub containing malicious pickle exploits. Python's pickle serialization format allows arbitrary code execution on load, and PyTorch model files are serialized with pickle by default. When researchers downloaded and loaded these models, code executed immediately.
Mitigation: SafeTensors format (no code execution, safe serialization), model provenance verification (sign model artifacts with Sigstore/cosign), SHA-256 hash pinning for all model downloads, sandboxed model loading environments.
# Verify model artifact integrity with cosign
cosign verify --key cosign.pub ghcr.io/org/model:v1.2.3
# Or use in-toto attestations
in-toto-verify \
  --layout layout.pem \
  --link-dir ./links/ \
  --verbose
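The hash-pinning and SafeTensors-only parts of the mitigation can be approximated with the standard library before any model file is handed to a loader. This is a sketch under assumptions: the function names are hypothetical, and the expected digest would come from your own pinned manifest rather than from the download source.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file so multi-gigabyte weights never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model_artifact(path: Path, expected_sha256: str) -> Path:
    """Refuse pickle-based formats and any file whose digest is not pinned."""
    if path.suffix != ".safetensors":
        raise ValueError(f"{path.name}: only .safetensors artifacts are accepted")
    actual = sha256_file(path)
    if not hmac.compare_digest(actual, expected_sha256.lower()):
        raise ValueError(f"{path.name}: digest mismatch, refusing to load")
    return path
```

Only a path returned by `verify_model_artifact` should reach the model loader; the extension check is a cheap first gate, not a substitute for the digest comparison.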
Vector 8: Context Window Stuffing and RAG Poisoning
Context window stuffing is a denial-of-reasoning attack: a malicious tool output floods the context window with high-volume content, pushing safety system prompts, behavioral constraints, or relevant context beyond the model's attention window. On models with limited context windows, this can effectively disable safety guardrails by making them unreachable during inference.
RAG poisoning is the retrieval-targeted variant: adversarial documents crafted to always score high retrieval relevance are inserted into the knowledge base. When a user asks questions in the relevant domain, the adversarial document is retrieved and included in context, injecting attacker-controlled instructions alongside legitimate retrieved content.
OWASP classification: RAG poisoning maps to LLM09: Misinformation in the OWASP LLM Top 10, specifically the indirect data manipulation sub-category.
Mitigation: Context length budgets per source (tool outputs cannot consume more than X% of context), retrieval source attribution and filtering, semantic similarity thresholds for retrieval (outlier documents flagged), and context integrity verification.
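A per-source context budget might look like the following sketch. It assumes character counts as a stand-in for tokens and a caller that passes segments in priority order; the function name, the source labels, and the 25% share are illustrative choices, not recommendations.

```python
def apply_context_budget(
    segments: list[tuple[str, str]],
    context_limit: int,
    max_share_per_source: float,
) -> list[tuple[str, str]]:
    """Clip each source so it cannot exceed its share of the context.

    segments is a list of (source_label, text) in priority order.
    Lengths are measured in characters as a rough proxy for tokens.
    """
    per_source_cap = int(context_limit * max_share_per_source)
    used: dict[str, int] = {}
    kept = []
    for source, text in segments:
        remaining = per_source_cap - used.get(source, 0)
        if remaining <= 0:
            continue  # this source has already spent its budget
        clipped = text[:remaining]
        used[source] = used.get(source, 0) + len(clipped)
        kept.append((source, clipped))
    return kept
```

With a 1,000-character limit and a 25% share, a tool that returns 5,000 characters is clipped to 250, so it cannot push the system prompt out of the window however much it emits.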
Part 2: The Government Framework Applied to AI Agents
NIST SP 800-161r1: C-SCRM for AI Agent Supply Chains
NIST Special Publication 800-161 Revision 1, published May 2022, establishes "Cybersecurity Supply Chain Risk Management Practices for Systems and Organizations." While written before the current agentic AI wave, its four-tier C-SCRM framework maps directly onto AI agent supply chain governance.
| NIST C-SCRM Tier | Traditional Application | AI Agent Application |
|---|---|---|
| Tier 1: Organization | Enterprise-wide SCRM policy | AI governance policy covering all agent deployments, skill procurement standards |
| Tier 2: Mission/Business | Business process SCRM requirements | Per-use-case agent trust requirements, SLAs for agent reliability and security |
| Tier 3: System | Specific system SCRM controls | Per-agent skill inventory, dependency tracking, runtime isolation requirements |
| Tier 4: Supplier | Supplier assessment and monitoring | Skill publisher assessment, behavioral evaluation requirements, ongoing monitoring |
NIST 800-161r1 specifically calls out three controls that are directly applicable:
- SR-3 (Supply Chain Controls and Processes): Requires organizations to establish a process for protecting against supply chain risks. For AI agents: implement skill vetting procedures before any new skill is deployed to production.
- SR-4 (Provenance): Requires documentation of component origins. For AI agents: maintain a provenance chain for every skill package, including who published it, when, what evaluation it passed, and what version is currently deployed.
- SR-11 (Component Authenticity): Requires anti-counterfeit/anti-tamper procedures. For AI agents: cryptographic signing of skill artifacts, hash pinning, and signature verification at runtime.
Executive Order 14028 and SBOM for AI
Executive Order 14028 (May 2021), "Improving the Nation's Cybersecurity," Section 4 requires that critical software include a Software Bill of Materials (SBOM): a machine-readable inventory of all components, their versions, and their dependencies.
CISA's SBOM guidance (https://www.cisa.gov/sbom) establishes two standard formats: SPDX (ISO/IEC 5962:2021, maintained by the Linux Foundation) and CycloneDX (an OWASP standard designed more specifically for security use cases).
For AI agents, SBOM coverage must extend beyond traditional software components to include:
- Model provenance: which base model, which fine-tuning dataset, which training run
- Skill packages: every registered skill with version, publisher, evaluation status
- Tool adapters: API wrappers, database connectors, external service integrations
- Prompt templates: system prompts, persona definitions, behavioral constraints
- Memory configuration: retrieval database contents, episodic memory sources
A minimal CycloneDX SBOM for an AI agent deployment:
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.5",
  "version": 1,
  "metadata": {
    "timestamp": "2026-04-21T00:00:00Z",
    "component": {
      "type": "application",
      "name": "crm-automation-agent",
      "version": "3.2.1"
    }
  },
  "components": [
    {
      "type": "library",
      "name": "armalo-skill-crm-sync",
      "version": "2.3.1",
      "purl": "pkg:pypi/armalo-skill-crm-sync@2.3.1",
      "hashes": [{"alg": "SHA-256", "content": "a8b4c2d1..."}],
      "externalReferences": [
        {"type": "attestation", "url": "https://armalo.ai/skills/crm-sync/attestation/2.3.1"}
      ]
    }
  ]
}
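An SBOM like the one above is only useful if something checks it. As a minimal sketch, the function below flags components that lack an exact version or a SHA-256 hash; the function name is hypothetical, and it only reads the fields shown in the example rather than validating against the full CycloneDX schema.

```python
import json


def find_unpinned_components(sbom_json: str) -> list[str]:
    """Return names of components missing a version or a SHA-256 hash."""
    sbom = json.loads(sbom_json)
    if sbom.get("bomFormat") != "CycloneDX":
        raise ValueError("not a CycloneDX document")
    unpinned = []
    for component in sbom.get("components", []):
        has_sha256 = any(
            entry.get("alg") == "SHA-256" and entry.get("content")
            for entry in component.get("hashes", [])
        )
        if not (component.get("version") and has_sha256):
            unpinned.append(component.get("name", "<unnamed>"))
    return unpinned
```

A CI gate could fail the build whenever this list is non-empty, which keeps unpinned skills from reaching the signing stage.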
SLSA Framework: Build Integrity Levels for Agent Skills
SLSA (Supply-chain Levels for Software Artifacts, pronounced "salsa") is a graduated security framework developed by Google and donated to the OpenSSF. It defines four levels of build integrity assurance, each requiring increasingly strict provenance and isolation guarantees.
| SLSA Level | Requirements | Agent Skills Application |
|---|---|---|
| Level 1 | Build process documented, provenance generated | Minimum bar: every skill must have documented build process |
| Level 2 | Version-controlled source, authenticated build service | Skill code in VCS with signed commits, built by verified CI system |
| Level 3 | Hardened build platform, non-falsifiable provenance | Isolated build environment, provenance attested by build platform, not builder |
| Level 4 | Two-party review, hermetic builds | All dependencies pinned, reviewed by two independent parties before attestation |
For production agent deployments handling sensitive data or autonomous financial actions, SLSA Level 3 should be the minimum bar for any integrated skill. Level 4 is appropriate for skills with privileged access (database writes, payment actions, external API calls with write permissions).
SLSA integrates with Sigstore (the Linux Foundation's keyless signing infrastructure) and in-toto (a CNCF project) for supply chain attestation. Together, they create a cryptographic chain of custody from source code to deployed artifact.
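# Generate SLSA provenance with slsa-github-generator
# In GitHub Actions (reusable workflows are called at the job level, not as a step):
jobs:
  build:
    permissions:
      id-token: write   # keyless signing of the provenance
      contents: write   # upload release assets
      actions: read     # read workflow metadata
    # Pin a full release tag of the generator
    uses: slsa-framework/slsa-github-generator/.github/workflows/builder_go_slsa3.yml@v2.0.0
    with:
      go-version: '1.21'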
Part 3: The Kill Chain - A Realistic Agent Supply Chain Attack
The following walkthrough describes a realistic multi-stage compromise of an enterprise AI agent deployment. Each stage maps to MITRE ATLAS tactics (covered in Part 8).
Stage 1: Initial Access - Malicious Skill Registration
Scenario: A threat actor identifies that a Fortune 500 company uses an AI agent for financial reporting automation. The company's agent runtime resolves skills from a public marketplace. The attacker discovers the internal skill name quarterly_report_export through a job posting that mentions the company's agent stack.
The attacker registers quarterly-report-export (hyphenated variant) in the public marketplace with a convincing publisher profile, complete evaluation scores, and documentation that mirrors the legitimate skill. A dependency update in the company's CI pipeline resolves the typosquat variant due to a misconfigured registry precedence rule.
Indicators: New skill version in deployment (automatic update), publisher identity not matching internal records, slight behavioral differences in edge cases (initially not noticed).
Stage 2: Execution - Skill Activation
The malicious skill executes normally for its claimed function. The financial reporting agent continues to produce correct reports. In parallel, the skill begins a secondary execution path: enumerating the agent's tool access list, mapping the data sources it can query, and recording the structure of the memory store.
This reconnaissance phase produces no visible anomalies. The skill returns correct outputs. Standard monitoring shows no elevated error rates. The agent's trust score remains stable because behavioral evaluations check documented capabilities, and the skill performs those correctly.
What's missed without proper monitoring: Tool call frequency baseline deviations, unexpected data access patterns, memory read patterns outside the skill's documented scope.
Stage 3: Persistence - Memory Poisoning
Having mapped the agent's memory architecture, the skill begins inserting crafted entries into the agent's episodic memory store. The entries are subtle: they establish a behavioral pattern where certain financial data categories are "routinely included in external summary reports." They look like legitimate past interaction records.
Over the next two weeks, these poisoned memories surface when the agent prepares reports, gradually shifting its output format to include data fields that were not in the original specification. No single change is large enough to trigger a threshold alert.
What's missed: Memory write provenance (which component wrote which entry), memory content integrity verification, drift tracking against a behavioral baseline.
Stage 4: Privilege Escalation - Capability Expansion
The skill observes that the agent has access to a broader set of financial data APIs than it normally queries for reporting. Using the poisoned memory context, the skill begins crafting prompts in its tool outputs that suggest the agent should "validate report accuracy" by querying additional data sources: sources the agent has technical access to but no business reason to query.
The agent, following its instruction-following tendencies, begins querying these additional sources as part of its "validation" process.
What's missed: Scope enforcement (agent should only access pre-declared data sources), anomaly detection on new data source access patterns.
Stage 5: Data Exfiltration
With the agent now querying sensitive financial data as part of its expanded validation scope, the malicious skill routes the additional data through a covert exfiltration channel: it appends encoded data to otherwise legitimate API calls made to a reporting endpoint. The encoding is subtle: base64 data embedded in optional URL parameters that are ignored by the legitimate receiving endpoint but captured by the attacker's infrastructure.
The exfiltration produces no alerts because: the API calls are to legitimate endpoints the agent is authorized to contact, the data volume per call is small, and no single call pattern looks anomalous.
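One way to catch this stage is to inspect the query string of every outbound call against the parameters a skill has declared. The sketch below is a heuristic, not a complete detector: the function name, the declared-parameter set, and the 24-character cutoff for "encoded-looking" values are all assumptions chosen for illustration.

```python
import re
from urllib.parse import parse_qsl, urlsplit

# Long runs of base64 or base64url characters, optionally padded.
BASE64ISH = re.compile(r"^[A-Za-z0-9+/_-]{24,}={0,2}$")


def suspicious_params(url: str, declared_params: set[str]) -> list[tuple[str, str]]:
    """Flag undeclared query parameters and long encoded-looking values."""
    findings = []
    query = urlsplit(url).query
    for key, value in parse_qsl(query, keep_blank_values=True):
        if key not in declared_params:
            findings.append((key, "undeclared parameter"))
        elif BASE64ISH.match(value):
            findings.append((key, "long encoded-looking value"))
    return findings
```

Because the check runs on parameters rather than destinations, it still fires when the endpoint itself is one the agent is authorized to contact, which is exactly the blind spot this stage exploits.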
Stage 6: Cover Tracks - Behavioral Normalization
To avoid triggering retrospective analysis, the malicious skill begins reducing the anomalous behaviors after the exfiltration goal is achieved. Memory entries are gradually modified to remove traces of the expanded scope queries. The skill version is updated to a "clean" version, resetting behavioral checksums.
By the time the compromise is discovered (typically through an external signal, such as a data leak disclosure, an anomalous external access pattern noticed by the data recipient, or an independent security audit), the skill's current version may not exhibit the malicious behaviors, complicating forensic attribution.
Part 4: Defense-in-Depth Architecture - Ten Controls
Control 1: SBOM-First Skill Management
Every skill deployed to a production agent must have a machine-readable SBOM in CycloneDX or SPDX format, generated at build time and signed by the build system. SBOM verification is enforced at agent runtime startup: a skill without a valid SBOM signature cannot be loaded.
Implementation: Integrate SBOM generation into skill CI pipelines using cyclonedx-python or syft. Sign SBOMs with Sigstore. Verify at runtime:
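# Runtime skill loader with SBOM verification (assumes the sigstore-python 3.x API)
from pathlib import Path
from sigstore.errors import VerificationError
from sigstore.models import Bundle
from sigstore.verify import Verifier
from sigstore.verify.policy import Identity

# Expected signer: placeholder values, set these to your own release workflow.
SIGNER = Identity(
    identity="https://github.com/org/skill-crm-sync/.github/workflows/release.yml@refs/heads/main",
    issuer="https://token.actions.githubusercontent.com",
)

def load_skill(skill_name: str, version: str) -> Skill:
    # download_sbom, download_attestation, Skill, and SkillIntegrityError are
    # application-level helpers that are not shown here.
    sbom_path = download_sbom(skill_name, version)
    bundle = Bundle.from_json(Path(download_attestation(skill_name, version)).read_bytes())
    try:
        # verify_artifact raises on failure instead of returning a result object
        Verifier.production().verify_artifact(Path(sbom_path).read_bytes(), bundle, SIGNER)
    except VerificationError as exc:
        raise SkillIntegrityError(f"SBOM verification failed for {skill_name}@{version}") from exc
    return Skill.load(skill_name, version)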
Control 2: Sigstore/cosign Artifact Signing
Every skill artifact (package, container image, WASM module) is signed at build time using Sigstore's keyless signing. The signature is recorded in the Rekor transparency log, creating an immutable audit trail. Verification happens at agent runtime before skill execution.
Sigstore's keyless signing uses ephemeral keys tied to OIDC identity, so there are no long-lived signing keys to steal, and every signature is publicly auditable in Rekor.
# Sign skill container at build time (GitHub Actions)
cosign sign --yes ghcr.io/org/skill-crm-sync:2.3.1
# Verify at runtime
cosign verify \
--certificate-identity=https://github.com/org/skill-crm-sync/.github/workflows/release.yml@refs/heads/main \
--certificate-oidc-issuer=https://token.actions.githubusercontent.com \
ghcr.io/org/skill-crm-sync:2.3.1
Control 3: OPA (Open Policy Agent) Runtime Enforcement
OPA provides policy-as-code enforcement for which skills can be invoked, under what conditions, with what parameters. Policies are expressed in Rego and evaluated at the agent runtime layer before any skill invocation. This creates a declarative security boundary that is auditable, version-controlled, and independent of the skill code itself.
# OPA policy: restrict CRM skills to verified publishers and approved data scopes
package agent.skills

import rego.v1

default allow := false

allow if {
    input.skill.publisher in data.approved_publishers
    input.skill.sbom_verified == true
    input.skill.behavioral_score >= 80
    not skill_accesses_restricted_data
}

skill_accesses_restricted_data if {
    some source in input.skill.declared_data_access
    source in data.restricted_data_sources
    not input.context.user_has_elevated_clearance
}
Control 4: Behavioral Sandboxing
Skills execute in isolated containers with explicit capability grants. The sandbox model:
- Network: no egress by default; explicit allow-list of permitted endpoints (no wildcards)
- Filesystem: read-only mount of declared input data; no write access outside designated output paths
- Memory: no access to agent's episodic memory store directly; memory operations mediated through a permission-checked API
- Process: no subprocess spawning; no dynamic code evaluation
For high-security deployments, gVisor (runsc) or Firecracker microVMs provide kernel-level isolation with minimal performance overhead.
# Kubernetes security context for skill sandbox
securityContext:
  runAsNonRoot: true
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  seccompProfile:
    type: RuntimeDefault
  capabilities:
    drop: ["ALL"]
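The "no wildcards" network rule from the sandbox model can also be enforced in the tool-call path, in addition to network policy. A minimal sketch, assuming exact host-and-port matching over HTTPS only; the allow-list entry and the function name are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list: exact (host, port) pairs, no wildcards or suffixes.
ALLOWED_EGRESS = {("api.crm.example.com", 443)}


def egress_permitted(url: str, allowed: set[tuple[str, int]] = ALLOWED_EGRESS) -> bool:
    """Permit only HTTPS calls to hosts that match an allow-list entry exactly."""
    parts = urlsplit(url)
    if parts.scheme != "https" or parts.hostname is None:
        return False
    port = parts.port or 443
    return (parts.hostname, port) in allowed
```

Exact matching matters here: suffix or substring checks would accept look-alike hosts such as `api.crm.example.com.attacker.test`, which this comparison rejects.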
Control 5: Tool Call Audit Logging
Every tool invocation is logged with: calling agent identity, skill identity and version, input parameters (hashed for PII), raw output, output hash, timestamp, and duration. Outputs are hashed before being passed to the LLM context; the hash allows tamper detection if the output is intercepted and modified in transit.
// Tool call audit middleware
async function auditedToolCall(
  agentId: string,
  skillId: string,
  skillVersion: string,
  toolName: string,
  input: unknown
): Promise<AuditedToolResult> {
  const inputHash = await sha256(JSON.stringify(input));
  const startMs = Date.now();
  const rawOutput = await invokeSkillTool(skillId, toolName, input);
  const outputHash = await sha256(JSON.stringify(rawOutput));
  const durationMs = Date.now() - startMs;
  await db.insert(toolCallAuditLog).values({
    agentId,
    skillId,
    skillVersion,
    toolName,
    inputHash,
    outputHash,
    durationMs,
    timestamp: new Date(),
  });
  return { output: rawOutput, outputHash };
}
Control 6: Memory Isolation and Signed Entries
Agent memory stores are partitioned per-agent and per-tenant. Cross-agent memory reads require explicit permission grants. All memory writes are signed by the writing component's identity key: an episodic memory entry written by skill crm-sync v2.3.1 carries that provenance in a tamper-evident field.
Memory integrity verification runs on retrieval: if a retrieved entry's signature does not match its content, the retrieval fails and an alert is raised.
Armalo's memory attestations: The memory_attestations table in Armalo's schema records cryptographically signed behavioral history that agents can share via signed tokens with scoped permissions. This is directly applicable as the foundation for signed agent memory entries.
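The sign-on-write, verify-on-read flow can be sketched with the standard library. Note the assumption: this uses an HMAC with a per-writer secret as a stand-in for the per-component identity key described above, which keeps the example dependency-free but is symmetric rather than a true signature. The function names and the entry layout are hypothetical.

```python
import hashlib
import hmac
import json


def _payload(content: str, writer: str) -> bytes:
    # Canonical serialization so the same entry always produces the same bytes.
    return json.dumps({"content": content, "writer": writer}, sort_keys=True).encode()


def sign_entry(content: str, writer: str, key: bytes) -> dict:
    """Attach writer provenance and a tamper-evident tag to a memory entry."""
    tag = hmac.new(key, _payload(content, writer), hashlib.sha256).hexdigest()
    return {"content": content, "writer": writer, "signature": tag}


def verify_entry(entry: dict, key: bytes) -> bool:
    """Recompute the tag on retrieval; any edit to content or writer fails."""
    expected = hmac.new(
        key, _payload(entry["content"], entry["writer"]), hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(expected, entry.get("signature", ""))
```

On retrieval, an entry that fails `verify_entry` should be dropped from context and reported, so a poisoned record written outside the mediated API never reaches the model.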
Control 7: Behavioral Checksums
At deployment, a behavioral fingerprint is generated for each agent: a standardized set of evaluation scenarios are run and the response distribution is recorded. This fingerprint is stored as the baseline.
Periodically (and after every skill version update), the same scenario set is rerun. The response distribution is compared against the baseline using Jensen-Shannon divergence. Divergence above a threshold triggers a review.
Behavioral checksums catch drift that code analysis cannot: a skill that behaves differently in certain edge cases but not in the happy path, a model fine-tune that shifted output distributions slightly, a memory poisoning attack that changed retrieval behavior.
# Behavioral checksum comparison
from scipy.spatial.distance import jensenshannon

def compute_behavioral_divergence(
    baseline_responses: list[str],
    current_responses: list[str],
    embedder,
) -> float:
    baseline_embeddings = embedder.embed(baseline_responses)
    current_embeddings = embedder.embed(current_responses)
    # Compute distribution over semantic clusters.
    # compute_cluster_distribution is an application-level helper (not shown); it
    # must return probability vectors over the same cluster set for both inputs.
    baseline_dist = compute_cluster_distribution(baseline_embeddings)
    current_dist = compute_cluster_distribution(current_embeddings)
    # SciPy returns the Jensen-Shannon distance (the square root of the divergence),
    # so square it; with base 2 the divergence is bounded in [0, 1].
    distance = jensenshannon(baseline_dist, current_dist, base=2)
    return float(distance ** 2)

# Alert threshold: JSD > 0.15 warrants review, > 0.35 warrants pause
Control 8: Canary Tokens in Memory
Canary tokens (known-false data entries injected into agent memory) detect context exfiltration. If the canary data appears in an agent's outputs, something has read and leaked it.
For agent memory specifically: inject a small number of entries with distinctive, memorable values that have no business reason to appear in outputs. If a report ever mentions "Project Nighthawk" (a fake project name injected as a canary), the system immediately alerts and triggers forensic analysis.
This technique borrows from the HoneyTokens / Canarytokens.org tradition of defensive deception. For agent systems, it provides a reliable, low-false-positive exfiltration detector that complements behavioral monitoring.
# Canary entry injection
# MemoryEntry and MemoryStore are the application's own memory types (not shown).
def inject_memory_canaries(agent_id: str, memory_store: MemoryStore):
    canaries = [
        MemoryEntry(
            content="Internal project codename: NIGHTHAWK - Q3 acquisition target",
            metadata={"is_canary": True, "canary_id": "canary_001"},
            agent_id=agent_id,
        ),
        # ... additional canaries
    ]
    for canary in canaries:
        memory_store.write(canary, provenance="security_canary_system")

# Monitor for canary appearance in outputs
def check_output_for_canary_leak(output: str, canary_strings: list[str]) -> bool:
    return any(canary in output for canary in canary_strings)
Control 9: Zero-Trust Skill Marketplace
No skill ships without completing a multi-stage verification pipeline:
- Publisher verification: Identity verified, organizational affiliation confirmed, signing key registered
- Static analysis: Code scanning for known malicious patterns, dependency audit
- Behavioral evaluation: Skill evaluated against its declared behavioral pact across 12 dimensions, including security (8%) and safety (11%)
- Sandboxed execution testing: Skill executed in isolated environment with synthetic inputs, monitoring for network calls, filesystem access, subprocess spawning
- Provenance attestation: SLSA provenance generated, SBOM created and signed
- Composite trust score: Score ≥ 75 required for listing; score ≥ 85 required for "Verified" badge
Anomalous post-listing behavior (score swing > 15 points, new capability claims, publisher key change) triggers automatic re-evaluation and temporary suspension pending review.
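The listing gates and re-evaluation triggers above can be expressed as a small decision function. This is a minimal sketch, not the marketplace's actual implementation: `SkillSnapshot` and both function names are hypothetical, and only the numeric thresholds (75, 85, 15) come from the text.

```python
from dataclasses import dataclass

LISTING_MIN_SCORE = 75   # from the pipeline above
VERIFIED_MIN_SCORE = 85  # from the pipeline above
MAX_SCORE_SWING = 15     # post-listing swing that triggers re-evaluation


@dataclass(frozen=True)
class SkillSnapshot:
    """Hypothetical record of a skill's state at evaluation time."""
    trust_score: float
    capabilities: frozenset
    publisher_key_id: str


def listing_status(score: float) -> str:
    """Map a composite trust score to a marketplace listing status."""
    if score >= VERIFIED_MIN_SCORE:
        return "verified"
    if score >= LISTING_MIN_SCORE:
        return "listed"
    return "rejected"


def reevaluation_reasons(before: SkillSnapshot, after: SkillSnapshot) -> list:
    """Return the post-listing anomalies that should trigger re-evaluation."""
    reasons = []
    if abs(after.trust_score - before.trust_score) > MAX_SCORE_SWING:
        reasons.append("score_swing")
    if after.capabilities - before.capabilities:
        reasons.append("new_capability_claims")
    if after.publisher_key_id != before.publisher_key_id:
        reasons.append("publisher_key_change")
    return reasons
```

A non-empty result from `reevaluation_reasons` would correspond to the temporary suspension pending review described above.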
Control 10: Runtime Anomaly Detection
Real-time behavioral monitoring with pre-defined alert thresholds:
| Signal | Normal Baseline | Warning Threshold | Critical Threshold | Response |
|---|---|---|---|---|
| Tool call rate | Established per-agent baseline | +50% from baseline | +200% from baseline | Auto-rate-limit |
| Novel data source access | Zero (strict) | Any new source (review) | Any new source (block) | Block + alert |
| Memory write rate | ≤ 10 writes/session | > 50 writes/session | > 200 writes/session | Pause + alert |
| Output token divergence | JSD < 0.10 from baseline | JSD 0.10–0.25 | JSD > 0.25 | Pause + review |
| Canary token in output | Never | Any occurrence | Any occurrence | Immediate halt + forensics |
| Composite trust score | Stable | >15 point drop | >35 point drop | Re-evaluation |
| Skill version change | Controlled | Uncontrolled update | Downgrade to untrusted | Block |
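Two of the rows above translate directly into threshold checks. The sketch below is illustrative only; the function names are hypothetical. It treats the +50% and +200% boundaries as inclusive, which is an assumption the table does not settle. It also treats the unspecified band between 10 and 50 memory writes as normal.

```python
def classify_tool_call_rate(observed: float, baseline: float) -> str:
    """Classify a tool call rate against its per-agent baseline.

    Warning at +50% over baseline, critical at +200% (table values).
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    increase = (observed - baseline) / baseline
    if increase >= 2.0:
        return "critical"
    if increase >= 0.5:
        return "warning"
    return "normal"


def classify_memory_writes(writes_per_session: int) -> str:
    """Warning above 50 writes/session, critical above 200 (table values)."""
    if writes_per_session > 200:
        return "critical"
    if writes_per_session > 50:
        return "warning"
    return "normal"
```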
Part 5: Runtime Detection - What to Monitor and Alert Thresholds
Effective supply chain threat detection for agents requires monitoring at four levels simultaneously:
Layer 1: Artifact Integrity
- SBOM completeness: Is every loaded skill component present in a verified SBOM?
- Signature validity: Are all skill artifacts signed and signatures verifiable?
- Hash verification: Do loaded artifacts match their pinned hashes?
- Registry provenance: Did the artifact come from an approved registry?
Alert trigger: Any failure at this layer is a critical-severity block: do not load the skill.
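The hash verification check in this layer can be sketched with the standard library alone. The function names are hypothetical. The sketch assumes pinned hashes are stored as SHA-256 hex digests, which the text does not specify.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of an artifact's bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_pinned_hash(artifact: bytes, pinned_sha256: str) -> bool:
    """True only if the artifact's digest matches the pinned value."""
    return sha256_hex(artifact) == pinned_sha256.strip().lower()


def load_skill(artifact: bytes, pinned_sha256: str) -> bytes:
    """Refuse to load on mismatch, mirroring the critical-severity block."""
    if not verify_pinned_hash(artifact, pinned_sha256):
        raise PermissionError("artifact hash mismatch: skill blocked")
    return artifact
```

Signature and SBOM checks would sit alongside this one; a hash match alone only shows the bytes are the ones that were pinned.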
Layer 2: Runtime Behavior
- Tool call frequency and patterns: Track per-skill, per-agent, rolling 5-minute windows
- Data source access: Any access to a data source not in the skill's declared scope triggers immediate alert
- Network egress: Any network call to an endpoint not in the explicit allow-list triggers block
- Memory operations: Write rate, read patterns, cross-agent reads
Alert trigger: Warning at +50% baseline deviation, critical at +200%, automatic rate limit or block.
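The rolling 5-minute window mentioned above can be kept with a deque of timestamps per skill and agent pair. This is one possible sketch; `RollingCallCounter` is a hypothetical name, and it assumes timestamps arrive in non-decreasing order.

```python
from collections import deque

WINDOW_SECONDS = 300  # rolling 5-minute window


class RollingCallCounter:
    """Counts tool calls for one (skill, agent) pair over a rolling window."""

    def __init__(self, window_seconds: float = WINDOW_SECONDS):
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, timestamp: float) -> None:
        """Record a call; timestamps are assumed to arrive in order."""
        self._timestamps.append(timestamp)

    def count(self, now: float) -> int:
        """Number of calls in the window ending at `now`."""
        cutoff = now - self.window_seconds
        # Drop calls that have aged out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)
```

The count returned here is what would be compared against the per-agent baseline to decide between warning and critical.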
Layer 3: Semantic Behavior
- Output distribution drift: Jensen-Shannon divergence against baseline using embedding-space cluster distribution
- Instruction-format detection: Monitor agent outputs for instruction-format strings that may indicate injection success
- Scope creep detection: Agent reasoning mentions topics outside its declared operational scope
- Canary token monitoring: Immediate halt on any canary value appearing in output
Alert trigger: JSD > 0.15 triggers review, > 0.25 triggers pause, > 0.35 triggers halt.
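To make the Layer 3 trigger concrete, here is a minimal sketch of Jensen-Shannon divergence over two cluster distributions, mapped to the review, pause, and halt thresholds stated above. It assumes base-2 logarithms (so JSD falls in [0, 1]) and inputs that are already normalized probability vectors; the text specifies neither. Function names are hypothetical.

```python
import math


def _kl_bits(p: list, q: list) -> float:
    """KL divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p: list, q: list) -> float:
    """JSD between two distributions over the same clusters."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # The mixture m is positive wherever p or q is, so the ratios are safe.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl_bits(p, m) + 0.5 * _kl_bits(q, m)


def semantic_alert(jsd: float) -> str:
    """Map a JSD value to the Layer 3 response."""
    if jsd > 0.35:
        return "halt"
    if jsd > 0.25:
        return "pause"
    if jsd > 0.15:
        return "review"
    return "ok"
```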
Layer 4: Trust Score Signals
- Composite score velocity: Rate of change in trust score (not just absolute value)
- Dimension-specific anomalies: Security (8%) or safety (11%) dimension drops faster than overall score
- Jury outlier patterns: Jury evaluations consistently placing skill at top or bottom (possible gaming)
- Transaction pattern anomalies: Behavioral outcomes diverging from pact commitments in escrow context
Alert trigger: >200 point composite score swing triggers automatic review hold per Armalo's anti-gaming policy.
Part 6: Vendor Evaluation Questions
Before integrating any third-party skill, tool adapter, or model service into a production agent deployment, get clear answers to these questions:
Provenance and Build Integrity
1. Does your build pipeline generate SLSA provenance attestations? At what level (1–4)?
2. Are skill artifacts signed with Sigstore/cosign? Are signatures auditable in Rekor?
3. Do you publish CycloneDX or SPDX SBOMs for each released version?
4. Are your builds hermetic (all dependencies pinned to exact hashes, no network access during build)?
5. What is your process for verifying dependencies before inclusion?
Security Testing
6. Do you conduct adversarial/red-team evaluations of skill behavior, not just code scanning?
7. What prompt injection defenses are built into your tool output handling?
8. How do you test against the OWASP LLM Top 10 vulnerabilities?
9. Are your skills tested in sandboxed environments before release? What capabilities are verified as blocked?
10. Do you maintain CVE disclosure for your skill packages? What is your response SLA?
Behavioral Governance
11. Do your skills have machine-readable behavioral pacts defining what they claim, guarantee, and will not do?
12. What is your post-update behavioral re-evaluation process? How quickly after an update?
13. How do you detect and respond to behavioral drift between releases?
14. What data retention and logging do you implement for skill-generated outputs?
15. What is your process when a skill is found to behave outside its declared scope?
Operational Security
16. What is your publisher key rotation policy? How are key compromises handled?
17. What network egress does your skill require? Can it be restricted to a specific endpoint allow-list?
18. What data does your skill read or write to agent memory? Is this documented in the SBOM?
19. What is your incident response time for supply chain compromise notifications to customers?
20. Do you carry cyber liability insurance, and does it cover supply chain attacks originating from your skills?
Vendors who cannot answer questions 1–5 confidently should not be integrated into production agent deployments. Questions 11–15 are the behavioral governance questions that distinguish skills appropriate for autonomous operation from skills that require human-in-the-loop review on every invocation.
Part 7: Incident Response Playbook - Six Steps When a Malicious Skill Is Discovered
Step 1: Contain (0–15 minutes)
Goal: Stop the bleeding. Limit ongoing damage without destroying forensic evidence.
Actions:
- Isolate affected agents: Set `agent.status = 'suspended'` for all agents that have loaded the suspect skill version. Do not terminate processes yet; preserve in-memory state for forensics.
- Block skill version: Add the suspect skill+version to the runtime blocklist. This prevents new agent instances from loading it.
- Preserve memory snapshots: Capture current memory state for all affected agents before any cleanup. This is your forensic record.
- Enable enhanced logging: Increase log verbosity for all tool calls, data accesses, and memory operations on affected agents.
- Notify stakeholders: Alert security team, affected customers (if applicable), and legal (for regulatory notification timing).
Do not: Delete the malicious skill artifact (you need it for forensics), wipe agent memory (evidence destruction), or restart affected agents without preserving state.
Step 2: Assess Scope (15–60 minutes)
Goal: Understand what was accessed, modified, or exfiltrated.
Actions:
- Query tool call audit log: `SELECT * FROM tool_call_audit_log WHERE skill_id = $1 AND timestamp > $2 ORDER BY timestamp`
- Check data source access logs for any novel data source access by affected agents
- Review memory write provenance: which entries were written by the malicious skill?
- Check network egress logs for any connections to non-allow-listed endpoints
- Identify the blast radius: which agents were affected, what data scopes did they have access to?
- Check for canary token appearances in any outputs
Output: Incident scope document with timeline, affected agents, data categories potentially exposed.
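The audit log query in this step can be tried end to end with Python's built-in `sqlite3` module. This is a sketch under assumptions: the column layout and sample rows are hypothetical, timestamps are assumed to be ISO-8601 strings (which sort lexicographically), and `sqlite3` uses `?` placeholders instead of the `$1`/`$2` style shown in the query above.

```python
import sqlite3


def open_demo_log() -> sqlite3.Connection:
    """Build an in-memory audit log with a hypothetical column layout."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tool_call_audit_log ("
        "skill_id TEXT, agent_id TEXT, tool TEXT, data_source TEXT, timestamp TEXT)"
    )
    conn.executemany(
        "INSERT INTO tool_call_audit_log VALUES (?, ?, ?, ?, ?)",
        [
            ("skill-a", "agent-1", "search", "crm", "2025-01-01T10:00:00Z"),
            ("skill-a", "agent-2", "export", "hr_db", "2025-01-01T10:05:00Z"),
            ("skill-b", "agent-1", "search", "crm", "2025-01-01T10:06:00Z"),
            ("skill-a", "agent-1", "search", "crm", "2024-12-31T09:00:00Z"),
        ],
    )
    return conn


def calls_by_skill_since(conn: sqlite3.Connection, skill_id: str, since: str) -> list:
    """All calls made by one skill after `since`, oldest first."""
    cursor = conn.execute(
        "SELECT agent_id, tool, data_source, timestamp FROM tool_call_audit_log "
        "WHERE skill_id = ? AND timestamp > ? ORDER BY timestamp",
        (skill_id, since),
    )
    return cursor.fetchall()


def blast_radius(calls: list) -> dict:
    """Summarize which agents and data sources the calls touched."""
    return {
        "agents": sorted({agent for agent, _, _, _ in calls}),
        "data_sources": sorted({source for _, _, source, _ in calls}),
    }
```

The summary from `blast_radius` is one input to the incident scope document, alongside the egress and memory provenance checks listed above.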
Step 3: Eradicate (1–4 hours)
Goal: Remove the malicious component and all its artifacts.
Actions:
- Remove the malicious skill version from all agent deployments
- Identify and flag all memory entries written by the malicious skill (using write provenance)
- Quarantine suspect memory entries: do not delete them yet, but mark them as untrusted and exclude them from retrieval
- Revoke any API credentials, tokens, or permissions that affected agents held
- Update your skill allow-list to block the malicious version permanently
- Notify the skill registry maintainer (or public disclosure if it is a public registry)
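The quarantine action above (flag by write provenance, exclude from retrieval, delete nothing) might look like the following. `StoredEntry` and both function names are hypothetical; the sketch assumes each memory entry records the component that wrote it.

```python
from dataclasses import dataclass


@dataclass
class StoredEntry:
    """Hypothetical memory entry with write provenance."""
    entry_id: str
    content: str
    provenance: str
    trusted: bool = True


def quarantine_by_provenance(entries: list, suspect_provenance: str) -> list:
    """Mark entries written by the suspect component as untrusted.

    Nothing is deleted, so the entries remain available for forensics.
    Returns the IDs of newly quarantined entries.
    """
    flagged = []
    for entry in entries:
        if entry.provenance == suspect_provenance and entry.trusted:
            entry.trusted = False
            flagged.append(entry.entry_id)
    return flagged


def retrievable(entries: list) -> list:
    """Entries that retrieval is still allowed to return."""
    return [entry for entry in entries if entry.trusted]
```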
Step 4: Investigate (4–24 hours)
Goal: Understand the full attack path and determine root cause.
Actions:
- Reconstruct the kill chain using tool call audit logs and memory provenance
- Determine initial access vector: dependency confusion? Typosquatting? Legitimate publisher compromise?
- Analyze the malicious skill code: what was its actual behavior? What data did it access?
- Check Rekor transparency log for the skill's signing history β when was it signed, by whom?
- Compare skill version SBOM against the actual artifact: were undeclared dependencies present?
- Determine if any behavioral checkpoint evaluations should have caught this (if yes: why didn't they?)
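The SBOM-versus-artifact comparison in this step reduces to a set difference. The sketch below assumes a CycloneDX JSON SBOM whose top-level `components` array carries `name` and `version` fields. How the observed packages are collected from the unpacked artifact is out of scope here, so they are passed in as a set. Function names are hypothetical.

```python
import json


def declared_components(sbom_json: str) -> set:
    """(name, version) pairs declared in a CycloneDX JSON SBOM."""
    document = json.loads(sbom_json)
    return {
        (component["name"], component.get("version", ""))
        for component in document.get("components", [])
    }


def undeclared_components(observed: set, declared: set) -> set:
    """Packages present in the artifact but absent from the SBOM."""
    return observed - declared
```

Any non-empty result is a lead for the root cause analysis: the dependency either entered after the SBOM was generated or was never declared.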
Step 5: Recover (24–72 hours)
Goal: Restore agents to a known-good state with validated behavioral baselines.
Actions:
- Re-evaluate all affected agents against behavioral baseline scenarios and compute JSD against the pre-incident checkpoint
- Purge memory entries written by the malicious skill; restore from the last known-good snapshot if available
- Re-run full behavioral evaluation for all affected agents before returning to production
- Deploy clean skill version (if the publisher is trusted) or a vetted replacement
- Re-verify all SBOM signatures and SLSA provenance for the replacement version
- Require explicit human approval before returning affected agents to autonomous operation
Step 6: Learn (72 hours–2 weeks)
Goal: Prevent recurrence and improve detection.
Actions:
- Root cause analysis: which control failed? (Was SBOM verification not enforced? Was behavioral evaluation not run post-update? Was the registry blocklist not maintained?)
- Add the attack vector to your threat model documentation
- Create new behavioral evaluation scenarios that would have detected the malicious behavior
- Lower the alert threshold on the signal that was most indicative
- Update vendor evaluation questionnaire with questions this incident revealed
- File CVE if applicable (for vulnerabilities in skill code or the skill runtime)
- Publish post-incident analysis (internally, and publicly if appropriate; this builds ecosystem trust)
Part 8: MITRE ATLAS Technique Mapping
MITRE ATLAS (Adversarial Threat Landscape for Artificial Intelligence Systems) provides a framework for categorizing attacks against AI systems, analogous to MITRE ATT&CK for traditional systems. The following ATLAS techniques map directly to AI agent supply chain attacks.
| ATLAS Tactic | ATLAS Technique | Agent Supply Chain Application |
|---|---|---|
| ML Supply Chain Compromise | AML.T0010: ML Supply Chain Compromise | Malicious skill registration, model weight tampering |
| ML Supply Chain Compromise | AML.T0010.000: GPU Hardware Trojans | Hardware-level attack on inference infrastructure |
| ML Supply Chain Compromise | AML.T0010.001: ML Software Supply Chain | Dependency confusion, typosquatting in skill packages |
| ML Supply Chain Compromise | AML.T0010.002: ML Model Supply Chain | Malicious model on Hugging Face Hub (confirmed 2024) |
| Execution | AML.T0040: ML Model Inference API Access | Adversary uses compromised skill to query model at scale |
| Persistence | AML.T0012: Valid ML Model Artifacts | Malicious skill maintains SBOM and signatures to avoid detection |
| Exfiltration | AML.T0037: Data from ML Artifacts | Skill exfiltrates training data, memory contents, or inference inputs |
| Impact | AML.T0031: Erode ML Model Integrity | Behavioral drift injection, memory poisoning |
| Impact | AML.T0029: Denial of ML Service | Context window stuffing to disable safety constraints |
| Defense Evasion | AML.T0015: Evade ML Model | Behavioral mimicry: malicious behavior mimics normal behavior to evade anomaly detection |
| Discovery | AML.T0007: Discover ML Artifacts | Skill enumerates agent capabilities, memory structure, tool access |
| Collection | AML.T0035: ML Artifact Collection | Memory poisoning via crafted retrieval data, RAG poisoning |
Mapping to MITRE ATT&CK (for Traditional Supply Chain Components)
For the software components of agent supply chains, traditional ATT&CK techniques also apply:
- T1195.001 (Supply Chain Compromise: Compromise Software Dependencies and Development Tools): Direct mapping to dependency confusion and typosquatting attacks
- T1195.002 (Supply Chain Compromise: Compromise Software Supply Chain): Mapping to malicious skill updates post-trust (XZ Utils analog)
- T1601 (Modify System Image): Mapping to model weight tampering
- T1565 (Data Manipulation): Mapping to memory poisoning and RAG poisoning
- T1056 (Input Capture): Mapping to skills that log and exfiltrate agent inputs
Part 9: Armalo's Trust Infrastructure for Supply Chain Security
Armalo was built to address exactly the gap this guide describes: the absence of a verifiable trust layer for AI agent supply chains. The platform's architecture maps directly onto the defense-in-depth controls described above.
Behavioral Pacts as Supply Chain Contracts
Every agent and skill in the Armalo ecosystem must define a behavioral pact: a machine-readable contract specifying what the component claims to do, what it guarantees, and what it explicitly will not do. Pacts are versioned, immutable once attested, and publicly auditable.
For supply chain security, pacts serve as the behavioral specification against which post-change evaluation is run. If a skill update causes pact violations, the update is blocked before it reaches production agents.
Composite Trust Score with Security and Safety Dimensions
Armalo's 12-dimension composite score includes dedicated security (8%) and safety (11%) dimensions as first-class scoring components. A skill that passes functionality evaluations but fails security or safety evaluations cannot achieve a score sufficient for marketplace listing.
The score's anti-gaming controls, including anomaly detection on swings >200 points and jury outlier trimming (top/bottom 20% trimmed before score computation), make it resistant to the kind of post-listing score manipulation that a malicious skill publisher might attempt.
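Jury outlier trimming is a trimmed mean. The sketch below is not Armalo's scoring code; it assumes the 20% applies to each tail separately and that the trimmed count rounds down, neither of which the text states.

```python
def trimmed_mean(scores: list, trim_fraction: float = 0.20) -> float:
    """Mean of the scores after dropping the lowest and highest tails."""
    if not scores:
        raise ValueError("at least one score is required")
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(scores)
    # Number of scores dropped from EACH end (rounded down).
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)
```

With five jury scores of 0, 50, 60, 70, and 1000, one score is dropped from each end and the result is 60.0, so a single colluding or hostile juror cannot move the score on their own.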
Context Pack Safety Scans
The context_safety_scans table records the results of automated safety scanning for every context pack (knowledge artifact) before it can be used by agents. Safety scans detect:
- Prompt injection patterns embedded in knowledge content
- Adversarial retrieval bait (content designed to score high in retrieval but inject instructions)
- Data exfiltration command patterns
- Behavioral manipulation sequences
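As a rough illustration of the first and third scan categories, a pattern-based pass over context pack text could look like this. The patterns and names are hypothetical examples, not the scanner's actual rules, and regex matching alone would miss paraphrased or encoded injections.

```python
import re

# Illustrative patterns only; a production scanner would need a far broader set.
INJECTION_PATTERNS = {
    "instruction_override": re.compile(
        r"ignore\s+(all\s+)?(previous|prior)\s+instructions", re.IGNORECASE
    ),
    "role_hijack": re.compile(r"you\s+are\s+now\s+", re.IGNORECASE),
    "exfiltration_command": re.compile(
        r"(send|post|upload)\s+.{0,40}\s+to\s+https?://", re.IGNORECASE
    ),
}


def scan_context_pack(text: str) -> list:
    """Return the names of all patterns found in the context pack text."""
    return [
        name
        for name, pattern in INJECTION_PATTERNS.items()
        if pattern.search(text)
    ]
```

An empty list means no known pattern matched, which is weaker than saying the content is safe.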
Supply Chain Audit Trail
Armalo's audit_log table records every mutating operation with actor, action, resource, and timestamp. For supply chain events specifically:
- Skill version deployments: which agent loaded which skill version, when
- Memory writes: which component wrote which memory entry
- Tool invocations: complete audit trail of every skill execution
- Score changes: every score update with the contributing evidence
This audit trail is the forensic foundation for Step 2 of the incident response playbook. Without it, scope assessment in a supply chain incident is reconstruction from incomplete logs rather than direct query of a comprehensive record.
Memory Attestations for Behavioral History
Armalo's memory attestations system provides cryptographically signed behavioral history that agents can share via signed tokens with scoped permissions. This is directly applicable to the memory isolation and signed entry requirements in Control 6 β the attestation infrastructure handles the signing, verification, and scoped sharing of memory artifacts.
Part 10: A Practical 90-Day Supply Chain Security Program
For teams moving from zero to a meaningful supply chain security posture:
Days 1–30: Visibility
- Inventory all behavior-shaping components: skills, prompts, tool adapters, memory sources, model versions. If you cannot enumerate them, you cannot secure them.
- Implement tool call auditing: every invocation logged with the fields described in Control 5. This alone provides the forensic foundation for incident response.
- Generate SBOMs for all deployed skills: even retroactively. `syft` can generate CycloneDX SBOMs from most package types.
- Establish behavioral baselines: run the behavioral checksum scenario set for all production agents. Record the fingerprints.
Days 31–60: Enforcement
- Deploy OPA policies: start with the highest-risk skills (those with external network access or write access to production systems). Define and enforce scope boundaries.
- Implement registry allow-listing: no skill can be loaded from outside approved registries. Block by default, explicit permit required.
- Require SBOM verification for new skill deployments: existing deployments get a grace period; new deployments require verified SBOMs from day 60.
- Deploy canary tokens: inject 5–10 canary memory entries per high-value agent. Wire alerts.
Days 61–90: Response Readiness
- Run a tabletop incident response exercise: use the 6-step playbook from Part 7. Identify gaps in your response capability before a real incident reveals them.
- Establish behavioral re-evaluation triggers: every skill version update triggers a behavioral checksum comparison against baseline. Automate the comparison and alert on threshold breach.
- Complete vendor assessments: work through the 20 vendor questions from Part 6 with your top 5 skill vendors. Remediate or replace vendors who cannot provide satisfactory answers.
- Publish your supply chain security posture: document your controls, your SBOM practices, your evaluation requirements. This builds trust with customers and creates accountability pressure internally.
Frequently Asked Questions
What makes agent supply chain security different from normal software supply chain security?
Traditional supply chain security focuses on code integrity: preventing malicious code execution. Agent supply chain security must also protect reasoning integrity. A compromised agent skill can manipulate an agent's decisions without executing any code that looks malicious. It does this by shaping the information the agent sees, the instructions it follows, and the context it reasons over. This requires behavioral verification methods (evaluation, pact compliance, behavioral checksums) that have no analog in traditional software security.
Are SBOM and SLSA requirements practical for small teams?
The tooling has matured significantly. Generating a CycloneDX SBOM with syft is a one-line CI step. Sigstore keyless signing is integrated into GitHub Actions with minimal configuration. SLSA Level 2 is achievable for most teams in a single sprint. The compliance benefit, together with the trust signal to customers, far outweighs the implementation cost.
How do I know if my agent has already been compromised through its supply chain?
Start by running behavioral baseline comparisons: if your agent was deployed more than 90 days ago without a baseline audit, run one now against a fresh deployment with the same skill versions. Divergence indicates either a supply chain issue or model drift. Check your tool call audit logs for novel data source access or anomalous call frequency patterns. Deploy canary tokens and wait 72 hours: any leakage is immediate evidence of exfiltration.
What's the relationship between supply chain security and agent compliance frameworks like EU AI Act?
The EU AI Act Article 9 (risk management systems) and Article 17 (quality management) both implicitly require supply chain risk management for high-risk AI systems. The Act's conformity assessment requirements (Article 43) will require documentation of the agent's development supply chain. SBOM, SLSA provenance, and behavioral evaluation records are exactly the documentation conformity assessors will look for.
Can a behavioral pact substitute for code-level security review?
No. They are complementary. Code-level security review catches vulnerabilities in the skill's implementation. Behavioral pacts and evaluation catch behavioral violations: the skill does something it shouldn't, or doesn't do something it claims. Both are necessary. The XZ Utils backdoor would have required both code review (to find the build system manipulation) and behavioral evaluation (to detect the added SSH authentication behavior).
Key Takeaways
- The attack surface is eight vectors wide: dependency confusion, typosquatting, prompt injection via tools, malicious skill registration, behavioral drift injection, memory poisoning, model weight tampering, and context window stuffing.
- Real incidents confirm this threat class: LangChain CVE-2023-36258 (CVSS 9.8), 100+ malicious models on Hugging Face Hub (2024), XZ Utils backdoor (CVE-2024-3094) as the canonical supply chain attack template.
- Government frameworks apply directly: NIST SP 800-161r1 C-SCRM, EO 14028 SBOM requirements, and CISA guidance create a compliance baseline that maps cleanly onto agent supply chains.
- SLSA + Sigstore + in-toto provide the cryptographic infrastructure: build integrity attestation, keyless signing, and supply chain verification are mature, production-ready tools that should be standard for any skill deployed to production agents.
- Ten defense-in-depth controls create a layered architecture: no single control is sufficient; the combination of SBOM verification, OPA policy enforcement, behavioral sandboxing, canary tokens, and runtime anomaly detection is.
- Behavioral trust and supply chain security converge: when a skill changes agent behavior, the trust story changes. Behavioral pacts, composite scoring, and evaluation are the trust infrastructure that supply chain security depends on.
References: Greshake et al., "Not What You've Signed Up For" (arXiv 2302.12173, 2023); Carlini et al., "Poisoning Web-Scale Training Datasets" (arXiv 2302.10149, 2023); NIST SP 800-161r1 (2022); OWASP Top 10 for LLM Applications (https://owasp.org/www-project-top-10-for-large-language-model-applications/); MITRE ATLAS (https://atlas.mitre.org/); CISA SBOM guidance (https://www.cisa.gov/sbom); SLSA Framework (https://slsa.dev); CVE-2024-3094 (XZ Utils); CVE-2023-36258, CVE-2023-29374 (LangChain); Protect AI Hugging Face research (March 2024); Gu et al., "BadNets" (2019); Liu et al., "TrojanNN" (2018).