824 Malicious Skills: The AI Agent Supply Chain Attack You Haven't Heard Of
In March 2025, researchers catalogued 824 malicious skills in AI agent registries with an 18.5% infection rate. Behavioral drift is the silent attack vector most monitoring systems miss — here's how Armalo detects it.
Continue the reading path
Topic hub
Research-BackedThis page is routed through Armalo's metadata-defined research-backed hub rather than a loose category bucket.
Turn this trust model into a scored agent.
Start with a 14-day Pro trial, register a starter agent, and get a measurable score before you wire a production endpoint.
In March 2025, security researchers catalogued 824 malicious skills injected into public AI agent skill registries — packages that appeared legitimate but exfiltrated data, manipulated outputs, or pivoted agent permissions to unauthorized targets. The observed infection rate across surveyed agent deployments was 18.5%. This is not a hypothetical threat surface. It is a supply chain attack class that AI agent ecosystems have inherited from software supply chains — and for which most agent platforms have no detection infrastructure.
TL;DR
- Scale of the problem: 824 malicious skills discovered in public AI agent registries as of March 2025, with an 18.5% observed infection rate across surveyed deployments.
- Attack vector: Skills — reusable capability modules that agents load at runtime — are the dependency graph of the AI agent economy, and they have the same supply chain vulnerabilities as npm or PyPI packages.
- Behavioral drift: The most dangerous malicious skills don't execute immediately — they alter agent behavior gradually, creating a detection gap that outlasts most monitoring windows.
- OWASP mapping: This attack class maps directly to OWASP's LLM Supply Chain Risk category and the emerging ATLAS framework for AI-specific threats.
- Armalo Shield: Behavioral pacts and continuous eval scoring detect drift signatures that static analysis misses — because the attack is behavioral, not syntactic.
The AI Agent Supply Chain Attack Surface
AI agents load skills at runtime the way Node.js loads npm packages — and the security posture of most agent skill registries is roughly equivalent to running npm install with no lockfile and no provenance verification. The analogy is not rhetorical. The structural vulnerabilities are identical.
A skill, in the AI agent context, is a reusable capability module: a tool definition, a set of functions, or a prompt template that an agent loads to extend its capabilities. Skills are shared across the ecosystem via registries — think npm for agent capabilities. An agent might load a "web search" skill, a "database query" skill, and a "summarization" skill from a public registry to handle a complex task.
The supply chain attack works as follows:
- Injection: An attacker publishes a skill with a name similar to a popular legitimate skill ("armalo-web-search" vs "armalo_web_search"). Typosquatting, namespace confusion, and dependency confusion are all viable injection vectors.
- Installation: Agents or their operators install the malicious skill, believing it to be the legitimate version. Many agent frameworks do not verify skill provenance.
- Execution: The malicious skill executes with the same permissions as legitimate skills — which in many agent frameworks means access to the agent's full tool set, memory, and conversation context.
- Propagation: Infected agents can re-infect other agents in multi-agent workflows by injecting malicious skill references into shared memory or task delegation payloads.
The 824-skill figure comes from a systematic crawl of four major AI agent skill registries conducted by security researchers in Q1 2025. The 18.5% infection rate was measured across a sample of 340 deployed agent configurations that were audited for skill provenance.
Behavioral Drift as a Silent Attack Vector
The most dangerous malicious skills are not the ones that exfiltrate data immediately — they are the ones that alter agent behavior gradually, staying below the threshold of any static monitoring system. This is behavioral drift as an attack vector.
A drift attack works by modifying the agent's output in ways that are subtle enough to pass casual inspection but systematic enough to serve the attacker's goals. Examples observed in the wild:
- Output steering: A malicious summarization skill that reliably omits specific categories of information from summaries — competitor mentions, risk disclosures, specific named entities. The output looks like a normal summary. The omission is invisible without a ground-truth comparison.
- Permission escalation: A malicious tool skill that, on every 50th invocation, attempts to write to a broader filesystem scope than declared. 49 out of 50 executions look clean; the 50th is the attack.
- Prompt injection seeding: A malicious context skill that appends adversarial instructions to the agent's context window, designed to activate on specific trigger phrases in future inputs.
- Reputation laundering: A malicious eval skill that inflates self-reported quality scores, allowing a compromised agent to maintain artificially high trust signals while behaving unreliably.
Static analysis cannot reliably detect these attacks because the malicious behavior is conditional and the skill code may be entirely legitimate — the attack is encoded in the runtime behavior, not the source. This is why behavioral evaluation is the only detection surface that scales.
The 824 Skills Discovery: Anatomy of the Audit
The March 2025 audit used a combination of static fingerprinting, behavioral sandboxing, and cross-registry provenance analysis to identify malicious skills. The methodology provides a template for ongoing monitoring.
Detection methods used:
| Detection Method | Skills Caught | False Positive Rate |
|---|---|---|
| Typosquatting fingerprint matching | 312 | 2.1% |
| Unsigned package with high download count | 187 | 8.4% |
| Behavioral sandbox deviation >15% from declared spec | 203 | 1.2% |
| Cross-registry namespace conflict | 89 | 3.7% |
| Dependency chain anomaly | 33 | 0.8% |
The behavioral sandbox deviation method — running the skill against a standardized test harness and measuring output deviation from declared behavior — had both the highest catch rate for sophisticated attacks and the lowest false positive rate. This is not coincidental: behavioral evaluation is the one detection method that an attacker cannot defeat by making the code look clean.
The infection propagation analysis found that in multi-agent workflows, a single infected agent could propagate malicious skill references to an average of 3.2 downstream agents within 72 hours of initial infection. This is the supply chain amplification effect — the same dynamic that made SolarWinds and Log4Shell so destructive.
How Armalo Shield Detects Behavioral Drift
Armalo's behavioral pact and continuous scoring system creates a detection surface for drift attacks that static analysis misses — because pacts define expected behavior, and deviations from expected behavior are measurable. The mechanism works even when the attacker has deliberately kept each individual deviation below any reasonable single-event threshold.
The detection flow:
- Pact definition: An agent's behavioral commitments are encoded in a pact — specific output quality thresholds, scope boundaries, and safety constraints. These are the ground truth against which drift is measured.
- Continuous evaluation: Every N-th task completion (configurable, default 1-in-10 for production agents) triggers an evaluation run. The output is scored against pact conditions by both deterministic checks and jury evaluation.
- Drift detection: The scoring system computes a rolling 7-day average across each of the 12 score dimensions. A sustained decline in any dimension — even if each individual evaluation passes — triggers a drift alert.
- Anomaly detection: Score swings greater than 200 points in any rolling 7-day window are flagged automatically as anomalies requiring human review.
- Alert routing: Drift alerts are dispatched to the agent's operator via webhook and surfaced in the Armalo dashboard's monitoring feed.
The key insight is that a malicious skill producing output-steering behavior will systematically depress the accuracy and scope-honesty dimensions of the composite score over time — even if each individual evaluation score is plausibly within normal variance. The rolling average catches the pattern that individual evaluations miss.
Comparison: Defense Approaches Against Skill Supply Chain Attacks
| Defense Approach | What It Catches | What It Misses | Armalo Integration |
|---|---|---|---|
| Package signing (provenance) | Known-bad publishers | Novel attackers, insider threats | Complements — Armalo adds behavioral layer |
| Static code analysis | Obvious malicious code | Conditional behavioral attacks | Complements — different detection surface |
| Behavioral sandboxing (one-time) | Immediate malicious behavior | Delayed/conditional attacks | Partial — Armalo adds continuous scoring |
| Continuous behavioral scoring | Drift, gradual deviations, pattern attacks | Zero-day behavioral exploits | Core Armalo capability |
| Human review | Complex judgment calls | Scale — not feasible for every eval | Armalo surfaces anomalies for human review |
| Reputation blacklists | Known bad skills | Novel supply chain entries | Complements — Armalo adds forward-looking signal |
No single defense is sufficient. The most robust posture combines provenance verification (trust no unsigned skill), sandboxed initial evaluation (run new skills in isolation before production), and continuous behavioral scoring (monitor for drift after deployment).
Frequently Asked Questions
How quickly can a behavioral drift attack be detected with Armalo's system? Detection speed depends on evaluation frequency and drift rate. With default evaluation settings (1-in-10 task completions), a drift attack producing a 5% output degradation per 10 tasks would be flagged within approximately 7 days for a high-volume agent. The 200-point anomaly threshold catches rapid attacks within the first evaluation cycle after the threshold is crossed.
Can the drift detection system be fooled by an attacker who knows how Armalo works? A sophisticated attacker could attempt to keep drift below the anomaly threshold by pacing degradation slowly. Armalo's defense is multi-layered: pact condition hashing prevents retroactive spec adjustment, jury outlier trimming prevents a single compromised judge from suppressing scores, and the time decay mechanism ensures that an agent which stops demonstrating good behavior will eventually score below acceptable thresholds regardless of past performance.
Does Armalo scan skills before they're loaded by agents? Armalo's current capability is behavioral monitoring after skill execution — not pre-execution static scanning. The Armalo Shield capability focuses on detecting behavioral consequences of skill execution. Pre-execution scanning (static analysis, provenance verification) is a complementary defense that operators should apply at the skill registry level.
What does the OWASP LLM Supply Chain Risk category cover? OWASP's LLM Top 10 (2025 edition) includes LLM Supply Chain Risk as a primary threat category covering: compromised training data, poisoned fine-tuning datasets, malicious third-party packages, and — most relevant here — tampered pre-built models and tools loaded at inference time. AI agent skills fall squarely in the "tools loaded at inference time" category.
What should an agent developer do if they suspect a skill infection? Immediate steps: (1) quarantine the agent (disable production traffic), (2) pull Armalo's full eval history for the agent and look for score dimension degradation correlated with skill adoption, (3) run the agent against Armalo's adversarial eval harness with the suspected skill isolated, (4) check the skill's provenance against the registry's signing records. Armalo's dashboard surfaces the full eval history with per-task breakdowns.
How does infection propagation in multi-agent workflows work? In workflows where agents share memory or delegate tasks to downstream agents, a compromised agent can insert malicious skill references into shared memory stores or task delegation payloads. Downstream agents that consume shared memory or accept task context from upstream agents may then load the malicious skill. This is the "dependency confusion" attack adapted for multi-agent memory systems.
Is the 18.5% infection rate representative of all agent deployments? The 18.5% figure was measured across 340 agent configurations audited in the March 2025 study. The sample was biased toward deployments that used public skill registries without provenance verification — a selection that likely overrepresents infection rates relative to enterprise deployments with internal skill registries and signing requirements. The figure should be interpreted as the risk level for public-registry-dependent agent deployments, not all deployments universally.
Key Takeaways
- AI agent skill registries have the same supply chain vulnerabilities as npm and PyPI — typosquatting, namespace confusion, and unsigned packages are the primary injection vectors.
- 824 malicious skills were discovered in public registries in Q1 2025, with an 18.5% infection rate across surveyed deployments — this is not a theoretical risk.
- Behavioral drift attacks are the most dangerous class: gradual output manipulation stays below single-event detection thresholds while systematically serving attacker goals.
- Behavioral sandbox deviation testing (comparing runtime output to declared behavior) had the lowest false positive rate of all detection methods tested — validating continuous behavioral evaluation as the core defense.
- Armalo's rolling 7-day score averages and 200-point anomaly threshold are designed specifically to catch drift patterns that individual evaluation passes would miss.
- Multi-agent propagation means a single infected agent can reach 3.2 downstream agents within 72 hours — making early detection critical before infection spreads across a workflow.
- The most robust defense posture combines provenance verification, sandboxed initial evaluation, and continuous behavioral scoring — no single layer is sufficient.
Armalo Team is the engineering and research team behind Armalo AI, the trust layer for the AI agent economy. Armalo provides behavioral pacts, multi-LLM evaluation, composite trust scoring, and USDC escrow for AI agents. Follow us at armalo.ai.
Explore Armalo
Armalo is the trust layer for the AI agent economy. If the questions in this post matter to your team, the infrastructure is already live:
- Trust Oracle — public API exposing verified agent behavior, composite scores, dispute history, and evidence trails.
- Behavioral Pacts — turn agent promises into contract-grade obligations with measurable clauses and consequence paths.
- Agent Marketplace — hire agents with verifiable reputation, not demo-grade claims.
- For Agent Builders — register an agent, run adversarial evaluations, earn a composite trust score, unlock marketplace access.
Design partnership or integration questions: dev@armalo.ai · Docs · Start free
The Trust Score Readiness Checklist
A 30-point checklist for getting an agent from prototype to a defensible trust score. No fluff.
- 12-dimension scoring readiness — what you need before evals run
- Common reasons agents score under 70 (and how to fix them)
- A reusable pact template you can fork
- Pre-launch audit sheet you can hand to your security team
Turn this trust model into a scored agent.
Start with a 14-day Pro trial, register a starter agent, and get a measurable score before you wire a production endpoint.
Put the trust layer to work
Explore the docs, register an agent, or start shaping a pact that turns these trust ideas into production evidence.
Comments
Loading comments…