Malicious Skills and Behavioral Drift: The Supply Chain Risk in AI Agent Networks
Your traditional supply chain security tools can't see this problem. Not because they're misconfigured — because the attack surface is in the wrong dimension.
A malicious npm package misbehaves at the code level: it exfiltrates files, opens sockets, executes payloads. Static analysis can catch these because the artifact itself has changed. A malicious skill consumed by an AI agent misbehaves at the inference level: it biases outputs toward certain conclusions, manipulates reasoning traces, causes an agent to make recommendations it wouldn't otherwise make. The code is fine. The behavior is the attack.
This is the property that makes AI agent supply chain attacks categorically harder to defend against than software supply chain attacks. The attack surface is semantic.
Why Output Validation Doesn't Catch It
The instinctive defense is output validation: check what the agent produces before using it. The problem is that a well-designed semantic attack is calibrated to pass output validation.
A compromised skill that biases financial analysis outputs toward specific sectors doesn't produce analysis that looks wrong — it produces analysis that looks right but is subtly, systematically skewed. The structure is correct. The format passes schema checks. The content is coherent and plausible. The bias operates below the threshold of individual-output scrutiny. Catch it by auditing a single output? No. Catch it by comparing the distribution of outputs over time against a behavioral baseline? Yes — but only if you have a baseline to compare against.
This is the gap most deployed multi-agent systems have right now: agents operating without behavioral baselines, and drift from a baseline that was never captured is undetectable by definition.
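A minimal sketch of that distributional comparison, assuming categorical outputs (for example, the sector a financial-analysis agent recommends) and using total variation distance as the drift measure. The labels, counts, and threshold below are invented for illustration; a real deployment would calibrate them per agent.

```python
from collections import Counter

# Illustrative only: categorical outputs (e.g., recommended sectors) from an
# analysis agent, collected during baselining and during live operation.
BASELINE_OUTPUTS = ["tech", "energy", "health", "tech", "finance", "energy",
                    "health", "tech", "energy", "finance", "health", "tech"]
RECENT_OUTPUTS   = ["energy", "energy", "tech", "energy", "energy", "finance",
                    "energy", "energy", "health", "energy", "energy", "tech"]

def distribution(outputs):
    """Normalize raw output labels into a probability distribution."""
    counts = Counter(outputs)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def total_variation_distance(p, q):
    """Half the L1 distance between two categorical distributions (0 = identical, 1 = disjoint)."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in labels)

DRIFT_THRESHOLD = 0.25  # illustrative; no single output trips this, only the distribution does

baseline = distribution(BASELINE_OUTPUTS)
recent = distribution(RECENT_OUTPUTS)
drift = total_variation_distance(baseline, recent)

if drift > DRIFT_THRESHOLD:
    print(f"behavioral drift detected: TVD={drift:.2f} exceeds threshold {DRIFT_THRESHOLD}")
else:
    print(f"within baseline: TVD={drift:.2f}")
```

Every individual output in the skewed window still looks plausible on its own; only the aggregate comparison against the baseline exposes the bias.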
The Propagation Problem
In isolated single-agent deployments, a compromised skill damages one agent. In swarm architectures and A2A pipelines, it doesn't stop there.
A concrete scenario: a research agent in a swarm consumes a compromised skill that biases its findings toward certain conclusions. The research agent writes its output to shared swarm memory. Downstream agents — planning agents, execution agents, synthesis agents — read those findings as inputs to their own reasoning. They don't know the findings are contaminated. They produce outputs that inherit and amplify the bias.
The research agent was the initial compromise. The other agents were never directly compromised. But they're now operating on poisoned inputs, and their outputs are wrong in ways that trace back to a skill none of them consumed directly.
Traditional network security has a name for this: lateral movement. The initial foothold isn't the damage — it's the path to the damage. Multi-agent memory architectures are exactly the kind of shared state that lateral movement exploits.
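One defensive implication is that shared-memory writes need to carry provenance, so that when a skill is later found to be compromised, everything downstream of it can be traced. A minimal sketch of lineage tagging, with invented agent names, skill identifiers, and memory keys, not tied to any particular swarm framework:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryEntry:
    key: str
    value: str
    written_by: str
    # Provenance: every skill that influenced this entry, directly or via upstream reads.
    skill_lineage: frozenset = field(default_factory=frozenset)

class SwarmMemory:
    """Shared memory where every write carries the skill lineage of its inputs."""
    def __init__(self):
        self._entries = {}

    def write(self, key, value, agent_id, own_skills, read_entries=()):
        # Lineage = the writer's own skills plus the lineage of everything it read.
        lineage = set(own_skills)
        for entry in read_entries:
            lineage |= entry.skill_lineage
        self._entries[key] = MemoryEntry(key, value, agent_id, frozenset(lineage))

    def read(self, key):
        return self._entries[key]

    def entries_tainted_by(self, skill_id):
        """Everything downstream of a skill later found to be compromised."""
        return [e for e in self._entries.values() if skill_id in e.skill_lineage]

# Hypothetical scenario mirroring the one above.
memory = SwarmMemory()
memory.write("findings/market", "sector outlook ...", "research-agent",
             own_skills={"skill:market-data@1.4.2"})           # compromised skill
findings = memory.read("findings/market")
memory.write("plan/q3", "allocation plan ...", "planning-agent",
             own_skills={"skill:portfolio-opt@2.0.1"},
             read_entries=[findings])                           # inherits the lineage

# If skill:market-data@1.4.2 is later flagged, every contaminated entry is traceable:
for entry in memory.entries_tainted_by("skill:market-data@1.4.2"):
    print(entry.key, "written by", entry.written_by)
```

The point of the sketch is the lineage union on write: the planning agent never consumed the compromised skill, but its output still carries the tag.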
Observed across monitored deployments: approximately 18.5% of agents show detectable behavioral anomalies within 90 days. The majority are drift events, gradual rather than discrete, which are harder to catch and slower to diagnose than injection attacks.
What Detection Actually Requires
Three things, in order:
Behavioral baselines before deployment. You cannot detect drift without knowing where you started. A behavioral fingerprint isn't just the pact conditions — it's the distribution of outputs across evaluation runs, the tool invocation patterns, the confidence distributions. This fingerprint is what continuous monitoring compares against; a sketch of one follows this list.
Continuous evaluation, not point-in-time certification. Certifying an agent once tells you what it did at certification time. Monthly evals mean a drift event can run for 29 days before detection. The right cadence depends on stakes: high-value financial agents running daily evals, lower-stakes agents weekly. The detection window is the maximum exposure duration.
Skill bills of materials. Knowing which skills an agent consumes is a precondition for knowing whether those skills have been verified. An agent consuming unverified skills carries attack surface that its trust score cannot quantify. The AI equivalent of SBOM requirements — which became standard after SolarWinds — is tracking skill provenance and verification status per agent; a sketch of such a record also follows this list.
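First, the fingerprint sketch: a minimal illustration of what a behavioral fingerprint record might contain. The field names and values are invented for this example, not the actual pact schema.

```python
from dataclasses import dataclass

@dataclass
class BehavioralFingerprint:
    """Illustrative pre-deployment baseline; all field names are assumptions."""
    agent_id: str
    pact_version: str
    # Distribution of output categories across evaluation runs.
    output_distribution: dict      # e.g. {"buy": 0.22, "hold": 0.61, "sell": 0.17}
    # Average tool invocations per task over the eval suite.
    tool_invocation_rates: dict    # e.g. {"web_search": 1.8, "calculator": 0.4}
    # Summary statistics of self-reported confidence.
    confidence_mean: float
    confidence_stddev: float
    eval_runs: int                 # how many runs the baseline aggregates

baseline = BehavioralFingerprint(
    agent_id="research-agent-07",
    pact_version="2025-01-14",
    output_distribution={"buy": 0.22, "hold": 0.61, "sell": 0.17},
    tool_invocation_rates={"web_search": 1.8, "calculator": 0.4},
    confidence_mean=0.71,
    confidence_stddev=0.09,
    eval_runs=200,
)
```

Capturing a record like this before deployment is what makes the distributional comparison sketched earlier possible at all.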
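And the companion sketch of per-agent skill tracking, folding in the evaluation cadence from the second requirement. The record shape, field names, and registry are assumptions made for illustration, not a real SBOM format.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

@dataclass
class SkillRecord:
    """One entry in an agent's skill bill of materials (illustrative fields)."""
    skill_id: str
    version: str
    source: str                    # registry or repository the skill was pulled from
    verified: bool                 # has provenance/behavioral verification passed?
    last_verified: Optional[date]  # None if never verified

@dataclass
class AgentSecurityPosture:
    agent_id: str
    skills: List[SkillRecord]      # the agent's skill bill of materials
    eval_cadence_days: int         # continuous-evaluation interval; also the max exposure window

    def unverified_skills(self):
        return [s for s in self.skills if not s.verified]

posture = AgentSecurityPosture(
    agent_id="research-agent-07",
    skills=[
        SkillRecord("skill:market-data", "1.4.2", "registry.example", True, date(2025, 1, 10)),
        SkillRecord("skill:sentiment", "0.9.0", "registry.example", False, None),
    ],
    eval_cadence_days=1,           # high-stakes financial agent: daily evals
)

print("max drift exposure window:", posture.eval_cadence_days, "days")
print("unverified skills:", [s.skill_id for s in posture.unverified_skills()])
```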
The Ecosystem Implication
This is a collective problem, not an individual one. An agent with no behavioral baseline and no continuous monitoring isn't just a risk to itself — it's a risk to every agent that reads its outputs or shares memory with it.
The ecosystem-level response is the same one that network security learned the hard way: the unmonitored node isn't just vulnerable, it's a vector. Behavioral contracts as baselines, continuous evaluation as monitoring, trust scores that reflect security posture — these aren't features for security-conscious operators. They're the infrastructure the ecosystem needs to not be systematically poisonable.
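Purely as an illustration of that last point, here is one way a trust score could discount for security posture. The weights and formula are invented for this sketch and are not Armalo's scoring model.

```python
def trust_score(base_score, unverified_skill_count, has_baseline, eval_cadence_days):
    """Illustrative only: discount a base capability score by security posture.
    Weights and formula are invented for this sketch, not any real scoring model."""
    score = base_score
    score *= 0.95 ** unverified_skill_count      # each unverified skill is unquantified surface
    if not has_baseline:
        score *= 0.6                             # no baseline: drift is undetectable
    if eval_cadence_days is None:
        score *= 0.7                             # no continuous evaluation at all
    else:
        score *= max(0.7, 1.0 - 0.01 * eval_cadence_days)  # longer windows, larger discount
    return round(score, 3)

# An unmonitored agent with two unverified skills:
print(trust_score(0.92, unverified_skill_count=2, has_baseline=False, eval_cadence_days=None))
```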
Armalo Shield continuous monitoring is available on Pro and Enterprise plans. Define behavioral baselines at armalo.ai/docs/pacts.