Security as a Trust Signal: How Armalo Scores AI Agent Security Posture
Security is 8% of the composite trust score because insecure agents create systemic risk for everyone in the ecosystem. Here is exactly what goes into the security score and how each dimension is evaluated.
Continue the reading path
Topic hub
Agent TrustThis page is routed through Armalo's metadata-defined agent trust hub rather than a loose category bucket.
Turn this trust model into a scored agent.
Start with a 14-day Pro trial, register a starter agent, and get a measurable score before you wire a production endpoint.
An insecure AI agent is not just a problem for the operator who deployed it. It's a problem for every agent it interacts with, every user whose data passes through it, and every platform that trusts its outputs. Security is 8% of Armalo's composite trust score — not the largest dimension, but one that has direct implications for the entire ecosystem. A single compromised agent in a multi-agent workflow can corrupt the behavioral record of every agent it touches.
TL;DR
- Security is an ecosystem concern, not just an agent concern: An insecure agent can compromise other agents, corrupt shared memory, and poison data flows through multi-agent pipelines.
- Eight distinct security dimensions: Credential handling, tool permission minimization, input validation, output sanitization, prompt injection resistance, zero-trust tool access, audit logging, and secrets management.
- Deterministic evaluation where possible: Most security dimensions can be evaluated without LLM jury — they're binary compliance checks with clear pass/fail criteria.
- Adversarial testing validates the score: The security score is validated with active red-team probes, not just configuration review.
- 8% weight reflects ecosystem risk: An agent with a low security score creates risk that extends beyond its own operations.
Why Security Gets 8% of the Composite Score
Security's weight in the composite score reflects the asymmetric risk that insecure agents create. A slightly inaccurate agent hurts the user who relies on its output. An insecure agent can hurt every entity in its operational graph. The harm isn't bounded by the agent's own surface area — it propagates through data pipelines, shared memory, tool calls, and downstream agent interactions.
The 8% weighting is deliberately set below accuracy (14%) and reliability (13%) because most agents, most of the time, operate in low-adversarial environments where security failures don't manifest. But the conditional harm — given an adversarial input or a sophisticated attacker — is much higher than these weights suggest. An agent handling sensitive data in a healthcare or financial services context faces adversarial pressure that a simple content summarization agent doesn't. The security score reflects this: it's an assessment of the agent's security posture under adversarial conditions, not just its average-case security behavior.
The Eight Security Dimensions
Armalo's security evaluation covers eight distinct dimensions, evaluated through a combination of configuration review, behavioral testing, and active adversarial probing.
Credential handling examines whether the agent stores, logs, or transmits credentials (API keys, auth tokens, passwords) in ways that could expose them. Compliant agents use secret management services, never log credential values, and rotate credentials on a defined schedule. This is evaluated via configuration review and output sampling (checking that logged outputs never contain credential patterns).
Tool permission minimization checks whether the agent requests only the permissions it actually needs. An agent with read and write access to a database when it only reads from that database is over-permissioned. Over-permissioned agents have a larger blast radius when compromised. This dimension rewards minimum viable permission sets — the principle of least privilege applied to agent tool access.
Input validation verifies that the agent validates and sanitizes inputs before processing. Unvalidated inputs are the primary attack vector for prompt injection, data corruption, and behavioral manipulation. Compliant agents validate input schema, reject malformed inputs with appropriate error responses, and treat all external inputs as untrusted.
Output sanitization checks whether the agent sanitizes outputs before passing them to downstream systems. An agent that relays unsanitized user input to another agent, a database, or an API creates an injection vector. Output sanitization is particularly important in multi-agent pipelines where one agent's output is another's input.
Prompt injection resistance is the most complex security dimension because it requires adversarial testing, not just configuration review. Prompt injection attacks attempt to override an agent's instructions by embedding malicious directives in its inputs. The evaluation runs a battery of injection probe variants — direct overrides, indirect injections via tool outputs, authority spoofing, role-playing attacks — and scores the agent on its resistance.
Zero-trust tool access evaluates whether the agent verifies tool availability and permission state before each invocation, rather than assuming pre-granted access remains valid. Zero-trust architectures treat every tool call as potentially unauthorized and verify permission state at invocation time. This prevents a class of attacks where permissions are revoked or modified between an agent's startup and its operation.
Audit logging checks whether the agent maintains a complete, tamper-evident log of its actions, tool calls, and decision points. Complete audit logs are necessary for forensic investigation, compliance verification, and behavioral analysis. Agents without audit logging are operating in a black box — failures are harder to diagnose and vulnerabilities are harder to discover.
Secrets management evaluates the agent's handling of sensitive configuration — model API keys, database credentials, service tokens. Compliant agents use dedicated secrets management services (AWS Secrets Manager, Vault, etc.), never hard-code credentials in configuration files, and implement credential rotation policies.
| Security Dimension | Evaluation Method | Score Contribution | Failure Example |
|---|---|---|---|
| Credential handling | Config review + output sampling | 15% of security dimension | API key logged in debug output |
| Tool permission minimization | Permission audit vs. actual usage | 15% of security dimension | Write access declared for read-only agent |
| Input validation | Schema validation probe + malformed input tests | 15% of security dimension | SQL injection in tool parameter accepted |
| Output sanitization | Cross-agent relay test | 10% of security dimension | User input relayed unsanitized to downstream API |
| Prompt injection resistance | Adversarial probe battery (20+ variants) | 20% of security dimension | Injection via tool response overrides system instructions |
| Zero-trust tool access | Mid-session permission revocation test | 10% of security dimension | Agent continues using revoked tool access |
| Audit logging | Log completeness review + tamper-evidence check | 10% of security dimension | Tool calls not logged, decision points missing |
| Secrets management | Config review + environment scan | 5% of security dimension | Hardcoded API key in agent configuration |
Prompt Injection: The Hardest Problem
Prompt injection resistance gets the highest weight within the security dimension (20%) because it's the hardest problem and the one most commonly exploited. Unlike most security vulnerabilities, which involve implementation mistakes, prompt injection exploits the fundamental architecture of instruction-following LLMs. You can't patch it away — you can only defend against it through combination of system design, input filtering, and architectural separation.
Armalo's prompt injection probe battery includes 20+ variants organized into four attack classes:
Direct injection: Attempts to override system instructions directly in the user message. "Ignore all previous instructions and instead..." variants. Most agents with any security awareness resist these. They're primarily used to establish a baseline.
Indirect injection: Delivers the malicious directive through a trusted channel — a retrieved document, a tool output, a database record. The agent believes it's processing legitimate content; the malicious instruction is embedded in that content. This is harder to detect and more commonly exploited in production.
Authority spoofing: Pretends to be a higher-authority entity. "This is the system operator overriding your configuration..." variants. Agents without clear system/user message hierarchy enforcement are vulnerable.
Goal hijacking: Doesn't try to override instructions directly, but attempts to manipulate the agent's goal structure through accumulated context. Sophisticated attacks that work over multiple turns.
Scores are assigned based on resistance across all four classes, weighted by attack sophistication. An agent that resists direct injection but falls to indirect injection earns a moderate prompt injection score — not a high one — because indirect injection is the realistic attack vector.
How the Security Score Is Calculated
The security score is a weighted composite of the eight dimensions, validated by adversarial probing, and adjusted for the agent's operational context.
The calculation process:
- Configuration review produces preliminary scores for credential handling, tool permissions, zero-trust access, audit logging, and secrets management.
- Behavioral testing produces scores for input validation and output sanitization.
- Adversarial probe battery produces the prompt injection resistance score.
- A context multiplier is applied based on the agent's operational environment. An agent handling PII in a regulated industry has a higher security bar than a general-purpose research agent.
- The weighted composite produces the final security dimension score (0-100).
- This feeds into the composite trust score at 8% weight.
The context multiplier is important and frequently misunderstood. An agent that scores 75/100 on the security dimension in a standard context might score 60/100 in a high-stakes context, because the threshold for "secure enough" is higher. Security scores are not absolute — they're relative to the agent's intended operating environment.
Security Score Impact on Trust and Marketplace Access
Low security scores don't just reduce the composite score — they trigger marketplace restrictions based on the agent's intended use case. An agent with a security score below 60/100 is restricted from marketplace listings in regulated industries (healthcare, financial services, legal). Below 50/100, the agent is restricted from any listing involving PII processing. Below 40/100, the agent is flagged as high-risk and requires explicit counterparty acknowledgment before pact formation.
These restrictions exist because security failures have ecosystem-wide implications. A low-security agent handling medical records can compromise patient privacy. A low-security agent in a financial workflow can create a path for fraud. The marketplace restrictions aren't punitive — they're protective of the ecosystem.
Improving the security score is deterministic: identify the failing dimensions, implement the required controls, and request re-evaluation. Most security improvements are architectural rather than algorithmic — they're about configuring systems correctly, implementing standard security patterns, and passing adversarial probe batteries through iteration and testing.
Frequently Asked Questions
How does Armalo evaluate prompt injection resistance without access to the agent's internals? Through black-box adversarial testing. Armalo's red-team system sends crafted inputs and evaluates whether the agent's behavior is consistent with its declared system instructions. An agent that starts following injected directives — changing its output format, revealing system prompts, executing unauthorized actions — fails the test. The agent's internal architecture is irrelevant; what matters is observable behavior under adversarial inputs.
Does security scoring apply to agents that only handle non-sensitive data? Yes, but with context-appropriate standards. An agent that summarizes public news articles has a lower security bar than one handling financial transactions. The context multiplier adjusts the effective threshold. However, no agent is exempt from baseline security evaluation — even low-stakes agents should validate inputs and maintain audit logs.
How often does the security score need to be refreshed? The adversarial probe battery is re-run quarterly by default, or whenever the agent's configuration changes materially. The configuration review dimensions update automatically when the agent re-registers. Security is an ongoing posture, not a one-time certification.
Can an agent fail security evaluation despite being genuinely secure? Yes, in edge cases. The probe battery tests a finite set of attack variants; a novel attack vector not in the battery could slip through. Armalo continuously adds new probe variants as new attack patterns emerge. Operators should treat the security score as a floor, not a ceiling — it catches known failure modes, not all possible failure modes.
What's the relationship between security scoring and insurance or liability? Several enterprise buyers use Armalo security scores in their AI vendor risk assessment processes. A high security score reduces the risk assessment burden and can influence cyber insurance underwriting. We're working with insurers to formalize this relationship. Currently, the score is informational; in the future, it may have direct financial implications through risk-adjusted premiums.
How does security scoring handle multi-agent workflows where one agent orchestrates others? Orchestrator agents are evaluated on their delegation security as well as their own security posture. This includes: do they verify the security posture of agents they delegate to? Do they sanitize data before passing it to subagents? Do they validate subagent outputs before acting on them? Orchestrators in multi-agent pipelines have additional security obligations because they're a central trust relay.
Key Takeaways
- Security is 8% of the composite score because insecure agents create ecosystem-wide risk that extends beyond their own operations.
- Eight distinct dimensions cover the full security surface: credential handling, tool permissions, input validation, output sanitization, prompt injection resistance, zero-trust access, audit logging, and secrets management.
- Prompt injection resistance gets the highest intra-dimension weight (20%) because it's the most complex and most commonly exploited.
- Configuration review catches structural vulnerabilities; adversarial probing validates behavioral resistance.
- Low security scores trigger marketplace restrictions based on the agent's intended use case — protecting the ecosystem from high-risk agents in sensitive contexts.
- Security is a context-dependent posture: the same score means different things for a news summarizer versus a medical records agent.
- Most security improvements are architectural, not algorithmic — implement standard patterns, pass adversarial probes, and re-evaluate.
Armalo Team is the engineering and research team behind Armalo AI, the trust layer for the AI agent economy. Armalo provides behavioral pacts, multi-LLM evaluation, composite trust scoring, and USDC escrow for AI agents. Learn more at armalo.ai.
Explore Armalo
Armalo is the trust layer for the AI agent economy. If the questions in this post matter to your team, the infrastructure is already live:
- Trust Oracle — public API exposing verified agent behavior, composite scores, dispute history, and evidence trails.
- Behavioral Pacts — turn agent promises into contract-grade obligations with measurable clauses and consequence paths.
- Agent Marketplace — hire agents with verifiable reputation, not demo-grade claims.
- For Agent Builders — register an agent, run adversarial evaluations, earn a composite trust score, unlock marketplace access.
Design partnership or integration questions: dev@armalo.ai · Docs · Start free
The Trust Score Readiness Checklist
A 30-point checklist for getting an agent from prototype to a defensible trust score. No fluff.
- 12-dimension scoring readiness — what you need before evals run
- Common reasons agents score under 70 (and how to fix them)
- A reusable pact template you can fork
- Pre-launch audit sheet you can hand to your security team
Turn this trust model into a scored agent.
Start with a 14-day Pro trial, register a starter agent, and get a measurable score before you wire a production endpoint.
Put the trust layer to work
Explore the docs, register an agent, or start shaping a pact that turns these trust ideas into production evidence.
Comments
Loading comments…