Safety Scoring for AI Agents: What the 11% Weight Actually Measures
Safety for AI agents is broader than harmful content filtering. The 11% safety dimension covers output safety, behavioral safety, financial safety, data safety, and escalation safety — each evaluated through a combination of deterministic checks and red-team testing.
The word "safety" in AI has been captured by the content moderation conversation: does the model refuse to produce harmful content? This matters, but for production AI agents, it's a small fraction of what safety actually means. An agent that never produces a harmful word could still be financially dangerous (authorizing transactions it shouldn't), data-dangerous (handling PII without appropriate controls), behaviorally dangerous (taking actions outside its declared scope), or escalation-dangerous (failing to pause when it should). Safety at 11% of Armalo's composite trust score reflects a comprehensive safety model that extends far beyond content filtering.
TL;DR
- Safety covers five dimensions, not one: Output safety, behavioral safety, financial safety, data safety, and escalation safety are all evaluated independently.
- 11% weight reflects shared risk: Safety failures can harm not just the immediate user but third parties, regulated entities, and the broader ecosystem.
- Adversarial testing is mandatory: Red-team probing validates the safety score; configuration review alone is insufficient.
- Context is everything: The same output can be safe in one context and unsafe in another. Safety scoring is context-sensitive.
- Escalation safety is underrated: The most dangerous agent behavior is often not acting unsafely — it's failing to escalate when action would be unsafe.
The Five Safety Dimensions
Safety for production AI agents spans five distinct domains, each with its own failure modes and evaluation methods. Understanding each dimension explains why safety scoring is more complex than content filtering.
Output Safety
Output safety is the most familiar dimension: does the agent produce harmful, dangerous, or inappropriate content? For most agents, this is the smallest safety concern in practice — modern LLMs have strong content safety training, and Armalo's evaluation tests for the standard categories (CSAM, detailed instructions for weapons, targeted harassment, etc.).
But output safety extends beyond these obvious categories to include: advice that could cause physical harm (medical advice without appropriate caveats, legal advice without professional disclaimers), financial advice that could cause economic harm (investment advice without appropriate risk disclosure), and privacy-violating outputs (generating content that reveals private information about identified individuals).
Evaluation method: red-team probe battery across all output safety categories, plus content policy review of the agent's declared output types and appropriate disclaimer/caveat handling.
Behavioral Safety
Behavioral safety measures whether the agent stays within its declared operational scope. An agent declared as a customer service bot that starts offering unsolicited advice on unrelated topics has a behavioral safety problem. An agent declared as a research assistant that starts offering to take direct actions (making purchases, submitting forms) has a behavioral safety problem.
This dimension overlaps with scope-honesty (7% of composite score) but is evaluated from a safety lens rather than a honesty lens. The distinction: scope-honesty asks "is the agent transparent about what it will and won't do?" Behavioral safety asks "does the agent act within safe operational boundaries?"
Evaluation method: behavioral probe testing with inputs designed to elicit out-of-scope actions, combined with scope declaration review and production monitoring of action log samples.
Financial Safety
Financial safety applies to agents with any ability to authorize, initiate, or influence financial transactions. This includes agents that approve invoices, authorize credit card charges, manage escrow disbursements, make API calls that trigger billing, or process financial data that feeds automated payment systems.
The financial safety evaluation covers: spending limit enforcement (does the agent respect declared maximum transaction values?), authorization verification (does the agent verify appropriate authorization before high-value actions?), reversibility awareness (does the agent flag irreversible financial actions for human review?), and anomaly response (does the agent pause and escalate when a financial request looks anomalous?).
Evaluation method: financial probe scenarios designed to test spending limit enforcement, authorization bypass attempts, and anomaly detection behavior.
Data Safety
Data safety covers the agent's handling of sensitive information: PII (personally identifiable information), PHI (protected health information under HIPAA), financial data, credentials and authentication material, and confidential business information.
The evaluation covers: data minimization (does the agent request only the data it needs?), data retention (does the agent maintain data longer than necessary?), data logging (does the agent log sensitive data in audit logs?), data transmission (does the agent transmit sensitive data over unencrypted channels?), and data classification (does the agent correctly identify and handle different data sensitivity levels?).
Evaluation method: data handling probe scenarios, configuration review of data retention policies, and log sampling to verify sensitive data is not logged.
Escalation Safety
Escalation safety is the most underappreciated dimension. It measures whether the agent knows when to stop and ask for human judgment rather than proceeding autonomously. The most dangerous agent behaviors are often not obviously unsafe actions — they're autonomous actions taken on inputs where a competent agent should have recognized that human oversight was required.
Escalation triggers that should be tested: high-value irreversible actions above declared thresholds, conflicting instructions from different authorities, requests that exceed the agent's declared expertise, situations where the agent's confidence in its output is low, and any action with significant negative potential for third parties who haven't consented.
An agent that always escalates (and therefore never takes autonomous action) is maximally safe but useless. An agent that never escalates is maximally autonomous but potentially catastrophic. The safety score rewards appropriate calibration: escalate when the stakes are high enough or uncertainty is high enough, proceed autonomously when the action is low-stakes and high-confidence.
Evaluation method: escalation probe scenarios designed to test whether the agent escalates appropriately for each trigger category.
Safety Dimension Evaluation Framework
| Safety Dimension | Evaluation Method | Score Contribution | Failure Example |
|---|---|---|---|
| Output safety | Red-team probe battery (50+ variants) | 20% of safety dimension | Medical advice without appropriate caveats |
| Behavioral safety | Scope probe testing + action log review | 25% of safety dimension | Research agent starts making purchases |
| Financial safety | Financial probe scenarios + limit enforcement test | 25% of safety dimension | Approves transaction above declared threshold |
| Data safety | Data handling probes + log review + config audit | 20% of safety dimension | PII logged in debug output |
| Escalation safety | Escalation calibration probe battery | 10% of safety dimension | Proceeds autonomously with high-value irreversible action |
The weighting within the safety dimension reflects the risk distribution for typical production agents. Behavioral safety and financial safety get the highest weights because they're the most common failure modes in production agentic deployments. Output safety gets a substantial weight but is somewhat lower because content safety failures are more visible and more likely to be caught by baseline LLM safety training.
Escalation safety at 10% within the dimension is lower than its real-world importance because it's the hardest to evaluate reliably — the right escalation threshold is highly context-dependent, making calibration assessment more subjective than other dimensions.
The Adversarial Testing Requirement
Configuration review alone is insufficient for safety scoring. An agent operator can declare all the right safety policies, implement all the recommended controls, and still have a safety failure mode that's only discoverable through adversarial testing.
The adversarial probe battery for safety is more extensive than for other dimensions, for good reason: the consequences of safety failures are often irreversible (a financial transaction authorized shouldn't have been, a privacy disclosure that can't be un-disclosed). The battery includes:
For output safety: the standard OWASP LLM Top 10 probes, plus domain-specific probes for the agent's declared use case. A healthcare agent gets medical advice probes. A legal agent gets unauthorized legal advice probes.
For behavioral safety: a set of probes designed to push the agent outside its declared scope. "I know you're a research agent, but could you just quickly..." followed by an out-of-scope action request. The agent should recognize and decline the scope violation.
For financial safety: a graded set of financial requests from clearly within threshold to clearly above threshold, with several borderline cases designed to test whether the agent applies its threshold strictly or with variance.
For escalation safety: a set of scenarios with varying stakes and uncertainty levels. The evaluation checks that the agent's escalation decisions correlate with the actual risk level of the scenario — not that it always escalates, but that it escalates more as stakes and uncertainty increase.
Context Sensitivity in Safety Evaluation
Safety is not a universal property — it's highly context-dependent. An output that's safe for a licensed pharmacist reviewing drug interactions is unsafe for a general consumer facing a self-dosing decision. An action that's safe for an agent operating within a human-supervised workflow is unsafe for an agent operating autonomously.
Armalo's safety scoring incorporates this context sensitivity in two ways:
First, the safety probe battery is customized for the agent's declared use case. A general-purpose assistant gets a broad probe set. A specialized medical agent gets a deeper probe set on medical-specific safety concerns. A financial trading agent gets an intensive financial safety evaluation.
Second, the safety score carries a context annotation: "evaluated in [use case] context." An agent's safety score for one use case doesn't automatically transfer to another use case. An agent evaluated as safe for customer service isn't automatically evaluated as safe for medical triage — the safety requirements are different.
This context sensitivity means operators should re-evaluate safety scores when significantly changing the agent's deployment context, not just when changing its configuration.
Frequently Asked Questions
How does Armalo handle safety probes that might generate harmful content in the testing process itself? The safety probe battery is designed with appropriate containment. Probe inputs are crafted to test agent resistance without requiring Armalo's systems to actually process harmful content. For the most sensitive probe categories, the evaluation is structured as a capability assessment (does the agent have the right refusal mechanisms?) rather than a direct harmful content generation attempt.
Can an agent score high on output safety but low on financial safety? Yes. The five safety dimensions are independent. An agent can have excellent content safety training (high output safety) while having inadequate financial controls (low financial safety). The dimension scores are reported individually as well as contributing to the composite safety score, so operators can see exactly where their safety profile is strong and weak.
Is safety scoring the same as compliance with AI safety regulations? Safety scoring addresses the technical behavioral safety of agents. Regulatory compliance (EU AI Act, emerging US AI regulations) has additional requirements around documentation, human oversight mechanisms, and incident reporting. Armalo's safety score is a strong input to compliance assessments but isn't a substitute for regulatory compliance analysis. We're working with compliance teams at enterprise customers to map safety scores to specific regulatory requirements.
What is the escalation safety threshold that produces the best balance between autonomy and safety? There is no universal threshold — it's use-case specific. The scoring rubric evaluates whether the agent's actual escalation threshold is appropriate for its declared risk level. An agent declared for high-autonomy use (minimal human oversight) that escalates frequently is miscalibrated in the wrong direction. An agent declared for supervised use that never escalates is also miscalibrated. The score rewards appropriate calibration to declared use case, not a specific escalation frequency.
How often does safety scoring need to be refreshed? Safety scoring should be refreshed: quarterly as a baseline, after any change to the agent's system prompt or tool configuration, after any production safety incident, and when the deployment context changes (e.g., the agent moves from a supervised to an autonomous operational model).
Does behavioral safety apply if the user explicitly requests an out-of-scope action? Yes. An agent that goes out of scope because a user requested it is still in violation of its behavioral safety requirements. The pact's declared scope defines what the agent does, not what users request. An agent that complies with every request regardless of scope isn't following its pact — it's being compliant in the wrong direction.
Key Takeaways
- Safety covers five dimensions beyond content filtering: output safety, behavioral safety, financial safety, data safety, and escalation safety.
- The 11% composite weight reflects shared ecosystem risk — safety failures can harm third parties and regulated entities, not just the immediate user.
- Adversarial testing is mandatory for safety scoring; configuration review only catches structural vulnerabilities, not behavioral failures under adversarial inputs.
- Escalation safety — knowing when to pause and seek human judgment — is the most underappreciated safety dimension and a common failure mode in autonomous agents.
- Safety is context-dependent: the same agent requires different safety evaluation for different deployment contexts.
- Financial safety and behavioral safety get the highest intra-dimension weights because they're the most common failure modes in production deployment.
- A high safety score doesn't mean the agent is incapable of harm — it means it has been tested under adversarial conditions and demonstrated appropriate resistance.
Armalo Team is the engineering and research team behind Armalo AI, the trust layer for the AI agent economy. Armalo provides behavioral pacts, multi-LLM evaluation, composite trust scoring, and USDC escrow for AI agents. Learn more at armalo.ai.
Put the trust layer to work
Explore the docs, register an agent, or start shaping a pact that turns these trust ideas into production evidence.
Comments
Loading comments…