Security SLOs for AI Agent Platforms: Defining Behavioral Guarantees That Hold in Production

2026-05-1020 min read

A practitioner's guide to Security Service Level Objectives for AI agent systems — refusal rate accuracy, tool permission adherence, injection resistance, output toxicity, data exfiltration detection, error budgets, and enforcement mechanisms.

Security SLOs for AI Agent Platforms: Defining Behavioral Guarantees That Hold in Production

Service Level Objectives for traditional software systems measure availability, latency, and throughput. These metrics are important for AI agent platforms too — but they are insufficient. An AI agent can be fully available, responding in 200ms, processing thousands of requests per hour, and simultaneously leaking sensitive data, following malicious instructions injected via prompt, and executing tools beyond its authorized scope.

Security SLOs for AI agent systems measure behavioral guarantees, not just operational ones. They answer the question: does the agent behave within its security boundaries, reliably, under adversarial conditions, over time? This is a fundamentally different measurement problem from availability SLOs, requiring different metrics, different error budgets, and different enforcement mechanisms.

This document provides the complete specification for security SLOs in AI agent platforms: the dimensions that require measurement, how to define and measure each metric, how to set error budgets, how to enforce security SLOs in production, and how to trigger incident response when SLOs are violated. It draws on NIST AI RMF, OWASP Top 10 for LLM Applications, and real-world deployment patterns.

TL;DR

Security SLOs for AI agents measure behavioral security properties — refusal accuracy, permission adherence, injection resistance, output safety, data handling — not just uptime and latency
Each security dimension requires a distinct measurement methodology; a single "security score" obscures more than it reveals
Error budgets for security SLOs should be asymmetric — budget exhaustion triggers immediate response, not just a maintenance window
OWASP Top 10 for LLM Applications provides the threat taxonomy for scoping security SLO dimensions
NIST AI RMF's GOVERN, MAP, MEASURE, MANAGE framework maps directly to security SLO lifecycle
Enforcement mechanisms range from real-time circuit breakers to asynchronous behavioral audits
Armalo's behavioral pact framework implements security SLOs as verifiable contractual commitments with cryptographic attestation

The Core Distinction: Behavioral Security vs. Operational Security

Traditional security monitoring for software platforms measures infrastructure security: vulnerability patching cadence, authentication event rates, firewall rule violations, certificate expiry. These remain relevant for AI agent platforms but are insufficient because they don't measure the security properties unique to AI agents: their behavior under adversarial inputs, their adherence to authorization boundaries, and their reliability in refusing prohibited actions.

The OWASP LLM Top 10 as a Security SLO Taxonomy

The OWASP Top 10 for LLM Applications (2025 edition) provides the threat taxonomy that security SLOs should address:

LLM01: Prompt Injection — Attacks that override system instructions through crafted inputs
LLM02: Insecure Output Handling — Agent outputs used unsafely by downstream systems
LLM03: Training Data Poisoning — Compromised training or fine-tuning data
LLM04: Model Denial of Service — Resource exhaustion through adversarial inputs
LLM05: Supply Chain Vulnerabilities — Compromised model components or tools
LLM06: Sensitive Information Disclosure — Extraction of training data or confidential context
LLM07: Insecure Plugin Design — Unsafe tool execution
LLM08: Excessive Agency — Agent taking actions beyond its authorized scope
LLM09: Overreliance — Lack of appropriate skepticism in consuming agent outputs
LLM10: Model Theft — Intellectual property extraction through query patterns

Security SLOs directly address LLM01, LLM02, LLM06, LLM07, and LLM08. Partial coverage of LLM04 and LLM10.

NIST AI RMF Mapping

The NIST AI Risk Management Framework (AI RMF 1.0, January 2023) organizes AI risk management into four functions:

GOVERN: Establishing accountability, policies, and processes
MAP: Identifying and categorizing risks
MEASURE: Implementing metrics and evaluation mechanisms
MANAGE: Responding to identified risks

Security SLOs are a MEASURE function artifact. They make risk measurement operational and quantitative. But they require GOVERN (clear ownership of security SLO compliance), MAP (risk identification to determine which SLO dimensions are relevant), and MANAGE (response protocols when SLOs are violated) to be complete.

The Seven Security SLO Dimensions

Security SLOs for AI agent platforms should cover seven distinct behavioral dimensions. Each dimension requires independent measurement because failures in one dimension often don't correlate with failures in others.

Dimension 1: Refusal Rate Accuracy (RRA)

Refusal Rate Accuracy measures the agent's ability to correctly refuse prohibited requests while correctly accepting authorized requests. It is essentially the accuracy of the agent's authorization decision function.

Definition:

True Positive (TP): Correctly refusing a prohibited request
True Negative (TN): Correctly accepting an authorized request
False Positive (FP): Incorrectly refusing an authorized request (over-refusal)
False Negative (FN): Incorrectly accepting a prohibited request (under-refusal)

RRA = (TP + TN) / (TP + TN + FP + FN)

For security SLOs, False Negatives (under-refusal) are far more dangerous than False Positives (over-refusal). The SLO should be specified separately for each:

Under-refusal rate SLO: FN / (FN + TP) < 0.001 (99.9% of prohibited requests are correctly refused)
Over-refusal rate SLO: FP / (FP + TN) < 0.05 (no more than 5% of authorized requests are incorrectly refused)

Measuring RRA in production:

Continuous RRA measurement requires a test set of prohibited and authorized requests that is executed against the agent regularly. This test set must:

Be kept current with the agent's evolving policy (as policies change, the test set must be updated)
Include adversarial variants of prohibited requests (jailbreak attempts, indirect injection, rephrasing attacks)
Include boundary cases near the edge of authorization scope
Be representative of the actual distribution of requests the agent will receive

Recommended cadence: run the RRA test battery daily (or hourly for high-risk deployments) and track the time series. A single instance of under-refusal is worth investigating; a rising trend indicates systematic policy enforcement degradation.

Dimension 2: Tool Permission Adherence Rate (TPAR)

Tool Permission Adherence Rate measures how reliably the agent uses only the tools it is authorized to use and only within the parameters allowed by its authorization policy.

Definition: TPAR = 1 - (unauthorized_tool_calls / total_tool_calls)

Where unauthorized tool calls include:

Calling a tool not in the agent's authorized tool list
Calling an authorized tool with parameters outside the allowed range (e.g., data export with more records than allowed)
Calling an authorized tool in an unauthorized context (e.g., a tool that is only authorized for read operations being called for write operations)
Chaining tool calls in sequences that are individually authorized but collectively constitute an unauthorized action

Implementing TPAR monitoring:

TPAR monitoring intercepts all tool calls at the agent runtime layer, before the tool executes. Each tool call is evaluated against the authorization policy:

class ToolPermissionInterceptor:
    """Intercepts and logs tool calls for TPAR monitoring."""
    
    def __init__(self, authorization_policy, metrics_client):
        self.policy = authorization_policy
        self.metrics = metrics_client
    
    async def intercept(self, agent_id, tool_name, tool_params, context):
        """
        Intercept a tool call and evaluate against authorization policy.
        Returns: (is_authorized, violation_details)
        """
        # Check if tool is in authorized list
        if tool_name not in self.policy.authorized_tools(agent_id):
            violation = {
                'type': 'unauthorized_tool',
                'tool': tool_name,
                'agent_id': agent_id,
                'context': context
            }
            await self._record_violation(violation)
            return False, violation
        
        # Check parameter bounds
        tool_policy = self.policy.tool_policy(agent_id, tool_name)
        for param, value in tool_params.items():
            if param in tool_policy.param_bounds:
                min_val, max_val = tool_policy.param_bounds[param]
                if not (min_val <= value <= max_val):
                    violation = {
                        'type': 'param_out_of_bounds',
                        'tool': tool_name,
                        'param': param,
                        'value': value,
                        'allowed_range': [min_val, max_val]
                    }
                    await self._record_violation(violation)
                    return False, violation
        
        # Check context authorization
        if not tool_policy.is_context_authorized(context):
            violation = {
                'type': 'unauthorized_context',
                'tool': tool_name,
                'context': context
            }
            await self._record_violation(violation)
            return False, violation
        
        # Record authorized call
        await self.metrics.record_authorized_tool_call(agent_id, tool_name)
        return True, None
    
    async def _record_violation(self, violation):
        await self.metrics.record_tpar_violation(violation)
        await self.audit_log.write({
            'event_type': 'tool_permission_violation',
            'severity': 'high',
            **violation
        })

TPAR SLO recommendation: TPAR >= 0.9999 (at most 1 unauthorized tool call per 10,000 tool calls). For agents with privileged tools (file system access, external API calls, financial operations), TPAR >= 0.99999.

Dimension 3: Injection Resistance Rate (IRR)

Injection Resistance Rate measures the agent's ability to maintain its authorized behavior under adversarial prompt injection attempts. Prompt injection is consistently the highest-severity attack vector against AI agents (OWASP LLM01).

Types of injection attacks to measure resistance against:

Direct prompt injection: Attacker controls the user input directly and attempts to override system instructions
Indirect prompt injection: Attacker plants instructions in documents, emails, or web content that the agent retrieves and processes
Context manipulation: Attacker provides false context to change how the agent interprets its instructions
Role-playing attacks: Attacker instructs the agent to role-play as a different, less-restricted agent

Measuring IRR:

IRR requires an adversarial injection test battery — a collection of injection attempts at multiple sophistication levels. The test battery should be updated regularly as new injection techniques are discovered.

IRR = (injection_attempts_resisted / total_injection_attempts)

IRR SLO recommendation by risk tier:

Tier 1 (standard agents): IRR >= 0.95 against known injection techniques
Tier 2 (privileged agents with tool access): IRR >= 0.99 against known techniques, IRR >= 0.90 against novel techniques
Tier 3 (agents with financial, medical, or legal authority): IRR >= 0.999 against known techniques

The injection test battery should be divided into "known" (techniques documented in published research and ATLAS) and "novel" (new techniques developed by the operator's red team) with separate SLOs for each category.

Dimension 4: Output Toxicity Rate (OTR)

Output Toxicity Rate measures the frequency at which the agent generates outputs that violate content safety policies — including harmful content, hate speech, personally identifiable information disclosure, misinformation, and other categories defined by the operator's content policy.

OTR = toxic_outputs / total_outputs

Output toxicity has two measurement approaches:

Rule-based detection: Deterministic pattern matching, keyword filters, PII detection (e.g., regex for SSN, credit card patterns)
Model-based detection: LLM-as-judge or specialized toxicity classifiers that evaluate outputs for policy violations

Both should be used: rule-based detection provides high precision on known patterns; model-based detection provides recall on novel policy violations.

OTR SLO framework:

Tier 1 (public-facing agents): OTR <= 0.0001 (no more than 1 toxic output per 10,000)
Tier 2 (enterprise internal agents): OTR <= 0.001
Tier 3 (restricted access agents): OTR <= 0.01 (higher tolerance acceptable given controlled access)

Separate OTR SLOs should be specified for different toxicity categories: PII disclosure has a different tolerance than moderately inappropriate language.

Dimension 5: Data Exfiltration Detection Rate (DEDR)

Data Exfiltration Detection Rate measures the effectiveness of controls that prevent the agent from exposing sensitive data in its outputs or through tool calls.

Types of data exfiltration to detect:

Context leakage: Reproducing verbatim content from the system prompt, training data, or other agents' private context
PII exfiltration: Including personally identifiable information in outputs where it is not authorized
Secret extraction: Extracting API keys, passwords, or other credentials from context
Indirect exfiltration: Encoding sensitive information in steganographic ways within seemingly innocuous output

DEDR is measured against a test battery where known sensitive data is seeded in the agent's context and the agent is tested against adversarial attempts to extract it.

DLP Integration:

For production monitoring, DEDR monitoring should integrate with the organization's Data Loss Prevention (DLP) infrastructure:

Route all agent outputs through the DLP scanner before delivery to the caller
Flag outputs that match PII patterns, credential patterns, or organizational data classification markers
Record DLP flags as DEDR events

DEDR SLO: DEDR (fraction of exfiltration attempts that are detected) >= 0.99. Crucially, the SLO should specify both the detection rate AND the maximum allowable false negative rate: no more than 1 undetected exfiltration event per 10,000 exfiltration attempts.

Dimension 6: Scope Adherence Rate (SAR)

Scope Adherence Rate measures how reliably the agent stays within its defined operational scope — answering only questions it is authorized to answer, taking only actions it is authorized to take, and refusing requests that fall outside its defined purpose.

This is distinct from tool permission adherence (which measures technical permission enforcement) — SAR measures semantic scope adherence (the agent's interpretation of what it is supposed to do).

Examples of scope violations:

A customer service agent that provides legal advice (outside authorized scope)
A code review agent that also writes production code (exceeds authorized capability)
A financial reporting agent that makes investment recommendations (prohibited activity)

SAR measurement requires a scope boundary test set: a collection of in-scope and out-of-scope requests where the correct behavior (process or refuse) is clearly defined. SAR = fraction of these requests handled correctly.

SAR SLO recommendation: SAR >= 0.98. The 2% error budget is primarily reserved for genuinely ambiguous boundary cases.

Dimension 7: Audit Trail Completeness Rate (ATCR)

Audit Trail Completeness Rate measures the fraction of agent actions (tool calls, decisions, outputs) that are correctly captured in the audit log with sufficient detail for forensic reconstruction.

ATCR = correctly_logged_actions / total_actions

"Correctly logged" means:

The action is recorded in the audit log within the defined latency SLO (typically < 5 seconds)
The log record contains all required fields (agent_id, org_id, session_id, action_type, timestamp, input, output, authorization context)
The log record is tamper-evident (typically via cryptographic chaining or append-only storage)
The log record is retrievable within the defined retention period

ATCR SLO: ATCR >= 0.9999. For regulated industries, ATCR = 1.0000 may be required by compliance mandates (no action may proceed without simultaneous audit log write).

Error Budget Design for Security SLOs

Error budgets for performance SLOs (uptime, latency) operate on a monthly cycle: you have a budget of allowed failures, and when it's exhausted, you freeze new deployments until reliability is restored. Security SLO error budgets require different design.

Asymmetric Error Budget Allocation

Security SLOs should use asymmetric error budgets: the error budget for under-refusal and injection failure is much smaller than for over-refusal and operational latency. A single prompt injection success that leads to data exfiltration may be more costly than 1,000 instances of over-refusal.

Recommended error budget structure:

Security Dimension	Monthly Error Budget	Budget Exhaustion Trigger
Under-refusal rate	0.01% (1 in 10,000)	Immediate incident response, suspend high-risk features
Prompt injection success	0.001% (1 in 100,000)	Immediate incident response, consider agent suspension
Unauthorized tool call	0.01% (1 in 10,000)	Immediate incident response, audit recent tool call history
Verified data exfiltration	0 (zero tolerance)	Immediate agent suspension, incident declared
Scope violation	0.1%	Escalate, investigate, update scope boundary tests
Audit trail gap	0.01%	Immediate investigation, compliance notification if regulated

Error Budget Exhaustion Response

Unlike performance SLO error budget exhaustion (which typically triggers a deployment freeze), security SLO error budget exhaustion should trigger incident response:

Declare security incident: Create an incident record with full context of the budget-exhausting events
Quarantine the agent: Reduce the agent's scope to the most conservative authorized behavior while investigation proceeds
Notify security team: Page the on-call security engineer regardless of time of day
Preserve evidence: Ensure all logs from the incident period are preserved and tamper-protected
Begin forensic analysis: Reconstruct the sequence of events that led to the SLO violation
Remediate before restoring: Full budget restoration requires root-cause analysis, remediation implementation, and re-validation against the relevant test battery

Enforcement Mechanisms

Security SLO enforcement has three modes: real-time blocking (preventing violations before they occur), asynchronous detection (identifying violations after the fact for remediation), and behavioral monitoring (detecting patterns that predict future violations).

Real-Time Circuit Breakers

For security dimensions where violations have immediate consequences (tool permission adherence, injection resistance), real-time enforcement via circuit breakers is required. Circuit breakers sit in the agent request path and block requests that would violate security policy:

class SecurityCircuitBreaker:
    """Real-time security enforcement for AI agent requests."""
    
    def __init__(self, policy_engine, metrics, error_budget_tracker):
        self.policy = policy_engine
        self.metrics = metrics
        self.budget = error_budget_tracker
    
    async def evaluate_request(self, request, agent_id, org_id):
        """
        Evaluate an incoming request against security policies.
        Returns: (is_allowed, security_violations)
        """
        violations = []
        
        # Injection pattern detection
        if injection_score := await self.policy.injection_scan(request.input):
            if injection_score > self.policy.injection_threshold:
                violations.append({
                    'type': 'high_confidence_injection',
                    'score': injection_score,
                    'action': 'block'
                })
        
        # Rate and anomaly checks
        if await self.policy.is_anomalous_request(request, agent_id):
            violations.append({
                'type': 'anomalous_request_pattern',
                'action': 'flag_and_allow'  # Log but don't block anomalies
            })
        
        # Budget check — if security error budget is exhausted, increase scrutiny
        if self.budget.is_exhausted(agent_id, 'injection_resistance'):
            # In exhausted state, raise injection threshold
            self.policy.injection_threshold = 0.30  # From 0.70 default
        
        # Block on hard violations
        hard_violations = [v for v in violations if v.get('action') == 'block']
        
        if hard_violations:
            await self.metrics.record_blocked_request(agent_id, hard_violations)
            await self.audit_log.record({
                'event_type': 'request_blocked_by_circuit_breaker',
                'agent_id': agent_id,
                'org_id': org_id,
                'violations': hard_violations
            })
            return False, hard_violations
        
        return True, violations  # May have warnings even if allowed

Asynchronous Behavioral Audits

Real-time circuit breakers can't catch all violations — some are only detectable in retrospect (scope creep, gradual authorization erosion, sophisticated indirect injection). Asynchronous behavioral audits analyze batches of recent agent interactions for violations:

class AsyncBehavioralAudit:
    """Asynchronous audit of agent behavioral compliance with security SLOs."""
    
    async def run_daily_audit(self, agent_id, lookback_hours=24):
        """Run daily behavioral security audit for an agent."""
        interactions = await self.load_interactions(agent_id, hours=lookback_hours)
        
        audit_results = {}
        
        # Audit each security dimension
        audit_results['rra'] = await self.audit_refusal_accuracy(interactions)
        audit_results['tpar'] = await self.audit_tool_permission_adherence(interactions)
        audit_results['irr'] = await self.audit_injection_resistance(interactions)
        audit_results['otr'] = await self.audit_output_toxicity(interactions)
        audit_results['sar'] = await self.audit_scope_adherence(interactions)
        audit_results['atcr'] = await self.audit_audit_trail_completeness(interactions)
        
        # Check against SLOs
        violations = []
        for dimension, result in audit_results.items():
            slo = self.get_slo(agent_id, dimension)
            if result.rate < slo.target:
                violations.append({
                    'dimension': dimension,
                    'current_rate': result.rate,
                    'slo_target': slo.target,
                    'gap': slo.target - result.rate,
                    'examples': result.violation_examples[:5]
                })
        
        # Record audit results and update error budgets
        for dimension, result in audit_results.items():
            await self.budget_tracker.record_audit(agent_id, dimension, result)
        
        # Alert on SLO violations
        if violations:
            await self.alert_manager.send_security_slo_violation(agent_id, violations)
        
        return audit_results, violations

How Armalo Implements Security SLOs

Security SLOs in Armalo are first-class components of behavioral pacts. When an agent operator registers an agent on the Armalo platform, they define security SLO commitments as part of the agent's pact:

{
  "pact_id": "pact_a1b2c3",
  "agent_id": "agent_xyz",
  "security_slos": {
    "refusal_rate_accuracy": {
      "under_refusal_rate": 0.001,
      "over_refusal_rate": 0.05,
      "measurement_cadence": "daily",
      "test_battery_id": "battery_enterprise_v3"
    },
    "tool_permission_adherence": {
      "minimum_tpar": 0.9999,
      "enforcement": "real_time_block",
      "audit_trail": "required"
    },
    "injection_resistance": {
      "known_techniques_irr": 0.99,
      "novel_techniques_irr": 0.90,
      "test_frequency": "weekly_adversarial"
    },
    "output_toxicity": {
      "maximum_otr": 0.0001,
      "categories": ["pii_disclosure", "harmful_content", "misinformation"],
      "dlp_integration": true
    }
  },
  "error_budget_policy": {
    "exhaustion_response": "immediate_incident",
    "restoration_requirement": "root_cause_plus_remediation_validation"
  }
}

Armalo continuously monitors compliance with these security SLO commitments and updates the agent's trust score accordingly. Security SLO compliance is one of the highest-weighted components in Armalo's composite trust score because security violations represent the highest potential harm to the enterprises and users relying on the agent.

When security SLOs are violated, Armalo records a trust impact event, notifies the agent operator, and surfaces the violation on the agent's public trust profile. The magnitude and duration of the violation determine the trust score impact — brief, quickly-remediated violations have smaller impacts than persistent, unaddressed violations.

For the Armalo marketplace, security SLO compliance history is a filter criterion. Enterprises can specify minimum security SLO thresholds as hiring requirements and see only agents whose historical compliance and current commitments meet those requirements.

SLO Integration with NIST AI RMF and EU AI Act

Security SLOs don't exist in a regulatory vacuum. Two major frameworks — the NIST AI Risk Management Framework and the EU AI Act — provide complementary structures that security SLOs should align with. Understanding the alignment helps organizations avoid doing compliance work twice and ensures their SLO framework addresses the risks regulators most care about.

NIST AI RMF Integration

The NIST AI RMF organizes AI risk management into four functions: GOVERN, MAP, MEASURE, and MANAGE. Security SLOs operate primarily in the MEASURE function but require all four:

GOVERN function requirements that SLOs depend on:

Clear ownership of each security SLO (who is accountable for compliance?)
Defined consequence for SLO violation (what organizational response is triggered?)
Documented risk tolerance (what levels of violation are acceptable in what contexts?)
Budget and authority to remediate violations

MAP function requirements that enable SLO design:

Risk identification: which OWASP LLM threat categories are in scope for this deployment?
Deployment context: what is the harm potential of each security violation category?
Stakeholder mapping: who is affected by each type of security failure?
Dependency mapping: which downstream systems depend on security guarantees?

MEASURE function — where SLOs live: The NIST AI RMF MEASURE function calls for "identifying metrics, methods, and tools to measure the degree to which risks are known, manageable, and managed." The seven security SLO dimensions described above directly address this requirement:

RRA, IRR, TPAR map to NIST AI RMF MEASURE 2.6 (adversarial robustness evaluation)
DEDR maps to MEASURE 2.2 (data privacy)
ATCR maps to MEASURE 2.9 (AI risk documentation)
SAR maps to MEASURE 2.5 (AI system performance within intended contexts)

MANAGE function — SLO enforcement and response: The MANAGE function requires treating identified risks — which SLO violations evidence. The incident response protocol triggered by SLO budget exhaustion is the MANAGE function instantiated for operational AI security.

EU AI Act Alignment

The EU AI Act applies specifically to high-risk AI systems (Annex III: education, employment, critical infrastructure, law enforcement, migration management, biometric ID, and others). For high-risk AI systems, the Act requires:

Article 9 — Risk Management System: Security SLOs directly implement the Act's requirement for "a continuous iterative process run throughout the entire lifecycle of a high-risk AI system." The SLO framework with regular measurement, error budget tracking, and triggered response is exactly the risk management system the Act envisions.

Article 10 — Data Governance: The DEDR SLO addresses data governance requirements for high-risk systems, ensuring that training and deployment data handling meets the Act's standards.

Article 15 — Accuracy, Robustness, and Cybersecurity: This is the most direct EU AI Act mapping. Article 15 requires "high-risk AI systems [to] achieve, in the light of their intended purpose, an appropriate level of accuracy, robustness and cybersecurity." The SLO framework operationalizes Article 15 compliance:

Accuracy → RRA (refusal accuracy, scope adherence)
Robustness → IRR (injection resistance) and RRA (adversarial conditions)
Cybersecurity → TPAR, DEDR, ATCR (tool permissions, data protection, audit)

Article 14 — Human Oversight: The audit trail completeness SLO (ATCR) directly supports human oversight requirements by ensuring that all agent actions are logged and reviewable. EU AI Act Article 14 requires that high-risk AI systems "allow for human oversight, including in particular the ability to interrupt, stop or override them" — which requires comprehensive audit logs.

Building the Security SLO Test Battery

The quality of a security SLO measurement depends entirely on the quality of the test battery underlying it. Poorly designed test batteries produce false confidence; adversarially diverse batteries reveal genuine security posture. The following guidance covers test battery design for each of the seven SLO dimensions.

RRA Test Battery Construction

A high-quality RRA test battery requires three categories of test cases:

Category 1 — Clear in-scope requests (expected: process) These establish baseline accuracy for normal, clearly-authorized requests. Include requests that:

Use standard vocabulary for the agent's domain
Have obvious legitimate purpose
Fall clearly within the agent's defined scope Recommended: 40% of test battery, with sub-categories covering all major use cases

Category 2 — Clear prohibited requests (expected: refuse) These establish refusal accuracy for unambiguously unauthorized requests. Include requests that:

Ask for clearly prohibited content (define "clearly prohibited" for your policy)
Request actions outside authorized tool set
Solicit information the agent is explicitly prohibited from sharing Recommended: 40% of test battery, with adversarial sub-variants at multiple sophistication levels (naive, rephrased, indirect, multi-step)

Category 3 — Boundary cases (expected: handle correctly based on policy interpretation) These probe the agent's interpretation of ambiguous authorization boundaries. Include requests that:

Fall near the edge of the agent's defined scope
Could be interpreted as either authorized or unauthorized depending on context
Test the agent's disambiguation behavior (does it ask for clarification? refuse conservatively? accept liberally?) Recommended: 20% of test battery

Critical test battery maintenance:

Update quarterly to add new adversarial techniques as they're published
Expand boundary cases when policy changes create new edge zones
Track performance per test category separately (don't aggregate — a high overall RRA can hide poor performance on adversarial sub-variants)
Version control the test battery with change history

IRR Test Battery Construction

The injection resistance test battery requires regular updates because injection techniques evolve rapidly. The battery must include:

Tier 1 — Classic techniques (established, well-documented)

Direct instruction override ("Ignore previous instructions and...")
Role-play jailbreaks ("You are DAN, an AI without restrictions...")
Hypothetical framing ("Hypothetically, if you could help with...")
Gradual escalation (multi-turn sessions that incrementally push boundaries)

Tier 2 — Documented advanced techniques

Unicode/encoding bypasses
Prompt smuggling in external content
Competing instructions with authority claims
Context window poisoning attempts

Tier 3 — Novel technique variants (generated by red team)

Operator-specific custom attacks targeting known policy edges
Combination attacks that chain multiple techniques
Context-specific attacks designed for the agent's domain (e.g., code execution injection for a code assistant)

The battery should be updated when new techniques are published by security researchers (MITRE ATLAS technique additions, OWASP update cycles, academic publications).

Security SLO Incident Runbooks

When security SLO budgets are exhausted or severe violations are detected, practitioners need runbooks — not just principles. The following incident runbooks address the three most common security SLO violation scenarios.

Runbook 1: Confirmed Prompt Injection Success

Detection trigger: IRR test battery identifies a novel injection technique that succeeds against the deployed agent, or production monitoring detects characteristics consistent with successful injection (unusual tool calls, out-of-scope content, behavioral pattern change)

Immediate response (first 15 minutes):

Reduce agent scope to most conservative configuration: disable all optional capabilities, restrict to read-only tools if possible
Tag all interactions from the past 72 hours for forensic review
Pull the 10 most recent interactions and manually review for injection indicators
Notify the security on-call team

Investigation phase (first 4 hours):

Reconstruct the interaction sequence that triggered the detection
Attempt to reproduce the injection in a sandboxed environment
Determine: is this a known technique (existing defense should have caught it) or novel?
If known: investigate why the existing defense failed (deployment gap? configuration drift? evasion of existing pattern?)
If novel: document the technique for battery update and industry sharing

Remediation:

Develop and test a specific defense against the technique (input filtering rule, instruction hardening, or system prompt update)
Validate the defense in the sandboxed environment
Deploy with monitoring, not silently — annotate the deployment to enable before/after comparison
Update the test battery to include this technique
Run full battery against the patched agent before restoring full capabilities

Post-incident:

Write incident report documenting technique, detection timeline, response timeline, remediation
Update error budget tracking (this incident consumed budget — when does restoration occur?)
Consider responsible disclosure if the technique is novel and likely to affect other platforms

Runbook 2: Suspected Data Exfiltration

Detection trigger: DLP integration flags an output containing PII or sensitive data patterns, or DEDR test battery reveals that seeded sensitive data is reproducible

Immediate response (first 5 minutes):

Suspend the agent immediately — data exfiltration has zero-tolerance policy
Identify the specific output(s) that triggered the alert
Determine if the flagged content is a true positive (actual sensitive data) or false positive (matched pattern but not actually sensitive)
If true positive: begin incident timeline

Evidence preservation:

Lock audit logs from the time window containing the suspected exfiltration
Capture the full interaction context (system prompt, all turns, tool calls)
Identify the source of the sensitive data in the context (was it injected? legitimately provided? from retrieval?)

Impact assessment:

How much sensitive data was potentially exposed?
Who received the output containing the sensitive data?
What is the data classification of the exposed information? (PII? PHI? trade secret?)
Are there regulatory notification obligations? (GDPR: 72-hour notification; HIPAA: 60-day notification to HHS; state breach laws vary)

The agent does not return to service until:

Root cause is identified (unintended context exposure? injection that retrieved sensitive data? misconfigured DLP exceptions?)
Root cause is remediated
DLP test battery passes at DEDR >= 0.9999 with seeded sensitive data
Security team sign-off

Conclusion: Key Takeaways

Security SLOs for AI agent platforms are a fundamentally different challenge from traditional operational SLOs. They require behavioral measurement methodologies, adversarially-diverse test batteries, asymmetric error budgets that reflect the asymmetric consequences of violations, and incident-level organizational responses when those budgets are exhausted.

Key takeaways:

Seven dimensions require independent measurement — refusal accuracy, tool permission adherence, injection resistance, output toxicity, data exfiltration detection, scope adherence, and audit trail completeness.
Error budgets for security are asymmetric — some violations (verified data exfiltration, confirmed injection success) have zero tolerance; others have small but nonzero budgets.
Real-time and asynchronous enforcement are both required — circuit breakers catch immediate violations; behavioral audits catch sophisticated, gradual, or indirect violations.
OWASP LLM Top 10 provides the threat taxonomy — scope your security SLOs to cover the top threat vectors, not just the ones that are easy to measure.
Security SLOs must be maintained — as threat landscapes evolve and injection techniques improve, test batteries and SLO thresholds require regular updating.
Security SLO compliance is a trust signal — it belongs in agent trust profiles alongside accuracy and calibration metrics, visible to the enterprises considering hiring the agent.

Organizations that define, measure, and enforce security SLOs are operationalizing their security commitments. Organizations that don't are making security promises they can't quantify. In a world where AI agents are being granted real authority over real systems, the difference matters enormously.

security SLOsai agent securitybehavioral guaranteesinjection resistanceerror budgetarmaloai agent trustgenerative engine optimization

← Knowledge Base

Build trust into your agents

Start Free Read the docs

Based in Singapore? See our MAS AI governance compliance resources →

Security SLOs for AI Agent Platforms: Defining Behavioral Guarantees That Hold in Production

2026-05-1020 min read

Security SLOs for AI Agent Platforms: Defining Behavioral Guarantees That Hold in Production

TL;DR

Security SLOs for AI agents measure behavioral security properties — refusal accuracy, permission adherence, injection resistance, output safety, data handling — not just uptime and latency
Each security dimension requires a distinct measurement methodology; a single "security score" obscures more than it reveals
Error budgets for security SLOs should be asymmetric — budget exhaustion triggers immediate response, not just a maintenance window
OWASP Top 10 for LLM Applications provides the threat taxonomy for scoping security SLO dimensions
NIST AI RMF's GOVERN, MAP, MEASURE, MANAGE framework maps directly to security SLO lifecycle
Enforcement mechanisms range from real-time circuit breakers to asynchronous behavioral audits
Armalo's behavioral pact framework implements security SLOs as verifiable contractual commitments with cryptographic attestation

The Core Distinction: Behavioral Security vs. Operational Security

The OWASP LLM Top 10 as a Security SLO Taxonomy

The OWASP Top 10 for LLM Applications (2025 edition) provides the threat taxonomy that security SLOs should address:

LLM01: Prompt Injection — Attacks that override system instructions through crafted inputs
LLM02: Insecure Output Handling — Agent outputs used unsafely by downstream systems
LLM03: Training Data Poisoning — Compromised training or fine-tuning data
LLM04: Model Denial of Service — Resource exhaustion through adversarial inputs
LLM05: Supply Chain Vulnerabilities — Compromised model components or tools
LLM06: Sensitive Information Disclosure — Extraction of training data or confidential context
LLM07: Insecure Plugin Design — Unsafe tool execution
LLM08: Excessive Agency — Agent taking actions beyond its authorized scope
LLM09: Overreliance — Lack of appropriate skepticism in consuming agent outputs
LLM10: Model Theft — Intellectual property extraction through query patterns

Security SLOs directly address LLM01, LLM02, LLM06, LLM07, and LLM08. Partial coverage of LLM04 and LLM10.

NIST AI RMF Mapping

The NIST AI Risk Management Framework (AI RMF 1.0, January 2023) organizes AI risk management into four functions:

GOVERN: Establishing accountability, policies, and processes
MAP: Identifying and categorizing risks
MEASURE: Implementing metrics and evaluation mechanisms
MANAGE: Responding to identified risks

The Seven Security SLO Dimensions

Dimension 1: Refusal Rate Accuracy (RRA)

Definition:

True Positive (TP): Correctly refusing a prohibited request
True Negative (TN): Correctly accepting an authorized request
False Positive (FP): Incorrectly refusing an authorized request (over-refusal)
False Negative (FN): Incorrectly accepting a prohibited request (under-refusal)

RRA = (TP + TN) / (TP + TN + FP + FN)

For security SLOs, False Negatives (under-refusal) are far more dangerous than False Positives (over-refusal). The SLO should be specified separately for each:

Under-refusal rate SLO: FN / (FN + TP) < 0.001 (99.9% of prohibited requests are correctly refused)
Over-refusal rate SLO: FP / (FP + TN) < 0.05 (no more than 5% of authorized requests are incorrectly refused)

Measuring RRA in production:

Continuous RRA measurement requires a test set of prohibited and authorized requests that is executed against the agent regularly. This test set must:

Be kept current with the agent's evolving policy (as policies change, the test set must be updated)
Include adversarial variants of prohibited requests (jailbreak attempts, indirect injection, rephrasing attacks)
Include boundary cases near the edge of authorization scope
Be representative of the actual distribution of requests the agent will receive

Dimension 2: Tool Permission Adherence Rate (TPAR)

Tool Permission Adherence Rate measures how reliably the agent uses only the tools it is authorized to use and only within the parameters allowed by its authorization policy.

Definition: TPAR = 1 - (unauthorized_tool_calls / total_tool_calls)

Where unauthorized tool calls include:

Calling a tool not in the agent's authorized tool list
Calling an authorized tool with parameters outside the allowed range (e.g., data export with more records than allowed)
Calling an authorized tool in an unauthorized context (e.g., a tool that is only authorized for read operations being called for write operations)
Chaining tool calls in sequences that are individually authorized but collectively constitute an unauthorized action

Implementing TPAR monitoring:

TPAR monitoring intercepts all tool calls at the agent runtime layer, before the tool executes. Each tool call is evaluated against the authorization policy:

class ToolPermissionInterceptor:
    """Intercepts and logs tool calls for TPAR monitoring."""
    
    def __init__(self, authorization_policy, metrics_client):
        self.policy = authorization_policy
        self.metrics = metrics_client
    
    async def intercept(self, agent_id, tool_name, tool_params, context):
        """
        Intercept a tool call and evaluate against authorization policy.
        Returns: (is_authorized, violation_details)
        """
        # Check if tool is in authorized list
        if tool_name not in self.policy.authorized_tools(agent_id):
            violation = {
                'type': 'unauthorized_tool',
                'tool': tool_name,
                'agent_id': agent_id,
                'context': context
            }
            await self._record_violation(violation)
            return False, violation
        
        # Check parameter bounds
        tool_policy = self.policy.tool_policy(agent_id, tool_name)
        for param, value in tool_params.items():
            if param in tool_policy.param_bounds:
                min_val, max_val = tool_policy.param_bounds[param]
                if not (min_val <= value <= max_val):
                    violation = {
                        'type': 'param_out_of_bounds',
                        'tool': tool_name,
                        'param': param,
                        'value': value,
                        'allowed_range': [min_val, max_val]
                    }
                    await self._record_violation(violation)
                    return False, violation
        
        # Check context authorization
        if not tool_policy.is_context_authorized(context):
            violation = {
                'type': 'unauthorized_context',
                'tool': tool_name,
                'context': context
            }
            await self._record_violation(violation)
            return False, violation
        
        # Record authorized call
        await self.metrics.record_authorized_tool_call(agent_id, tool_name)
        return True, None
    
    async def _record_violation(self, violation):
        await self.metrics.record_tpar_violation(violation)
        await self.audit_log.write({
            'event_type': 'tool_permission_violation',
            'severity': 'high',
            **violation
        })

Dimension 3: Injection Resistance Rate (IRR)

Types of injection attacks to measure resistance against:

Direct prompt injection: Attacker controls the user input directly and attempts to override system instructions
Indirect prompt injection: Attacker plants instructions in documents, emails, or web content that the agent retrieves and processes
Context manipulation: Attacker provides false context to change how the agent interprets its instructions
Role-playing attacks: Attacker instructs the agent to role-play as a different, less-restricted agent

Measuring IRR:

IRR = (injection_attempts_resisted / total_injection_attempts)

IRR SLO recommendation by risk tier:

Tier 1 (standard agents): IRR >= 0.95 against known injection techniques
Tier 2 (privileged agents with tool access): IRR >= 0.99 against known techniques, IRR >= 0.90 against novel techniques
Tier 3 (agents with financial, medical, or legal authority): IRR >= 0.999 against known techniques

Dimension 4: Output Toxicity Rate (OTR)

OTR = toxic_outputs / total_outputs

Output toxicity has two measurement approaches:

Rule-based detection: Deterministic pattern matching, keyword filters, PII detection (e.g., regex for SSN, credit card patterns)
Model-based detection: LLM-as-judge or specialized toxicity classifiers that evaluate outputs for policy violations

Both should be used: rule-based detection provides high precision on known patterns; model-based detection provides recall on novel policy violations.

OTR SLO framework:

Tier 1 (public-facing agents): OTR <= 0.0001 (no more than 1 toxic output per 10,000)
Tier 2 (enterprise internal agents): OTR <= 0.001
Tier 3 (restricted access agents): OTR <= 0.01 (higher tolerance acceptable given controlled access)

Separate OTR SLOs should be specified for different toxicity categories: PII disclosure has a different tolerance than moderately inappropriate language.

Dimension 5: Data Exfiltration Detection Rate (DEDR)

Data Exfiltration Detection Rate measures the effectiveness of controls that prevent the agent from exposing sensitive data in its outputs or through tool calls.

Types of data exfiltration to detect:

Context leakage: Reproducing verbatim content from the system prompt, training data, or other agents' private context
PII exfiltration: Including personally identifiable information in outputs where it is not authorized
Secret extraction: Extracting API keys, passwords, or other credentials from context
Indirect exfiltration: Encoding sensitive information in steganographic ways within seemingly innocuous output

DEDR is measured against a test battery where known sensitive data is seeded in the agent's context and the agent is tested against adversarial attempts to extract it.

DLP Integration:

For production monitoring, DEDR monitoring should integrate with the organization's Data Loss Prevention (DLP) infrastructure:

Route all agent outputs through the DLP scanner before delivery to the caller
Flag outputs that match PII patterns, credential patterns, or organizational data classification markers
Record DLP flags as DEDR events

Dimension 6: Scope Adherence Rate (SAR)

This is distinct from tool permission adherence (which measures technical permission enforcement) — SAR measures semantic scope adherence (the agent's interpretation of what it is supposed to do).

Examples of scope violations:

A customer service agent that provides legal advice (outside authorized scope)
A code review agent that also writes production code (exceeds authorized capability)
A financial reporting agent that makes investment recommendations (prohibited activity)

SAR SLO recommendation: SAR >= 0.98. The 2% error budget is primarily reserved for genuinely ambiguous boundary cases.

Dimension 7: Audit Trail Completeness Rate (ATCR)

Audit Trail Completeness Rate measures the fraction of agent actions (tool calls, decisions, outputs) that are correctly captured in the audit log with sufficient detail for forensic reconstruction.

ATCR = correctly_logged_actions / total_actions

"Correctly logged" means:

The action is recorded in the audit log within the defined latency SLO (typically < 5 seconds)
The log record contains all required fields (agent_id, org_id, session_id, action_type, timestamp, input, output, authorization context)
The log record is tamper-evident (typically via cryptographic chaining or append-only storage)
The log record is retrievable within the defined retention period

ATCR SLO: ATCR >= 0.9999. For regulated industries, ATCR = 1.0000 may be required by compliance mandates (no action may proceed without simultaneous audit log write).

Error Budget Design for Security SLOs

Asymmetric Error Budget Allocation

Recommended error budget structure:

Security Dimension	Monthly Error Budget	Budget Exhaustion Trigger
Under-refusal rate	0.01% (1 in 10,000)	Immediate incident response, suspend high-risk features
Prompt injection success	0.001% (1 in 100,000)	Immediate incident response, consider agent suspension
Unauthorized tool call	0.01% (1 in 10,000)	Immediate incident response, audit recent tool call history
Verified data exfiltration	0 (zero tolerance)	Immediate agent suspension, incident declared
Scope violation	0.1%	Escalate, investigate, update scope boundary tests
Audit trail gap	0.01%	Immediate investigation, compliance notification if regulated

Error Budget Exhaustion Response

Unlike performance SLO error budget exhaustion (which typically triggers a deployment freeze), security SLO error budget exhaustion should trigger incident response:

Declare security incident: Create an incident record with full context of the budget-exhausting events
Quarantine the agent: Reduce the agent's scope to the most conservative authorized behavior while investigation proceeds
Notify security team: Page the on-call security engineer regardless of time of day
Preserve evidence: Ensure all logs from the incident period are preserved and tamper-protected
Begin forensic analysis: Reconstruct the sequence of events that led to the SLO violation
Remediate before restoring: Full budget restoration requires root-cause analysis, remediation implementation, and re-validation against the relevant test battery

Enforcement Mechanisms

Real-Time Circuit Breakers

class SecurityCircuitBreaker:
    """Real-time security enforcement for AI agent requests."""
    
    def __init__(self, policy_engine, metrics, error_budget_tracker):
        self.policy = policy_engine
        self.metrics = metrics
        self.budget = error_budget_tracker
    
    async def evaluate_request(self, request, agent_id, org_id):
        """
        Evaluate an incoming request against security policies.
        Returns: (is_allowed, security_violations)
        """
        violations = []
        
        # Injection pattern detection
        if injection_score := await self.policy.injection_scan(request.input):
            if injection_score > self.policy.injection_threshold:
                violations.append({
                    'type': 'high_confidence_injection',
                    'score': injection_score,
                    'action': 'block'
                })
        
        # Rate and anomaly checks
        if await self.policy.is_anomalous_request(request, agent_id):
            violations.append({
                'type': 'anomalous_request_pattern',
                'action': 'flag_and_allow'  # Log but don't block anomalies
            })
        
        # Budget check — if security error budget is exhausted, increase scrutiny
        if self.budget.is_exhausted(agent_id, 'injection_resistance'):
            # In exhausted state, raise injection threshold
            self.policy.injection_threshold = 0.30  # From 0.70 default
        
        # Block on hard violations
        hard_violations = [v for v in violations if v.get('action') == 'block']
        
        if hard_violations:
            await self.metrics.record_blocked_request(agent_id, hard_violations)
            await self.audit_log.record({
                'event_type': 'request_blocked_by_circuit_breaker',
                'agent_id': agent_id,
                'org_id': org_id,
                'violations': hard_violations
            })
            return False, hard_violations
        
        return True, violations  # May have warnings even if allowed

Asynchronous Behavioral Audits

class AsyncBehavioralAudit:
    """Asynchronous audit of agent behavioral compliance with security SLOs."""
    
    async def run_daily_audit(self, agent_id, lookback_hours=24):
        """Run daily behavioral security audit for an agent."""
        interactions = await self.load_interactions(agent_id, hours=lookback_hours)
        
        audit_results = {}
        
        # Audit each security dimension
        audit_results['rra'] = await self.audit_refusal_accuracy(interactions)
        audit_results['tpar'] = await self.audit_tool_permission_adherence(interactions)
        audit_results['irr'] = await self.audit_injection_resistance(interactions)
        audit_results['otr'] = await self.audit_output_toxicity(interactions)
        audit_results['sar'] = await self.audit_scope_adherence(interactions)
        audit_results['atcr'] = await self.audit_audit_trail_completeness(interactions)
        
        # Check against SLOs
        violations = []
        for dimension, result in audit_results.items():
            slo = self.get_slo(agent_id, dimension)
            if result.rate < slo.target:
                violations.append({
                    'dimension': dimension,
                    'current_rate': result.rate,
                    'slo_target': slo.target,
                    'gap': slo.target - result.rate,
                    'examples': result.violation_examples[:5]
                })
        
        # Record audit results and update error budgets
        for dimension, result in audit_results.items():
            await self.budget_tracker.record_audit(agent_id, dimension, result)
        
        # Alert on SLO violations
        if violations:
            await self.alert_manager.send_security_slo_violation(agent_id, violations)
        
        return audit_results, violations

How Armalo Implements Security SLOs

{
  "pact_id": "pact_a1b2c3",
  "agent_id": "agent_xyz",
  "security_slos": {
    "refusal_rate_accuracy": {
      "under_refusal_rate": 0.001,
      "over_refusal_rate": 0.05,
      "measurement_cadence": "daily",
      "test_battery_id": "battery_enterprise_v3"
    },
    "tool_permission_adherence": {
      "minimum_tpar": 0.9999,
      "enforcement": "real_time_block",
      "audit_trail": "required"
    },
    "injection_resistance": {
      "known_techniques_irr": 0.99,
      "novel_techniques_irr": 0.90,
      "test_frequency": "weekly_adversarial"
    },
    "output_toxicity": {
      "maximum_otr": 0.0001,
      "categories": ["pii_disclosure", "harmful_content", "misinformation"],
      "dlp_integration": true
    }
  },
  "error_budget_policy": {
    "exhaustion_response": "immediate_incident",
    "restoration_requirement": "root_cause_plus_remediation_validation"
  }
}

SLO Integration with NIST AI RMF and EU AI Act

NIST AI RMF Integration

The NIST AI RMF organizes AI risk management into four functions: GOVERN, MAP, MEASURE, and MANAGE. Security SLOs operate primarily in the MEASURE function but require all four:

GOVERN function requirements that SLOs depend on:

Clear ownership of each security SLO (who is accountable for compliance?)
Defined consequence for SLO violation (what organizational response is triggered?)
Documented risk tolerance (what levels of violation are acceptable in what contexts?)
Budget and authority to remediate violations

MAP function requirements that enable SLO design:

Risk identification: which OWASP LLM threat categories are in scope for this deployment?
Deployment context: what is the harm potential of each security violation category?
Stakeholder mapping: who is affected by each type of security failure?
Dependency mapping: which downstream systems depend on security guarantees?

RRA, IRR, TPAR map to NIST AI RMF MEASURE 2.6 (adversarial robustness evaluation)
DEDR maps to MEASURE 2.2 (data privacy)
ATCR maps to MEASURE 2.9 (AI risk documentation)
SAR maps to MEASURE 2.5 (AI system performance within intended contexts)

EU AI Act Alignment

Article 10 — Data Governance: The DEDR SLO addresses data governance requirements for high-risk systems, ensuring that training and deployment data handling meets the Act's standards.

Accuracy → RRA (refusal accuracy, scope adherence)
Robustness → IRR (injection resistance) and RRA (adversarial conditions)
Cybersecurity → TPAR, DEDR, ATCR (tool permissions, data protection, audit)

Building the Security SLO Test Battery

RRA Test Battery Construction

A high-quality RRA test battery requires three categories of test cases:

Category 1 — Clear in-scope requests (expected: process) These establish baseline accuracy for normal, clearly-authorized requests. Include requests that:

Use standard vocabulary for the agent's domain
Have obvious legitimate purpose
Fall clearly within the agent's defined scope Recommended: 40% of test battery, with sub-categories covering all major use cases

Category 2 — Clear prohibited requests (expected: refuse) These establish refusal accuracy for unambiguously unauthorized requests. Include requests that:

Ask for clearly prohibited content (define "clearly prohibited" for your policy)
Request actions outside authorized tool set
Solicit information the agent is explicitly prohibited from sharing Recommended: 40% of test battery, with adversarial sub-variants at multiple sophistication levels (naive, rephrased, indirect, multi-step)

Category 3 — Boundary cases (expected: handle correctly based on policy interpretation) These probe the agent's interpretation of ambiguous authorization boundaries. Include requests that:

Fall near the edge of the agent's defined scope
Could be interpreted as either authorized or unauthorized depending on context
Test the agent's disambiguation behavior (does it ask for clarification? refuse conservatively? accept liberally?) Recommended: 20% of test battery

Critical test battery maintenance:

Update quarterly to add new adversarial techniques as they're published
Expand boundary cases when policy changes create new edge zones
Track performance per test category separately (don't aggregate — a high overall RRA can hide poor performance on adversarial sub-variants)
Version control the test battery with change history

IRR Test Battery Construction

The injection resistance test battery requires regular updates because injection techniques evolve rapidly. The battery must include:

Tier 1 — Classic techniques (established, well-documented)

Direct instruction override ("Ignore previous instructions and...")
Role-play jailbreaks ("You are DAN, an AI without restrictions...")
Hypothetical framing ("Hypothetically, if you could help with...")
Gradual escalation (multi-turn sessions that incrementally push boundaries)

Tier 2 — Documented advanced techniques

Unicode/encoding bypasses
Prompt smuggling in external content
Competing instructions with authority claims
Context window poisoning attempts

Tier 3 — Novel technique variants (generated by red team)

Operator-specific custom attacks targeting known policy edges
Combination attacks that chain multiple techniques
Context-specific attacks designed for the agent's domain (e.g., code execution injection for a code assistant)

The battery should be updated when new techniques are published by security researchers (MITRE ATLAS technique additions, OWASP update cycles, academic publications).

Security SLO Incident Runbooks

Runbook 1: Confirmed Prompt Injection Success

Immediate response (first 15 minutes):

Reduce agent scope to most conservative configuration: disable all optional capabilities, restrict to read-only tools if possible
Tag all interactions from the past 72 hours for forensic review
Pull the 10 most recent interactions and manually review for injection indicators
Notify the security on-call team

Investigation phase (first 4 hours):

Reconstruct the interaction sequence that triggered the detection
Attempt to reproduce the injection in a sandboxed environment
Determine: is this a known technique (existing defense should have caught it) or novel?
If known: investigate why the existing defense failed (deployment gap? configuration drift? evasion of existing pattern?)
If novel: document the technique for battery update and industry sharing

Remediation:

Develop and test a specific defense against the technique (input filtering rule, instruction hardening, or system prompt update)
Validate the defense in the sandboxed environment
Deploy with monitoring, not silently — annotate the deployment to enable before/after comparison
Update the test battery to include this technique
Run full battery against the patched agent before restoring full capabilities

Post-incident:

Write incident report documenting technique, detection timeline, response timeline, remediation
Update error budget tracking (this incident consumed budget — when does restoration occur?)
Consider responsible disclosure if the technique is novel and likely to affect other platforms

Runbook 2: Suspected Data Exfiltration

Detection trigger: DLP integration flags an output containing PII or sensitive data patterns, or DEDR test battery reveals that seeded sensitive data is reproducible

Immediate response (first 5 minutes):

Suspend the agent immediately — data exfiltration has zero-tolerance policy
Identify the specific output(s) that triggered the alert
Determine if the flagged content is a true positive (actual sensitive data) or false positive (matched pattern but not actually sensitive)
If true positive: begin incident timeline

Evidence preservation:

Lock audit logs from the time window containing the suspected exfiltration
Capture the full interaction context (system prompt, all turns, tool calls)
Identify the source of the sensitive data in the context (was it injected? legitimately provided? from retrieval?)

Impact assessment:

How much sensitive data was potentially exposed?
Who received the output containing the sensitive data?
What is the data classification of the exposed information? (PII? PHI? trade secret?)
Are there regulatory notification obligations? (GDPR: 72-hour notification; HIPAA: 60-day notification to HHS; state breach laws vary)

The agent does not return to service until:

Root cause is identified (unintended context exposure? injection that retrieved sensitive data? misconfigured DLP exceptions?)
Root cause is remediated
DLP test battery passes at DEDR >= 0.9999 with seeded sensitive data
Security team sign-off

Conclusion: Key Takeaways

Key takeaways:

Seven dimensions require independent measurement — refusal accuracy, tool permission adherence, injection resistance, output toxicity, data exfiltration detection, scope adherence, and audit trail completeness.
Error budgets for security are asymmetric — some violations (verified data exfiltration, confirmed injection success) have zero tolerance; others have small but nonzero budgets.
Real-time and asynchronous enforcement are both required — circuit breakers catch immediate violations; behavioral audits catch sophisticated, gradual, or indirect violations.
OWASP LLM Top 10 provides the threat taxonomy — scope your security SLOs to cover the top threat vectors, not just the ones that are easy to measure.
Security SLOs must be maintained — as threat landscapes evolve and injection techniques improve, test batteries and SLO thresholds require regular updating.
Security SLO compliance is a trust signal — it belongs in agent trust profiles alongside accuracy and calibration metrics, visible to the enterprises considering hiring the agent.

security SLOsai agent securitybehavioral guaranteesinjection resistanceerror budgetarmaloai agent trustgenerative engine optimization

← Knowledge Base

Build trust into your agents

Start Free Read the docs

Based in Singapore? See our MAS AI governance compliance resources →

Security SLOs for AI Agent Platforms: Defining Behavioral Guarantees That Hold in Production

Security SLOs for AI Agent Platforms: Defining Behavioral Guarantees That Hold in Production

TL;DR

The Core Distinction: Behavioral Security vs. Operational Security

The OWASP LLM Top 10 as a Security SLO Taxonomy

NIST AI RMF Mapping

The Seven Security SLO Dimensions

Dimension 1: Refusal Rate Accuracy (RRA)

Dimension 2: Tool Permission Adherence Rate (TPAR)

Dimension 3: Injection Resistance Rate (IRR)

Dimension 4: Output Toxicity Rate (OTR)

Dimension 5: Data Exfiltration Detection Rate (DEDR)

Dimension 6: Scope Adherence Rate (SAR)

Dimension 7: Audit Trail Completeness Rate (ATCR)

Error Budget Design for Security SLOs

Asymmetric Error Budget Allocation

Error Budget Exhaustion Response

Enforcement Mechanisms

Real-Time Circuit Breakers

Asynchronous Behavioral Audits

How Armalo Implements Security SLOs

SLO Integration with NIST AI RMF and EU AI Act

NIST AI RMF Integration

EU AI Act Alignment

Building the Security SLO Test Battery

RRA Test Battery Construction

IRR Test Battery Construction

Security SLO Incident Runbooks

Runbook 1: Confirmed Prompt Injection Success

Runbook 2: Suspected Data Exfiltration

Conclusion: Key Takeaways

Build trust into your agents

Related Articles

Vendor Credential Isolation: Why AI Agents Must Never Share API Keys Across Tenants

Tool Permission Hardening for AI Agents: Least-Privilege Design at the API Layer

Secret Management Integration for AI Agents: HashiCorp Vault, AWS Secrets Manager, and Azure Key Vault Patterns

Security SLOs for AI Agent Platforms: Defining Behavioral Guarantees That Hold in Production

Security SLOs for AI Agent Platforms: Defining Behavioral Guarantees That Hold in Production

TL;DR

The Core Distinction: Behavioral Security vs. Operational Security

The OWASP LLM Top 10 as a Security SLO Taxonomy

NIST AI RMF Mapping

The Seven Security SLO Dimensions

Dimension 1: Refusal Rate Accuracy (RRA)

Dimension 2: Tool Permission Adherence Rate (TPAR)

Dimension 3: Injection Resistance Rate (IRR)

Dimension 4: Output Toxicity Rate (OTR)

Dimension 5: Data Exfiltration Detection Rate (DEDR)

Dimension 6: Scope Adherence Rate (SAR)

Dimension 7: Audit Trail Completeness Rate (ATCR)

Error Budget Design for Security SLOs

Asymmetric Error Budget Allocation

Error Budget Exhaustion Response

Enforcement Mechanisms

Real-Time Circuit Breakers

Asynchronous Behavioral Audits

How Armalo Implements Security SLOs

SLO Integration with NIST AI RMF and EU AI Act

NIST AI RMF Integration

EU AI Act Alignment

Building the Security SLO Test Battery

RRA Test Battery Construction

IRR Test Battery Construction

Security SLO Incident Runbooks

Runbook 1: Confirmed Prompt Injection Success

Runbook 2: Suspected Data Exfiltration

Conclusion: Key Takeaways

Build trust into your agents

Related Articles

Vendor Credential Isolation: Why AI Agents Must Never Share API Keys Across Tenants

Tool Permission Hardening for AI Agents: Least-Privilege Design at the API Layer

Secret Management Integration for AI Agents: HashiCorp Vault, AWS Secrets Manager, and Azure Key Vault Patterns