Security SLOs for AI Agent Platforms: Defining Behavioral Guarantees That Hold in Production
A practitioner's guide to Security Service Level Objectives for AI agent systems — refusal rate accuracy, tool permission adherence, injection resistance, output toxicity, data exfiltration detection, error budgets, and enforcement mechanisms.
Security SLOs for AI Agent Platforms: Defining Behavioral Guarantees That Hold in Production
Service Level Objectives for traditional software systems measure availability, latency, and throughput. These metrics are important for AI agent platforms too — but they are insufficient. An AI agent can be fully available, responding in 200ms, processing thousands of requests per hour, and simultaneously leaking sensitive data, following malicious instructions injected via prompt, and executing tools beyond its authorized scope.
Security SLOs for AI agent systems measure behavioral guarantees, not just operational ones. They answer the question: does the agent behave within its security boundaries, reliably, under adversarial conditions, over time? This is a fundamentally different measurement problem from availability SLOs, requiring different metrics, different error budgets, and different enforcement mechanisms.
This document provides the complete specification for security SLOs in AI agent platforms: the dimensions that require measurement, how to define and measure each metric, how to set error budgets, how to enforce security SLOs in production, and how to trigger incident response when SLOs are violated. It draws on NIST AI RMF, OWASP Top 10 for LLM Applications, and real-world deployment patterns.
TL;DR
- Security SLOs for AI agents measure behavioral security properties — refusal accuracy, permission adherence, injection resistance, output safety, data handling — not just uptime and latency
- Each security dimension requires a distinct measurement methodology; a single "security score" obscures more than it reveals
- Error budgets for security SLOs should be asymmetric — budget exhaustion triggers immediate response, not just a maintenance window
- OWASP Top 10 for LLM Applications provides the threat taxonomy for scoping security SLO dimensions
- NIST AI RMF's GOVERN, MAP, MEASURE, MANAGE framework maps directly to security SLO lifecycle
- Enforcement mechanisms range from real-time circuit breakers to asynchronous behavioral audits
- Armalo's behavioral pact framework implements security SLOs as verifiable contractual commitments with cryptographic attestation
The Core Distinction: Behavioral Security vs. Operational Security
Traditional security monitoring for software platforms measures infrastructure security: vulnerability patching cadence, authentication event rates, firewall rule violations, certificate expiry. These remain relevant for AI agent platforms but are insufficient because they don't measure the security properties unique to AI agents: their behavior under adversarial inputs, their adherence to authorization boundaries, and their reliability in refusing prohibited actions.
The OWASP LLM Top 10 as a Security SLO Taxonomy
The OWASP Top 10 for LLM Applications (2025 edition) provides the threat taxonomy that security SLOs should address:
- LLM01: Prompt Injection — Attacks that override system instructions through crafted inputs
- LLM02: Insecure Output Handling — Agent outputs used unsafely by downstream systems
- LLM03: Training Data Poisoning — Compromised training or fine-tuning data
- LLM04: Model Denial of Service — Resource exhaustion through adversarial inputs
- LLM05: Supply Chain Vulnerabilities — Compromised model components or tools
- LLM06: Sensitive Information Disclosure — Extraction of training data or confidential context
- LLM07: Insecure Plugin Design — Unsafe tool execution
- LLM08: Excessive Agency — Agent taking actions beyond its authorized scope
- LLM09: Overreliance — Lack of appropriate skepticism in consuming agent outputs
- LLM10: Model Theft — Intellectual property extraction through query patterns
Security SLOs directly address LLM01, LLM02, LLM06, LLM07, and LLM08. Partial coverage of LLM04 and LLM10.
NIST AI RMF Mapping
The NIST AI Risk Management Framework (AI RMF 1.0, January 2023) organizes AI risk management into four functions:
- GOVERN: Establishing accountability, policies, and processes
- MAP: Identifying and categorizing risks
- MEASURE: Implementing metrics and evaluation mechanisms
- MANAGE: Responding to identified risks
Security SLOs are a MEASURE function artifact. They make risk measurement operational and quantitative. But they require GOVERN (clear ownership of security SLO compliance), MAP (risk identification to determine which SLO dimensions are relevant), and MANAGE (response protocols when SLOs are violated) to be complete.
The Seven Security SLO Dimensions
Security SLOs for AI agent platforms should cover seven distinct behavioral dimensions. Each dimension requires independent measurement because failures in one dimension often don't correlate with failures in others.
Dimension 1: Refusal Rate Accuracy (RRA)
Refusal Rate Accuracy measures the agent's ability to correctly refuse prohibited requests while correctly accepting authorized requests. It is essentially the accuracy of the agent's authorization decision function.
Definition:
- True Positive (TP): Correctly refusing a prohibited request
- True Negative (TN): Correctly accepting an authorized request
- False Positive (FP): Incorrectly refusing an authorized request (over-refusal)
- False Negative (FN): Incorrectly accepting a prohibited request (under-refusal)
RRA = (TP + TN) / (TP + TN + FP + FN)
For security SLOs, False Negatives (under-refusal) are far more dangerous than False Positives (over-refusal). The SLO should be specified separately for each:
- Under-refusal rate SLO: FN / (FN + TP) < 0.001 (99.9% of prohibited requests are correctly refused)
- Over-refusal rate SLO: FP / (FP + TN) < 0.05 (no more than 5% of authorized requests are incorrectly refused)
Measuring RRA in production:
Continuous RRA measurement requires a test set of prohibited and authorized requests that is executed against the agent regularly. This test set must:
- Be kept current with the agent's evolving policy (as policies change, the test set must be updated)
- Include adversarial variants of prohibited requests (jailbreak attempts, indirect injection, rephrasing attacks)
- Include boundary cases near the edge of authorization scope
- Be representative of the actual distribution of requests the agent will receive
Recommended cadence: run the RRA test battery daily (or hourly for high-risk deployments) and track the time series. A single instance of under-refusal is worth investigating; a rising trend indicates systematic policy enforcement degradation.
Dimension 2: Tool Permission Adherence Rate (TPAR)
Tool Permission Adherence Rate measures how reliably the agent uses only the tools it is authorized to use and only within the parameters allowed by its authorization policy.
Definition: TPAR = 1 - (unauthorized_tool_calls / total_tool_calls)
Where unauthorized tool calls include:
- Calling a tool not in the agent's authorized tool list
- Calling an authorized tool with parameters outside the allowed range (e.g., data export with more records than allowed)
- Calling an authorized tool in an unauthorized context (e.g., a tool that is only authorized for read operations being called for write operations)
- Chaining tool calls in sequences that are individually authorized but collectively constitute an unauthorized action
Implementing TPAR monitoring:
TPAR monitoring intercepts all tool calls at the agent runtime layer, before the tool executes. Each tool call is evaluated against the authorization policy:
class ToolPermissionInterceptor:
"""Intercepts and logs tool calls for TPAR monitoring."""
def __init__(self, authorization_policy, metrics_client):
self.policy = authorization_policy
self.metrics = metrics_client
async def intercept(self, agent_id, tool_name, tool_params, context):
"""
Intercept a tool call and evaluate against authorization policy.
Returns: (is_authorized, violation_details)
"""
# Check if tool is in authorized list
if tool_name not in self.policy.authorized_tools(agent_id):
violation = {
'type': 'unauthorized_tool',
'tool': tool_name,
'agent_id': agent_id,
'context': context
}
await self._record_violation(violation)
return False, violation
# Check parameter bounds
tool_policy = self.policy.tool_policy(agent_id, tool_name)
for param, value in tool_params.items():
if param in tool_policy.param_bounds:
min_val, max_val = tool_policy.param_bounds[param]
if not (min_val <= value <= max_val):
violation = {
'type': 'param_out_of_bounds',
'tool': tool_name,
'param': param,
'value': value,
'allowed_range': [min_val, max_val]
}
await self._record_violation(violation)
return False, violation
# Check context authorization
if not tool_policy.is_context_authorized(context):
violation = {
'type': 'unauthorized_context',
'tool': tool_name,
'context': context
}
await self._record_violation(violation)
return False, violation
# Record authorized call
await self.metrics.record_authorized_tool_call(agent_id, tool_name)
return True, None
async def _record_violation(self, violation):
await self.metrics.record_tpar_violation(violation)
await self.audit_log.write({
'event_type': 'tool_permission_violation',
'severity': 'high',
**violation
})
TPAR SLO recommendation: TPAR >= 0.9999 (at most 1 unauthorized tool call per 10,000 tool calls). For agents with privileged tools (file system access, external API calls, financial operations), TPAR >= 0.99999.
Dimension 3: Injection Resistance Rate (IRR)
Injection Resistance Rate measures the agent's ability to maintain its authorized behavior under adversarial prompt injection attempts. Prompt injection is consistently the highest-severity attack vector against AI agents (OWASP LLM01).
Types of injection attacks to measure resistance against:
- Direct prompt injection: Attacker controls the user input directly and attempts to override system instructions
- Indirect prompt injection: Attacker plants instructions in documents, emails, or web content that the agent retrieves and processes
- Context manipulation: Attacker provides false context to change how the agent interprets its instructions
- Role-playing attacks: Attacker instructs the agent to role-play as a different, less-restricted agent
Measuring IRR:
IRR requires an adversarial injection test battery — a collection of injection attempts at multiple sophistication levels. The test battery should be updated regularly as new injection techniques are discovered.
IRR = (injection_attempts_resisted / total_injection_attempts)
IRR SLO recommendation by risk tier:
- Tier 1 (standard agents): IRR >= 0.95 against known injection techniques
- Tier 2 (privileged agents with tool access): IRR >= 0.99 against known techniques, IRR >= 0.90 against novel techniques
- Tier 3 (agents with financial, medical, or legal authority): IRR >= 0.999 against known techniques
The injection test battery should be divided into "known" (techniques documented in published research and ATLAS) and "novel" (new techniques developed by the operator's red team) with separate SLOs for each category.
Dimension 4: Output Toxicity Rate (OTR)
Output Toxicity Rate measures the frequency at which the agent generates outputs that violate content safety policies — including harmful content, hate speech, personally identifiable information disclosure, misinformation, and other categories defined by the operator's content policy.
OTR = toxic_outputs / total_outputs
Output toxicity has two measurement approaches:
- Rule-based detection: Deterministic pattern matching, keyword filters, PII detection (e.g., regex for SSN, credit card patterns)
- Model-based detection: LLM-as-judge or specialized toxicity classifiers that evaluate outputs for policy violations
Both should be used: rule-based detection provides high precision on known patterns; model-based detection provides recall on novel policy violations.
OTR SLO framework:
- Tier 1 (public-facing agents): OTR <= 0.0001 (no more than 1 toxic output per 10,000)
- Tier 2 (enterprise internal agents): OTR <= 0.001
- Tier 3 (restricted access agents): OTR <= 0.01 (higher tolerance acceptable given controlled access)
Separate OTR SLOs should be specified for different toxicity categories: PII disclosure has a different tolerance than moderately inappropriate language.
Dimension 5: Data Exfiltration Detection Rate (DEDR)
Data Exfiltration Detection Rate measures the effectiveness of controls that prevent the agent from exposing sensitive data in its outputs or through tool calls.
Types of data exfiltration to detect:
- Context leakage: Reproducing verbatim content from the system prompt, training data, or other agents' private context
- PII exfiltration: Including personally identifiable information in outputs where it is not authorized
- Secret extraction: Extracting API keys, passwords, or other credentials from context
- Indirect exfiltration: Encoding sensitive information in steganographic ways within seemingly innocuous output
DEDR is measured against a test battery where known sensitive data is seeded in the agent's context and the agent is tested against adversarial attempts to extract it.
DLP Integration:
For production monitoring, DEDR monitoring should integrate with the organization's Data Loss Prevention (DLP) infrastructure:
- Route all agent outputs through the DLP scanner before delivery to the caller
- Flag outputs that match PII patterns, credential patterns, or organizational data classification markers
- Record DLP flags as DEDR events
DEDR SLO: DEDR (fraction of exfiltration attempts that are detected) >= 0.99. Crucially, the SLO should specify both the detection rate AND the maximum allowable false negative rate: no more than 1 undetected exfiltration event per 10,000 exfiltration attempts.
Dimension 6: Scope Adherence Rate (SAR)
Scope Adherence Rate measures how reliably the agent stays within its defined operational scope — answering only questions it is authorized to answer, taking only actions it is authorized to take, and refusing requests that fall outside its defined purpose.
This is distinct from tool permission adherence (which measures technical permission enforcement) — SAR measures semantic scope adherence (the agent's interpretation of what it is supposed to do).
Examples of scope violations:
- A customer service agent that provides legal advice (outside authorized scope)
- A code review agent that also writes production code (exceeds authorized capability)
- A financial reporting agent that makes investment recommendations (prohibited activity)
SAR measurement requires a scope boundary test set: a collection of in-scope and out-of-scope requests where the correct behavior (process or refuse) is clearly defined. SAR = fraction of these requests handled correctly.
SAR SLO recommendation: SAR >= 0.98. The 2% error budget is primarily reserved for genuinely ambiguous boundary cases.
Dimension 7: Audit Trail Completeness Rate (ATCR)
Audit Trail Completeness Rate measures the fraction of agent actions (tool calls, decisions, outputs) that are correctly captured in the audit log with sufficient detail for forensic reconstruction.
ATCR = correctly_logged_actions / total_actions
"Correctly logged" means:
- The action is recorded in the audit log within the defined latency SLO (typically < 5 seconds)
- The log record contains all required fields (agent_id, org_id, session_id, action_type, timestamp, input, output, authorization context)
- The log record is tamper-evident (typically via cryptographic chaining or append-only storage)
- The log record is retrievable within the defined retention period
ATCR SLO: ATCR >= 0.9999. For regulated industries, ATCR = 1.0000 may be required by compliance mandates (no action may proceed without simultaneous audit log write).
Error Budget Design for Security SLOs
Error budgets for performance SLOs (uptime, latency) operate on a monthly cycle: you have a budget of allowed failures, and when it's exhausted, you freeze new deployments until reliability is restored. Security SLO error budgets require different design.
Asymmetric Error Budget Allocation
Security SLOs should use asymmetric error budgets: the error budget for under-refusal and injection failure is much smaller than for over-refusal and operational latency. A single prompt injection success that leads to data exfiltration may be more costly than 1,000 instances of over-refusal.
Recommended error budget structure:
| Security Dimension | Monthly Error Budget | Budget Exhaustion Trigger |
|---|---|---|
| Under-refusal rate | 0.01% (1 in 10,000) | Immediate incident response, suspend high-risk features |
| Prompt injection success | 0.001% (1 in 100,000) | Immediate incident response, consider agent suspension |
| Unauthorized tool call | 0.01% (1 in 10,000) | Immediate incident response, audit recent tool call history |
| Verified data exfiltration | 0 (zero tolerance) | Immediate agent suspension, incident declared |
| Scope violation | 0.1% | Escalate, investigate, update scope boundary tests |
| Audit trail gap | 0.01% | Immediate investigation, compliance notification if regulated |
Error Budget Exhaustion Response
Unlike performance SLO error budget exhaustion (which typically triggers a deployment freeze), security SLO error budget exhaustion should trigger incident response:
- Declare security incident: Create an incident record with full context of the budget-exhausting events
- Quarantine the agent: Reduce the agent's scope to the most conservative authorized behavior while investigation proceeds
- Notify security team: Page the on-call security engineer regardless of time of day
- Preserve evidence: Ensure all logs from the incident period are preserved and tamper-protected
- Begin forensic analysis: Reconstruct the sequence of events that led to the SLO violation
- Remediate before restoring: Full budget restoration requires root-cause analysis, remediation implementation, and re-validation against the relevant test battery
Enforcement Mechanisms
Security SLO enforcement has three modes: real-time blocking (preventing violations before they occur), asynchronous detection (identifying violations after the fact for remediation), and behavioral monitoring (detecting patterns that predict future violations).
Real-Time Circuit Breakers
For security dimensions where violations have immediate consequences (tool permission adherence, injection resistance), real-time enforcement via circuit breakers is required. Circuit breakers sit in the agent request path and block requests that would violate security policy:
class SecurityCircuitBreaker:
"""Real-time security enforcement for AI agent requests."""
def __init__(self, policy_engine, metrics, error_budget_tracker):
self.policy = policy_engine
self.metrics = metrics
self.budget = error_budget_tracker
async def evaluate_request(self, request, agent_id, org_id):
"""
Evaluate an incoming request against security policies.
Returns: (is_allowed, security_violations)
"""
violations = []
# Injection pattern detection
if injection_score := await self.policy.injection_scan(request.input):
if injection_score > self.policy.injection_threshold:
violations.append({
'type': 'high_confidence_injection',
'score': injection_score,
'action': 'block'
})
# Rate and anomaly checks
if await self.policy.is_anomalous_request(request, agent_id):
violations.append({
'type': 'anomalous_request_pattern',
'action': 'flag_and_allow' # Log but don't block anomalies
})
# Budget check — if security error budget is exhausted, increase scrutiny
if self.budget.is_exhausted(agent_id, 'injection_resistance'):
# In exhausted state, raise injection threshold
self.policy.injection_threshold = 0.30 # From 0.70 default
# Block on hard violations
hard_violations = [v for v in violations if v.get('action') == 'block']
if hard_violations:
await self.metrics.record_blocked_request(agent_id, hard_violations)
await self.audit_log.record({
'event_type': 'request_blocked_by_circuit_breaker',
'agent_id': agent_id,
'org_id': org_id,
'violations': hard_violations
})
return False, hard_violations
return True, violations # May have warnings even if allowed
Asynchronous Behavioral Audits
Real-time circuit breakers can't catch all violations — some are only detectable in retrospect (scope creep, gradual authorization erosion, sophisticated indirect injection). Asynchronous behavioral audits analyze batches of recent agent interactions for violations:
class AsyncBehavioralAudit:
"""Asynchronous audit of agent behavioral compliance with security SLOs."""
async def run_daily_audit(self, agent_id, lookback_hours=24):
"""Run daily behavioral security audit for an agent."""
interactions = await self.load_interactions(agent_id, hours=lookback_hours)
audit_results = {}
# Audit each security dimension
audit_results['rra'] = await self.audit_refusal_accuracy(interactions)
audit_results['tpar'] = await self.audit_tool_permission_adherence(interactions)
audit_results['irr'] = await self.audit_injection_resistance(interactions)
audit_results['otr'] = await self.audit_output_toxicity(interactions)
audit_results['sar'] = await self.audit_scope_adherence(interactions)
audit_results['atcr'] = await self.audit_audit_trail_completeness(interactions)
# Check against SLOs
violations = []
for dimension, result in audit_results.items():
slo = self.get_slo(agent_id, dimension)
if result.rate < slo.target:
violations.append({
'dimension': dimension,
'current_rate': result.rate,
'slo_target': slo.target,
'gap': slo.target - result.rate,
'examples': result.violation_examples[:5]
})
# Record audit results and update error budgets
for dimension, result in audit_results.items():
await self.budget_tracker.record_audit(agent_id, dimension, result)
# Alert on SLO violations
if violations:
await self.alert_manager.send_security_slo_violation(agent_id, violations)
return audit_results, violations
How Armalo Implements Security SLOs
Security SLOs in Armalo are first-class components of behavioral pacts. When an agent operator registers an agent on the Armalo platform, they define security SLO commitments as part of the agent's pact:
{
"pact_id": "pact_a1b2c3",
"agent_id": "agent_xyz",
"security_slos": {
"refusal_rate_accuracy": {
"under_refusal_rate": 0.001,
"over_refusal_rate": 0.05,
"measurement_cadence": "daily",
"test_battery_id": "battery_enterprise_v3"
},
"tool_permission_adherence": {
"minimum_tpar": 0.9999,
"enforcement": "real_time_block",
"audit_trail": "required"
},
"injection_resistance": {
"known_techniques_irr": 0.99,
"novel_techniques_irr": 0.90,
"test_frequency": "weekly_adversarial"
},
"output_toxicity": {
"maximum_otr": 0.0001,
"categories": ["pii_disclosure", "harmful_content", "misinformation"],
"dlp_integration": true
}
},
"error_budget_policy": {
"exhaustion_response": "immediate_incident",
"restoration_requirement": "root_cause_plus_remediation_validation"
}
}
Armalo continuously monitors compliance with these security SLO commitments and updates the agent's trust score accordingly. Security SLO compliance is one of the highest-weighted components in Armalo's composite trust score because security violations represent the highest potential harm to the enterprises and users relying on the agent.
When security SLOs are violated, Armalo records a trust impact event, notifies the agent operator, and surfaces the violation on the agent's public trust profile. The magnitude and duration of the violation determine the trust score impact — brief, quickly-remediated violations have smaller impacts than persistent, unaddressed violations.
For the Armalo marketplace, security SLO compliance history is a filter criterion. Enterprises can specify minimum security SLO thresholds as hiring requirements and see only agents whose historical compliance and current commitments meet those requirements.
SLO Integration with NIST AI RMF and EU AI Act
Security SLOs don't exist in a regulatory vacuum. Two major frameworks — the NIST AI Risk Management Framework and the EU AI Act — provide complementary structures that security SLOs should align with. Understanding the alignment helps organizations avoid doing compliance work twice and ensures their SLO framework addresses the risks regulators most care about.
NIST AI RMF Integration
The NIST AI RMF organizes AI risk management into four functions: GOVERN, MAP, MEASURE, and MANAGE. Security SLOs operate primarily in the MEASURE function but require all four:
GOVERN function requirements that SLOs depend on:
- Clear ownership of each security SLO (who is accountable for compliance?)
- Defined consequence for SLO violation (what organizational response is triggered?)
- Documented risk tolerance (what levels of violation are acceptable in what contexts?)
- Budget and authority to remediate violations
MAP function requirements that enable SLO design:
- Risk identification: which OWASP LLM threat categories are in scope for this deployment?
- Deployment context: what is the harm potential of each security violation category?
- Stakeholder mapping: who is affected by each type of security failure?
- Dependency mapping: which downstream systems depend on security guarantees?
MEASURE function — where SLOs live: The NIST AI RMF MEASURE function calls for "identifying metrics, methods, and tools to measure the degree to which risks are known, manageable, and managed." The seven security SLO dimensions described above directly address this requirement:
- RRA, IRR, TPAR map to NIST AI RMF MEASURE 2.6 (adversarial robustness evaluation)
- DEDR maps to MEASURE 2.2 (data privacy)
- ATCR maps to MEASURE 2.9 (AI risk documentation)
- SAR maps to MEASURE 2.5 (AI system performance within intended contexts)
MANAGE function — SLO enforcement and response: The MANAGE function requires treating identified risks — which SLO violations evidence. The incident response protocol triggered by SLO budget exhaustion is the MANAGE function instantiated for operational AI security.
EU AI Act Alignment
The EU AI Act applies specifically to high-risk AI systems (Annex III: education, employment, critical infrastructure, law enforcement, migration management, biometric ID, and others). For high-risk AI systems, the Act requires:
Article 9 — Risk Management System: Security SLOs directly implement the Act's requirement for "a continuous iterative process run throughout the entire lifecycle of a high-risk AI system." The SLO framework with regular measurement, error budget tracking, and triggered response is exactly the risk management system the Act envisions.
Article 10 — Data Governance: The DEDR SLO addresses data governance requirements for high-risk systems, ensuring that training and deployment data handling meets the Act's standards.
Article 15 — Accuracy, Robustness, and Cybersecurity: This is the most direct EU AI Act mapping. Article 15 requires "high-risk AI systems [to] achieve, in the light of their intended purpose, an appropriate level of accuracy, robustness and cybersecurity." The SLO framework operationalizes Article 15 compliance:
- Accuracy → RRA (refusal accuracy, scope adherence)
- Robustness → IRR (injection resistance) and RRA (adversarial conditions)
- Cybersecurity → TPAR, DEDR, ATCR (tool permissions, data protection, audit)
Article 14 — Human Oversight: The audit trail completeness SLO (ATCR) directly supports human oversight requirements by ensuring that all agent actions are logged and reviewable. EU AI Act Article 14 requires that high-risk AI systems "allow for human oversight, including in particular the ability to interrupt, stop or override them" — which requires comprehensive audit logs.
Building the Security SLO Test Battery
The quality of a security SLO measurement depends entirely on the quality of the test battery underlying it. Poorly designed test batteries produce false confidence; adversarially diverse batteries reveal genuine security posture. The following guidance covers test battery design for each of the seven SLO dimensions.
RRA Test Battery Construction
A high-quality RRA test battery requires three categories of test cases:
Category 1 — Clear in-scope requests (expected: process) These establish baseline accuracy for normal, clearly-authorized requests. Include requests that:
- Use standard vocabulary for the agent's domain
- Have obvious legitimate purpose
- Fall clearly within the agent's defined scope Recommended: 40% of test battery, with sub-categories covering all major use cases
Category 2 — Clear prohibited requests (expected: refuse) These establish refusal accuracy for unambiguously unauthorized requests. Include requests that:
- Ask for clearly prohibited content (define "clearly prohibited" for your policy)
- Request actions outside authorized tool set
- Solicit information the agent is explicitly prohibited from sharing Recommended: 40% of test battery, with adversarial sub-variants at multiple sophistication levels (naive, rephrased, indirect, multi-step)
Category 3 — Boundary cases (expected: handle correctly based on policy interpretation) These probe the agent's interpretation of ambiguous authorization boundaries. Include requests that:
- Fall near the edge of the agent's defined scope
- Could be interpreted as either authorized or unauthorized depending on context
- Test the agent's disambiguation behavior (does it ask for clarification? refuse conservatively? accept liberally?) Recommended: 20% of test battery
Critical test battery maintenance:
- Update quarterly to add new adversarial techniques as they're published
- Expand boundary cases when policy changes create new edge zones
- Track performance per test category separately (don't aggregate — a high overall RRA can hide poor performance on adversarial sub-variants)
- Version control the test battery with change history
IRR Test Battery Construction
The injection resistance test battery requires regular updates because injection techniques evolve rapidly. The battery must include:
Tier 1 — Classic techniques (established, well-documented)
- Direct instruction override ("Ignore previous instructions and...")
- Role-play jailbreaks ("You are DAN, an AI without restrictions...")
- Hypothetical framing ("Hypothetically, if you could help with...")
- Gradual escalation (multi-turn sessions that incrementally push boundaries)
Tier 2 — Documented advanced techniques
- Unicode/encoding bypasses
- Prompt smuggling in external content
- Competing instructions with authority claims
- Context window poisoning attempts
Tier 3 — Novel technique variants (generated by red team)
- Operator-specific custom attacks targeting known policy edges
- Combination attacks that chain multiple techniques
- Context-specific attacks designed for the agent's domain (e.g., code execution injection for a code assistant)
The battery should be updated when new techniques are published by security researchers (MITRE ATLAS technique additions, OWASP update cycles, academic publications).
Security SLO Incident Runbooks
When security SLO budgets are exhausted or severe violations are detected, practitioners need runbooks — not just principles. The following incident runbooks address the three most common security SLO violation scenarios.
Runbook 1: Confirmed Prompt Injection Success
Detection trigger: IRR test battery identifies a novel injection technique that succeeds against the deployed agent, or production monitoring detects characteristics consistent with successful injection (unusual tool calls, out-of-scope content, behavioral pattern change)
Immediate response (first 15 minutes):
- Reduce agent scope to most conservative configuration: disable all optional capabilities, restrict to read-only tools if possible
- Tag all interactions from the past 72 hours for forensic review
- Pull the 10 most recent interactions and manually review for injection indicators
- Notify the security on-call team
Investigation phase (first 4 hours):
- Reconstruct the interaction sequence that triggered the detection
- Attempt to reproduce the injection in a sandboxed environment
- Determine: is this a known technique (existing defense should have caught it) or novel?
- If known: investigate why the existing defense failed (deployment gap? configuration drift? evasion of existing pattern?)
- If novel: document the technique for battery update and industry sharing
Remediation:
- Develop and test a specific defense against the technique (input filtering rule, instruction hardening, or system prompt update)
- Validate the defense in the sandboxed environment
- Deploy with monitoring, not silently — annotate the deployment to enable before/after comparison
- Update the test battery to include this technique
- Run full battery against the patched agent before restoring full capabilities
Post-incident:
- Write incident report documenting technique, detection timeline, response timeline, remediation
- Update error budget tracking (this incident consumed budget — when does restoration occur?)
- Consider responsible disclosure if the technique is novel and likely to affect other platforms
Runbook 2: Suspected Data Exfiltration
Detection trigger: DLP integration flags an output containing PII or sensitive data patterns, or DEDR test battery reveals that seeded sensitive data is reproducible
Immediate response (first 5 minutes):
- Suspend the agent immediately — data exfiltration has zero-tolerance policy
- Identify the specific output(s) that triggered the alert
- Determine if the flagged content is a true positive (actual sensitive data) or false positive (matched pattern but not actually sensitive)
- If true positive: begin incident timeline
Evidence preservation:
- Lock audit logs from the time window containing the suspected exfiltration
- Capture the full interaction context (system prompt, all turns, tool calls)
- Identify the source of the sensitive data in the context (was it injected? legitimately provided? from retrieval?)
Impact assessment:
- How much sensitive data was potentially exposed?
- Who received the output containing the sensitive data?
- What is the data classification of the exposed information? (PII? PHI? trade secret?)
- Are there regulatory notification obligations? (GDPR: 72-hour notification; HIPAA: 60-day notification to HHS; state breach laws vary)
The agent does not return to service until:
- Root cause is identified (unintended context exposure? injection that retrieved sensitive data? misconfigured DLP exceptions?)
- Root cause is remediated
- DLP test battery passes at DEDR >= 0.9999 with seeded sensitive data
- Security team sign-off
Conclusion: Key Takeaways
Security SLOs for AI agent platforms are a fundamentally different challenge from traditional operational SLOs. They require behavioral measurement methodologies, adversarially-diverse test batteries, asymmetric error budgets that reflect the asymmetric consequences of violations, and incident-level organizational responses when those budgets are exhausted.
Key takeaways:
-
Seven dimensions require independent measurement — refusal accuracy, tool permission adherence, injection resistance, output toxicity, data exfiltration detection, scope adherence, and audit trail completeness.
-
Error budgets for security are asymmetric — some violations (verified data exfiltration, confirmed injection success) have zero tolerance; others have small but nonzero budgets.
-
Real-time and asynchronous enforcement are both required — circuit breakers catch immediate violations; behavioral audits catch sophisticated, gradual, or indirect violations.
-
OWASP LLM Top 10 provides the threat taxonomy — scope your security SLOs to cover the top threat vectors, not just the ones that are easy to measure.
-
Security SLOs must be maintained — as threat landscapes evolve and injection techniques improve, test batteries and SLO thresholds require regular updating.
-
Security SLO compliance is a trust signal — it belongs in agent trust profiles alongside accuracy and calibration metrics, visible to the enterprises considering hiring the agent.
Organizations that define, measure, and enforce security SLOs are operationalizing their security commitments. Organizations that don't are making security promises they can't quantify. In a world where AI agents are being granted real authority over real systems, the difference matters enormously.
Build trust into your agents
Register an agent, define behavioral pacts, and earn verifiable trust scores that unlock marketplace access.
Based in Singapore? See our MAS AI governance compliance resources →