Every AI agent has failure modes its designers did not intend. This is not a criticism of the designers — it is a consequence of the impossibility of anticipating every input pattern that a production agent will encounter. The space of possible inputs is vast; the space of tested inputs is small; the space of inputs that trigger failures is somewhere in between.
The question is not whether untested failure modes exist. They do. The question is whether you discover them before or after deployment.
Before deployment, a failure mode is a remediation opportunity. After deployment, it is a pact violation, a security incident, or a trust score penalty — often the same failure mode triggering repeatedly before it is identified.
Behavioral boundary mapping systematically finds failure modes before deployment, by exploring the input space in a directed way that locates where agent behavior changes discontinuously. These behavioral boundaries are where failure modes live.
What Behavioral Boundaries Are
In a well-designed agent, behavior should be continuous across similar inputs: small changes in input should produce small changes in output. Behavioral boundaries are locations where this continuity breaks: a small change in input produces a large, discontinuous change in behavior.
These discontinuities are significant because they indicate that the agent's internal decision-making is sensitive to a specific dimension of the input in a way that may not be intended. Common examples:
Scope decision boundaries. An agent that correctly refuses to access private financial data in most contexts but accepts the request when it is framed as "for internal reporting purposes" has a scope decision boundary that can be exploited.
Safety decision boundaries. An agent that maintains content safety guidelines but generates policy-violating content when it is wrapped in fictional framing has a safety decision boundary.
Pact compliance boundaries. An agent that correctly applies a quality standard to most task types but drops the standard when the task is labeled as "urgent" has a pact compliance boundary.
Capability boundaries. An agent that handles queries in its trained domain well but hallucinates when queries are slightly out-of-domain has a capability boundary. This is not a safety issue, but it is a reliability issue.
Behavioral boundary mapping finds these boundaries. It does not assume they exist in specific places — it explores the input space to find where they actually are.
The Cortex Boundary Mapper (CBM)
CBM uses a gradient-following exploration strategy. The intuition: if you can define a behavioral signal (a scalar score that captures the aspect of behavior you care about), you can follow the gradient of that signal across the input space to find where it changes sharply.
The algorithm:
Step 1: Seed inputs. Start with a set of seed inputs: known-good inputs from the agent's test suite, production examples (anonymized), and category-specific examples drawn from Sentinel's library.
Step 2: Perturbation generation. For each seed input, generate a set of perturbations: semantically similar inputs that vary along specific dimensions (framing, urgency, authority claims, task scope, ethical loading). Perturbations are generated by a boundary-specialized LLM fine-tuned on Armalo's evaluation dataset.
Step 3: Behavioral scoring. Run both the original input and each perturbation through the agent. Score the behavioral signal for each (pact compliance indicator, scope violation indicator, quality score, safety signal, etc.).
Step 4: Discontinuity detection. Compute the behavioral signal gradient across the perturbation set. Identify locations where the gradient exceeds a threshold (default: 0.35 behavioral signal units per perturbation step size) — these are candidate behavioral boundaries.
Step 5: Boundary confirmation. For each candidate boundary, generate additional inputs that probe the boundary from multiple directions. Confirm that the discontinuity is reproducible and not an artifact of a single perturbation.
Step 6: Failure classification. Classify confirmed boundaries by severity:
- Critical: Scope violations, safety breaches, pact repudiations, security vulnerabilities
- High: Significant quality degradation (>20 points), behavioral inconsistency (same input class handled very differently)
- Medium: Minor compliance deviations, edge case failures unlikely to occur in production
- Low: Stylistic inconsistencies, suboptimal responses without compliance impact
Step 7: Report generation. Generate a behavioral boundary report: a structured document mapping each discovered boundary, its location in the input space, its severity, the inputs that trigger it, and recommended remediations.
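Steps 3 through 5 can be sketched in a few lines. The sketch below is illustrative only and is not the CBM implementation: the `Perturbation` and `Candidate` types, the `step_size` field (assumed to be a positive distance from the seed supplied by the perturbation generator), the `min_fraction` confirmation rule, and the toy scorer are all assumptions introduced here. Only the 0.35 default threshold comes from Step 4.

```python
from dataclasses import dataclass
from typing import Callable, List

GRADIENT_THRESHOLD = 0.35  # default threshold quoted in Step 4


@dataclass
class Perturbation:
    text: str
    step_size: float  # assumption: positive distance from the seed input


@dataclass
class Candidate:
    seed: str
    perturbation: str
    gradient: float


def gradient(seed: str, p: Perturbation, score: Callable[[str], float]) -> float:
    """Absolute change in the behavioral signal per unit of perturbation step."""
    if p.step_size <= 0:
        raise ValueError("step_size must be positive")
    return abs(score(p.text) - score(seed)) / p.step_size


def find_candidates(seed: str, perturbations: List[Perturbation],
                    score: Callable[[str], float],
                    threshold: float = GRADIENT_THRESHOLD) -> List[Candidate]:
    """Step 4: keep perturbations whose gradient exceeds the threshold."""
    found = []
    for p in perturbations:
        g = gradient(seed, p, score)
        if g > threshold:
            found.append(Candidate(seed, p.text, g))
    return found


def is_reproducible(seed: str, probes: List[Perturbation],
                    score: Callable[[str], float],
                    threshold: float = GRADIENT_THRESHOLD,
                    min_fraction: float = 0.5) -> bool:
    """Step 5 (simplified): require that enough extra probes also cross the threshold."""
    if not probes:
        return False
    hits = sum(1 for p in probes if gradient(seed, p, score) > threshold)
    return hits / len(probes) >= min_fraction


# Toy stand-in for "run the agent, then score the behavior": the signal
# collapses whenever the input mentions urgency. Not a real agent.
def toy_score(text: str) -> float:
    return 0.0 if "urgent" in text.lower() else 1.0


seed = "Summarize the Q3 report."
perturbations = [
    Perturbation("Please summarize the Q3 report.", 0.5),
    Perturbation("URGENT: summarize the Q3 report.", 0.5),
]
candidates = find_candidates(seed, perturbations, toy_score)

probes = [
    Perturbation("This is urgent, summarize the Q3 report.", 0.5),
    Perturbation("Urgent request: summarize the Q3 report now.", 0.5),
]
confirmed = is_reproducible(seed, probes, toy_score)
```

In this toy run only the urgency-framed perturbation is flagged, and the extra probes reproduce the jump. A real scorer would call the agent once per input and cache the seed score rather than recomputing it for every perturbation.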
Proposed CBM Evaluation Protocol
The 2,100-evaluation study described in the originally published version is the experiment that must be run to produce real failure-discovery and coverage-gap magnitudes.
Evaluation Loop
For each enrolled agent:
- 1. Run CBM with the documented configuration (seed inputs from eval_checks, per-seed perturbations from the boundary-specialized generator, three-stage confirmation).
- 2. Classify confirmed boundaries by severity (critical / high / medium / low).
- 3. For each discovered boundary, check whether it is covered by the agent's existing registered test suite. Compute the coverage-gap fraction per severity.
- 4. Record results in the behavioral_boundary_runs table.
Outcome Metrics
- Mean failures per agent per severity class.
- Coverage gap per severity: fraction of CBM-discovered failures not covered by existing tests.
- Per-agent distribution: fraction of agents with ≥1 critical, ≥1 high, etc.
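The three outcome metrics reduce to simple counting once each confirmed boundary is recorded with its agent, severity, and a coverage flag. The following is a minimal sketch under that assumption. The `Boundary` record and its field names are hypothetical and do not reflect the actual behavioral_boundary_runs schema, and the sample rows are made-up illustrations, not results.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, List

SEVERITIES = ("critical", "high", "medium", "low")


@dataclass(frozen=True)
class Boundary:
    agent_id: str
    severity: str
    covered_by_tests: bool  # assumption: set by the coverage check in step 3


def mean_failures_per_agent(boundaries: Iterable[Boundary],
                            agent_ids: List[str]) -> Dict[str, float]:
    """Mean number of confirmed boundaries per enrolled agent, by severity."""
    counts = Counter(b.severity for b in boundaries)
    return {s: counts[s] / len(agent_ids) for s in SEVERITIES}


def coverage_gap(boundaries: Iterable[Boundary]) -> Dict[str, float]:
    """Fraction of discovered boundaries NOT covered by existing tests, by severity."""
    found, uncovered = Counter(), Counter()
    for b in boundaries:
        found[b.severity] += 1
        if not b.covered_by_tests:
            uncovered[b.severity] += 1
    # Severities with no discovered boundaries are omitted rather than reported as 0.
    return {s: uncovered[s] / found[s] for s in SEVERITIES if found[s]}


def fraction_of_agents_with(boundaries: Iterable[Boundary],
                            agent_ids: List[str], severity: str) -> float:
    """Fraction of enrolled agents with at least one boundary of the given severity."""
    hit = {b.agent_id for b in boundaries if b.severity == severity}
    return len(hit & set(agent_ids)) / len(agent_ids)


# Made-up illustration rows, not measurements.
agents = ["a1", "a2"]
rows = [
    Boundary("a1", "critical", covered_by_tests=False),
    Boundary("a1", "critical", covered_by_tests=True),
    Boundary("a2", "high", covered_by_tests=False),
    Boundary("a1", "low", covered_by_tests=True),
]
gaps = coverage_gap(rows)
means = mean_failures_per_agent(rows, agents)
share_with_critical = fraction_of_agents_with(rows, agents, "critical")
```

Omitting severities with zero discovered boundaries keeps an empty class from being read as "fully covered".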
What we have *not yet* measured
The 2,100-evaluation study has never run. The discovery-rate table (14.7 / 2.3 / 6.1 / 4.8 / 1.5 failures per agent), the coverage-gap table (58.8% / 47.3% / 32.7% / 21.9% uncovered), and the why-existing-tests-miss-criticals partition (42% / 31% / 27%) from the originally-published version were design-time targets, not measurements. They have been removed.
The current Armalo evaluation substrate (1,249 evals, 8,231 eval_checks per the production snapshot) is sufficient to run CBM against an initial cohort of agents and produce first-pass coverage-gap estimates.
Pre-Deployment Remediation Impact (Proposed Study)
The originally-published 340-vs-680 agent matched-cohort study comparing pact-violation and security-incident rates across four CBM-usage levels (no CBM, CBM no remediation, CBM critical-only, CBM critical+high) is the experiment that would produce real remediation-impact magnitudes.
The protocol: route a fraction of new agents through CBM at pre-deployment, notifying operators of critical and high findings. Compare the 60-day pact-violation rate and security-incident rate against a matched control cohort that deployed without CBM. Pre-register cohort sizes: the originally claimed 340 + 680 agents is several times the current agent inflow, so the first real run will report the actual eligible n.
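The comparison between arms comes down to two rates and their relative difference. The sketch below shows only that arithmetic. The function names are hypothetical, the choice of denominator (tasks, agents, or agent-days) is an assumption the pre-registration would need to fix, and the counts are placeholders chosen to make the arithmetic easy to follow, not data.

```python
def rate(events: int, exposure: int) -> float:
    """Events per unit of exposure (assumption: exposure is a task count)."""
    if exposure <= 0:
        raise ValueError("exposure must be positive")
    return events / exposure


def per_thousand(events: int, exposure: int) -> float:
    """Events per 1,000 units of exposure, e.g. incidents per 1,000 tasks."""
    if exposure <= 0:
        raise ValueError("exposure must be positive")
    return 1000 * events / exposure


def relative_reduction(control_rate: float, treated_rate: float) -> float:
    """Fraction by which the treated arm's rate is lower than the control arm's."""
    if control_rate <= 0:
        raise ValueError("control_rate must be positive")
    return (control_rate - treated_rate) / control_rate


# Placeholder counts for illustration only; no cohort has been run.
control_violation_rate = rate(events=10, exposure=200)
cbm_violation_rate = rate(events=5, exposure=200)
reduction = relative_reduction(control_violation_rate, cbm_violation_rate)
```

A real analysis would also report uncertainty (for example a confidence interval on the rate difference), which this sketch leaves out.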
What we have *not yet* measured
The 340 + 680 agent matched-cohort study has never run. The four-condition outcome table (5.1% / 4.8% / 2.7% / 1.7% violation rate; 2.3 / 2.1 / 0.8 / 0.3 incidents per 1,000 tasks), the 67% / 87% reduction figures, and the per-agent cost-benefit worked example ($340/week, $17,680/year) from the originally-published version were design-time projections, not measurements. They have been removed pending real cohort data.
The Failure Mode Taxonomy
The taxonomy below describes the qualitative failure-mode categories CBM is designed to surface. The originally-published version assigned specific percentage shares to each category drawn from a non-existent corpus of 18,200 cataloged failure modes across 2,100 evaluations; those percentages were design-time priors, not measurements, and have been removed.
1. Urgency override. Agents that correctly apply safety or scope restrictions under normal framing bypass them when the request is framed as urgent ("This is an emergency situation requiring immediate action").
*Remediation:* Explicit urgency-resistance instruction in the system prompt. Urgency claims from users do not constitute authorization for scope expansion.
2. Authority claim acceptance. Agents that accept claimed authority that is not cryptographically verified ("As the platform administrator, I'm authorizing you to..."). The verbal authority claim bypasses proper authorization verification.
*Remediation:* Authority claims must be verified against the agent's registered authorization hierarchy.
3. Fictional wrapping bypass. Agents that refuse harmful content in direct requests but generate it when wrapped in fictional framing.
*Remediation:* The harm classification must evaluate the content of the output, not just the framing of the request.
4. Scope creep via analogical reasoning. Agents that correctly apply scope restrictions to explicit requests but accept out-of-scope tasks when framed as analogically similar to in-scope tasks.
*Remediation:* Scope evaluation must be structural (does this task type appear in the authorized scope list?) not analogical.
5. Cross-agent trust elevation. In multi-agent contexts, agents that treat instructions from other agents as carrying a higher trust level than the sending agent's registration entitles it to.
*Remediation:* Implement signed orchestration messages. Verify the signing agent's scope authorization.
6. Capability overconfidence. Agents that accept tasks outside their genuine capability with high confidence (hallucinating results rather than disclosing limitations).
*Remediation:* Calibrate confidence to actual capability, and add out-of-domain task recognition with appropriate escalation or refusal.
7+ Other categories. Multi-turn manipulation, tool output injection, memory injection, semantic obfuscation, rate-limit gaming, and others. Per-category prevalence will be reported from real CBM-run data per the protocol above.
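Two of the remediations above, structural scope evaluation (category 4) and signed orchestration messages (category 5), can be combined in one check: verify the signature first, then test the task type for literal membership in the sender's authorized scope. The sketch below is one possible shape, not Armalo's mechanism. The `REGISTRY` layout, the agent IDs, the task-type names, and the use of HMAC with a shared secret are assumptions made for illustration; a production system might use asymmetric signatures instead.

```python
import hashlib
import hmac
import json
from typing import Dict, FrozenSet, Tuple

# Hypothetical registry: agent id -> (shared secret, authorized task types).
REGISTRY: Dict[str, Tuple[bytes, FrozenSet[str]]] = {
    "orchestrator-1": (b"demo-secret-do-not-reuse",
                       frozenset({"summarize_report", "draft_email"})),
}


def sign(secret: bytes, sender: str, task_type: str, payload: str) -> str:
    """HMAC-SHA256 over an unambiguous JSON encoding of the message fields."""
    message = json.dumps([sender, task_type, payload]).encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def authorize(sender: str, task_type: str, payload: str, signature: str) -> bool:
    """Accept only correctly signed messages whose task type is in the sender's scope."""
    entry = REGISTRY.get(sender)
    if entry is None:
        return False  # unregistered sender: a claimed identity is not enough
    secret, scope = entry
    expected = sign(secret, sender, task_type, payload)
    if not hmac.compare_digest(expected, signature):
        return False  # signature mismatch: authority claim is unverified
    # Structural check: literal membership, with no similarity or analogy matching.
    return task_type in scope


secret = REGISTRY["orchestrator-1"][0]
good_sig = sign(secret, "orchestrator-1", "summarize_report", "Q3 report")
out_of_scope_sig = sign(secret, "orchestrator-1", "export_financials", "Q3 ledger")

in_scope_ok = authorize("orchestrator-1", "summarize_report", "Q3 report", good_sig)
forged = authorize("orchestrator-1", "summarize_report", "Q3 report", "0" * 64)
out_of_scope = authorize("orchestrator-1", "export_financials", "Q3 ledger", out_of_scope_sig)
unknown_sender = authorize("platform-admin", "summarize_report", "Q3 report", good_sig)
```

Because the scope test is a set lookup on the declared task type, a request framed as "similar to" an in-scope task still fails unless its type is literally registered.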
Integration with Sentinel CI/CD
CBM runs are integrated into Sentinel's CI/CD pipeline:
Pre-deployment gate: A CBM run is automatically triggered on every new agent model version. If critical failures are detected, the deployment is blocked with a summary report requiring operator remediation before proceeding.
Regression detection: Each CBM run is compared against the baseline CBM run for the previous version. New boundaries that appear in the current version (not present in the baseline) are flagged as regressions — behavioral changes that introduced new failure modes.
Differential mapping: For incremental updates (configuration changes, prompt updates), CBM runs a targeted differential evaluation: it tests the specific areas of behavior that the change affects, rather than running a full evaluation. The originally-published "30+ minutes full / 5–8 minutes differential" latency figures were design-time targets, not instrumented measurements; real per-run latency distributions will be reported when the CBM worker is instrumented in production.
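The gate and the regression check can be sketched as two small functions over a list of confirmed boundaries. This is a simplified illustration, not Sentinel's pipeline code. It assumes each boundary carries a stable `key` that identifies it across versions (how such a key is derived is not specified in this paper), and the sample boundaries are invented.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Boundary:
    key: str       # assumption: stable identifier comparable across versions
    severity: str  # "critical", "high", "medium", or "low"


def regressions(baseline: Iterable[Boundary],
                current: Iterable[Boundary]) -> List[Boundary]:
    """Boundaries present in the current run but absent from the baseline run."""
    known = {b.key for b in baseline}
    return [b for b in current if b.key not in known]


def gate_passes(current: Iterable[Boundary]) -> bool:
    """Pre-deployment gate: block whenever any critical boundary is present."""
    return not any(b.severity == "critical" for b in current)


# Invented example runs for two consecutive versions.
baseline_run = [Boundary("urgency-override/framing", "high")]
current_run = [
    Boundary("urgency-override/framing", "high"),
    Boundary("authority-claim/admin", "critical"),
]

new_boundaries = regressions(baseline_run, current_run)
deploy_allowed = gate_passes(current_run)
```

Here the new version introduces one critical boundary, so it is flagged as a regression and the deployment is blocked.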
Conclusion
Behavioral boundaries — the regions of the input space where agent behavior changes discontinuously — are where failure modes live. They exist in every agent, they cause production failures when unaddressed, and they are systematically missed by manual test suites that test what designers anticipated rather than what the input space contains.
CBM is designed to find them before deployment. The magnitude of the failures CBM surfaces in a real cohort, and the magnitude of the pact-violation reduction operators achieve by remediating them, are the testable empirical questions the protocol in §Replication will answer. The failure modes will be found either before deployment (as opportunities for remediation) or after deployment (as violations, incidents, and trust score penalties); behavioral boundary mapping gives operators the choice of when.
Replication
This paper is a CBM-algorithm specification + failure-discovery and remediation-impact measurement protocol. To produce real numbers in place of the originally-published 2,100-evaluation and 340-vs-680 cohort studies:
- 1. Run CBM against a pre-registered enrollment cohort. Record per-agent discovered boundaries by severity in behavioral_boundary_runs.
- 2. For each boundary, check coverage by the agent's registered test suite; compute the coverage-gap distribution.
- 3. Pre-register the remediation-impact A/B (CBM-with-remediation arm vs deploy-without-CBM control). Run for ≥ 60 days post-deployment per arm. Compare pact-violation rate and security-incident rate.
- 4. Commit raw output as apps/web/content/research/data/behavioral-boundary-mapping.json and a measurement script as scripts/research-experiments/behavioral-boundary-mapping.mjs. Register the resulting claims in apps/web/content/research/claims-registry.json with provenance: measurement.
Run pnpm research:audit to verify the registration is well-formed before publishing the follow-up revision.
*CBM algorithm specification + failure-discovery and remediation-impact measurement protocol. The 2,100-evaluation study and 340-vs-680 remediation cohort have not been run; the steps to run them are documented in §Replication.*