The Cognitive Overload Boundary
Human working memory can hold approximately seven items (plus or minus two, per Miller's law). For agent oversight, the practical implication is that a single person can maintain meaningful situational awareness of roughly five to nine agents before the cognitive model starts to degrade. They stop knowing which agent does what, which data it can access, when it was last evaluated, and who to call if something goes wrong.
At twenty agents, a team might maintain awareness through informal communication and daily standups. At fifty agents, that approach is already failing: people have gaps in their mental model, nobody has a complete picture, and institutional knowledge starts concentrating in whoever set up the original deployments. At one hundred agents, the informal approach is simply incompatible with responsible operation. The organization is, at that point, running a fleet it does not understand.
The solution is not to keep the team small enough that informal awareness works. The solution is to build systematic governance infrastructure so that the required knowledge lives in systems, not heads.
The Audit Surface Problem
Consider a modest estimate: 100 agents each executing 100 actions per day. That is 10,000 decisions per day that your organization has effectively made via automation. At 250 working days per year, that is 2.5 million agent-executed decisions annually.
No human team can review 10,000 decisions per day. But you cannot simply not review them. Some fraction will be wrong. Some fraction will be policy violations. Some fraction will be consequential enough that someone will eventually ask for an explanation. The governance question is not "will we review everything"; the answer to that is obviously no. The question is: what is our sampling strategy, what triggers a full review, and can we reconstruct the decision context for any individual action if we need to?
The only way to make that work at 100+ agents is to design for it upfront: structured audit logging, anomaly detection, stratified sampling, and clear escalation triggers.
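As a rough illustration of stratified sampling, the sketch below draws a review sample separately from each risk tier of a day's audit log. The record shape, the `risk_tier` field on log records, and the per-tier rates are assumptions made for the example, not recommended values.

```python
import random
from collections import defaultdict

# Hypothetical per-tier review rates; real values are a policy decision.
SAMPLE_RATES = {"high": 0.20, "medium": 0.05, "low": 0.01}

def stratified_sample(records, rates=SAMPLE_RATES, seed=0):
    """Return a review sample drawn separately from each risk tier."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    by_tier = defaultdict(list)
    for rec in records:
        by_tier[rec["risk_tier"]].append(rec)
    sample = []
    for tier, recs in by_tier.items():
        rate = rates.get(tier, 1.0)  # unknown tiers are reviewed in full
        k = min(len(recs), max(1, round(len(recs) * rate)))
        sample.extend(rng.sample(recs, k))
    return sample

# 10,000 daily actions split across tiers (illustrative proportions).
log = (
    [{"id": i, "risk_tier": "high"} for i in range(500)]
    + [{"id": i, "risk_tier": "medium"} for i in range(500, 3000)]
    + [{"id": i, "risk_tier": "low"} for i in range(3000, 10000)]
)
review_queue = stratified_sample(log)
```

With these rates, a 10,000-action day yields a review queue of a few hundred items, weighted toward the high-risk tier rather than spread uniformly.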
The Failure Correlation Risk
At small fleet sizes, agent failures are mostly independent. Agent A fails, Agent B keeps running. The impact is bounded.
At 100 agents, something changes structurally. Agents share models. They share data sources. They share tools and integrations. They may share memory or context. A model-level failure (a safety update that changes behavior, a provider outage, a prompt injection vulnerability) can hit the entire fleet simultaneously. A data source failure hits every agent pulling from that source. A compromised shared tool can propagate through the entire fleet.
This is the failure correlation problem. It is analogous to the concentration risk problem in financial portfolios: diversification within the same underlying asset class does not actually reduce systemic risk. One hundred agents all pulling from the same CRM, using the same LLM provider, and executing on the same infrastructure are not one hundred independent risk units. They are one highly correlated risk unit that happens to have one hundred manifestations.
Governing for this requires explicit dependency mapping, fleet-wide pause capabilities, and blast radius analysis for every shared component.
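Dependency mapping can start very simply. The sketch below assumes each agent declares the shared components it relies on (the agent names and component labels are made up) and inverts that map to show how many agents a single component failure would hit.

```python
from collections import defaultdict

# Hypothetical dependency declarations: agent -> shared components it relies on.
AGENT_DEPENDENCIES = {
    "refund-agent": {"llm:provider-a", "data:crm", "tool:ticketing"},
    "billing-agent": {"llm:provider-a", "data:billing-db"},
    "research-agent": {"llm:provider-b", "data:crm"},
}

def blast_radius(dependencies):
    """Invert the map: component -> set of agents affected if it fails."""
    affected = defaultdict(set)
    for agent, components in dependencies.items():
        for component in components:
            affected[component].add(agent)
    return dict(affected)

radius = blast_radius(AGENT_DEPENDENCIES)

# Rank shared components by how many agents a single failure would hit,
# breaking ties alphabetically so the output is stable.
ranked = sorted(radius.items(), key=lambda kv: (-len(kv[1]), kv[0]))
```

The components at the top of `ranked` are the ones that most need a documented pause procedure and a named owner.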
The Regulatory Exposure Acceleration
The EU AI Act came into force in August 2024, with enforcement timelines staggered through 2026 and 2027. Article 9 requires a risk management system for high-risk AI systems: documented, systematic, iterative, and covering the full lifecycle. Article 13 requires transparency and provision of information to deployers. Article 14 requires human oversight measures with specific capabilities: ability to monitor, ability to intervene, ability to override, ability to halt. Article 17 requires a quality management system with explicit roles and responsibilities.
None of these requirements are satisfied by informal governance. They require documented systems. They require named roles. They require evidence of ongoing operation.
ISO/IEC 42001:2023, the AI management system standard, makes similar demands in clause 6.1 (risk assessment and treatment) and clause 8.4 (AI system impact assessment). Organizations seeking certification or preparing for regulatory inquiry need to be able to demonstrate, not just claim, that their AI governance is systematic and documented.
At one agent, this is manageable with relatively light documentation. At one hundred agents, it requires proper infrastructure. The organizations that build that infrastructure before they need to demonstrate it have a significant advantage over those that try to retrofit documentation onto an already-running fleet.
The Authority Vacuum Problem
Before getting to solutions, it is worth naming the core structural problem more precisely: the authority vacuum.
In a traditional organizational hierarchy, every decision has an owner. When a customer service rep makes a promise to a customer, we know who made that promise. When that promise turns out to be wrong, we know who is responsible for the error, who should communicate the correction, and who has the authority to offer a remedy. The accountability structure exists.
When an AI agent makes a representation to a customer, the accountability structure often does not exist. The agent was built by one team, deployed by another, operating under a policy owned by a third, using data managed by a fourth. If something goes wrong:
- Who owns the incident? (Nobody has specifically been assigned to own agent incidents)
- Who has the authority to take the agent offline? (Unclear; the team that deployed it is no longer actively monitoring it)
- Who is responsible for communicating with the affected party? (The customer service team, but they didn't deploy the agent and don't know its full behavior)
- Whose budget covers remediation? (Legal? IT? The business unit that requested the agent?)
- Who is responsible for ensuring it doesn't happen again? (Everyone nominally, nobody specifically)
This is the authority vacuum. It is not primarily a technology problem. It is an organizational design problem. And the way you solve organizational design problems is with explicit role assignments, documented accountability structures, and clear escalation paths, all in place before the incidents happen.
The operating model described in this post is primarily an instrument for eliminating the authority vacuum.
Component 1: The Agent Registry
The foundational artifact of any agent governance system is the registry. Before you can govern agents, you need to know they exist and what they do. This sounds obvious. It is routinely ignored.
In most organizations with a sprawling agent deployment, the actual inventory lives in a combination of: someone's head, a Confluence page that was last updated eight months ago, a Slack channel where deployment announcements were made, and the infrastructure dashboards for whoever hosts the compute. Nobody has a definitive list of all deployed agents, their owners, their data access levels, and their last evaluation dates.
The agent registry is the solution.
Required Schema
A production-grade agent registry should capture at minimum:
{
  "agent_id": "uuid-v4",
  "display_name": "Customer Refund Eligibility Agent",
  "owner_email": "jane.doe@company.com",
  "owner_team": "Customer Experience Engineering",
  "backup_owner_email": "john.smith@company.com",
  "purpose_statement": "Evaluates customer refund requests against current refund policy and routes to human review when eligibility is ambiguous or value exceeds $500",
  "capability_scope": [
    "read:crm_customer_records",
    "read:order_history",
    "read:refund_policy_current",
    "write:refund_request_status",
    "call:human_review_queue"
  ],
  "explicit_exclusions": [
    "write:financial_transactions",
    "send:customer_communications",
    "modify:refund_policy"
  ],
  "data_access_level": "confidential",
  "pii_access": true,
  "financial_authority_usd": 0,
  "model_provider": "anthropic",
  "model_version": "claude-3-5-sonnet-20241022",
  "deployment_environment": "production",
  "deployment_date": "2025-11-15T00:00:00Z",
  "last_eval_date": "2026-03-01T00:00:00Z",
  "last_eval_score": 87.3,
  "pact_id": "pact-uuid-here",
  "status": "active",
  "risk_tier": "medium",
  "regulatory_flags": ["gdpr_article_22"],
  "review_cycle_days": 30,
  "next_required_review": "2026-04-01T00:00:00Z",
  "integrations": [
    {"name": "Salesforce CRM", "access_type": "read", "data_sensitivity": "confidential"},
    {"name": "Order Management System", "access_type": "read", "data_sensitivity": "internal"}
  ],
  "escalation_contact": "oncall-cx-eng@company.com",
  "incident_channel": "#cx-agent-incidents",
  "approved_by": "governance-board",
  "approval_date": "2025-11-10T00:00:00Z"
}
The fields that organizations most commonly omit, and most frequently need during incidents, are:
explicit_exclusions: What is the agent explicitly prohibited from doing? This is as important as what it can do, and it should be documented separately from capability_scope because it communicates intent: these actions were considered and rejected, not just omitted. Without it, "capability scope" is interpreted as exhaustive, and the agent gets blamed for doing something that was never specifically prohibited but should have been obvious.
financial_authority_usd: What is the maximum financial value this agent can commit to, execute, or influence? Zero means it can only read and route. A positive number means it can execute transactions up to that amount without human approval. This field directly drives the escalation logic in incident management.
regulatory_flags: What regulations apply to this agent's operation? GDPR Article 22 (automated decision-making) is relevant for agents making decisions that significantly affect individuals. EU AI Act Annex III categories are relevant for high-risk applications. Having this in the registry means you can instantly answer "which agents are subject to this regulation" when a compliance team asks.
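To make the registry answer the compliance question directly, something like the following sketch can sit on top of it. It assumes the registry is available as a list of dictionaries shaped like the schema above; the required-field subset and both function names are illustrative choices, not part of any particular registry product.

```python
# Illustrative subset of the schema that every entry is expected to carry.
REQUIRED_FIELDS = {
    "agent_id", "owner_email", "capability_scope", "explicit_exclusions",
    "financial_authority_usd", "regulatory_flags", "status",
}

def missing_fields(entry):
    """Return the required registry fields absent from an entry, sorted."""
    return sorted(REQUIRED_FIELDS - set(entry))

def agents_under_regulation(registry, flag):
    """Answer 'which active agents are subject to this regulation?'"""
    return [
        entry["agent_id"]
        for entry in registry
        if flag in entry.get("regulatory_flags", [])
        and entry.get("status") == "active"
    ]

registry = [
    {
        "agent_id": "a-1",
        "owner_email": "jane.doe@company.com",
        "capability_scope": ["read:order_history"],
        "explicit_exclusions": ["write:financial_transactions"],
        "financial_authority_usd": 0,
        "regulatory_flags": ["gdpr_article_22"],
        "status": "active",
    },
    # An incomplete entry of the kind that tends to surface during incidents.
    {"agent_id": "a-2", "regulatory_flags": [], "status": "active"},
]

gdpr_agents = agents_under_regulation(registry, "gdpr_article_22")
gaps = missing_fields(registry[1])
```

Running the completeness check on every write to the registry is one way to keep the commonly omitted fields from being omitted.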
Lifecycle States
Every agent should have an explicit lifecycle state, and there should be defined transitions between states:
draft → reviewed → approved → deployed → monitored → suspended → retired
           ↑                                            ↑
      (return for                                 (suspend on
       revision)                                incident/review)
draft: Being built. Not accessible to customers or other systems. Can be modified freely.
reviewed: Technical review complete. Security and data access review complete. Policy review complete. Awaiting governance approval.
approved: Approved for deployment by the appropriate authority (based on risk tier). Deployment can proceed.
deployed: Running in production but within an initial monitoring period (typically 30 days). Enhanced logging enabled. Daily review of anomalies required.
monitored: Standard operational status. Normal monitoring and evaluation cadence applies.
suspended: Temporarily offline due to incident, policy violation, evaluation failure, or required review. Specific resolution criteria documented before suspension.
retired: Permanently decommissioned. Registry entry retained for audit history. All integrations and data access revoked.
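The lifecycle can be enforced as a small state machine. In the sketch below, the forward path and the return-for-revision step follow the states described above; the remaining edges (suspending directly from deployed, returning from suspended to monitored, retiring from monitored) are assumptions that an organization would adjust to its own process.

```python
# Allowed lifecycle transitions. Edges beyond the linear path are assumptions.
ALLOWED_TRANSITIONS = {
    "draft": {"reviewed"},
    "reviewed": {"approved", "draft"},       # draft = returned for revision
    "approved": {"deployed"},
    "deployed": {"monitored", "suspended"},
    "monitored": {"suspended", "retired"},
    "suspended": {"monitored", "retired"},   # monitored = return to service
    "retired": set(),                        # terminal state
}

class InvalidTransition(ValueError):
    """Raised when a requested lifecycle change is not permitted."""

def transition(current, target):
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise InvalidTransition(f"{current} -> {target} is not permitted")
    return target

state = "draft"
for step in ("reviewed", "approved", "deployed", "monitored"):
    state = transition(state, step)
```

Rejecting a jump such as draft to deployed at the registry level means the review and approval steps cannot be skipped by accident.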
Mandatory Review Triggers
Not all reviews happen on schedule. Certain events should automatically trigger an out-of-band review regardless of when the last review occurred:
- Model version change: The underlying LLM was updated. Behavior may have changed in ways not captured by the previous evaluation.
- Capability scope change: New data sources, new tools, or expanded action authority. The previous risk assessment is no longer valid.
- Data access change: The agent now has access to data it did not previously have access to, or data it accesses has been reclassified to a higher sensitivity tier.
- Evaluation score drop greater than 20 points: Significant behavioral degradation detected. Something changed.
- Owner change: The person accountable for this agent is no longer in the role. New owner must formally accept accountability.
- Integration change: A system the agent integrates with has been updated, migrated, or had its data model changed.
- Regulatory change: A new regulation applies, or the interpretation of an existing regulation has changed in a way that affects this agent.
- Incident involving this agent: Any P1 or P0 incident involving this agent triggers a review before the agent returns to monitored status.
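Several of these triggers can be detected mechanically by comparing two snapshots of an agent's registry entry. The sketch below covers four of them (model version, capability scope, owner, and evaluation score drop); the function name and trigger labels are hypothetical, and the remaining triggers would need signals from outside the registry.

```python
def review_triggers(previous, current, score_drop_threshold=20):
    """Compare two registry snapshots and list the review triggers that fired."""
    triggers = []
    if previous["model_version"] != current["model_version"]:
        triggers.append("model_version_change")
    if set(previous["capability_scope"]) != set(current["capability_scope"]):
        triggers.append("capability_scope_change")
    if previous["owner_email"] != current["owner_email"]:
        triggers.append("owner_change")
    # Only a drop greater than the threshold counts, matching the rule above.
    if previous["last_eval_score"] - current["last_eval_score"] > score_drop_threshold:
        triggers.append("evaluation_score_drop")
    return triggers

before = {
    "model_version": "claude-3-5-sonnet-20241022",
    "capability_scope": ["read:order_history"],
    "owner_email": "jane.doe@company.com",
    "last_eval_score": 87.3,
}
# Same agent after a scope expansion and a sharp evaluation drop.
after = dict(before, capability_scope=["read:order_history", "write:refund_request_status"],
             last_eval_score=62.0)

fired = review_triggers(before, after)
```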
Component 2: The RACI Matrix
Once you have a registry, you need to document who is responsible for what at each stage of the agent lifecycle. The RACI matrix is the standard tool for this.
R = Responsible (does the work)
A = Accountable (owns the outcome; only one A per row)
C = Consulted (input required before action)
I = Informed (notified after action)
Agent Lifecycle RACI
| Activity | Agent Owner | Platform Team | Security | Legal/Compliance | Governance Board | Exec Sponsor |
|---|---|---|---|---|---|---|
| Draft agent specification | R/A | C | C | C | I | I |
| Technical review | R | R | A | C | I | I |
| Security and data access review | C | R | A | C | I | I |
| Policy review | C | C | C | A | C | I |
| Risk tier classification | R | C | C | A | I | I |
| Governance approval (low risk) | R | C | C | C | A | I |
| Governance approval (medium risk) | R | C | C | C | A | I |
| Governance approval (high risk) | C | C | C | C | R | A |
| Deployment to production | A | R | I | I | I | I |
| Initial monitoring period oversight | A | R | I | I | I | I |
| Ongoing evaluation execution | R/A | C | I | I | I | I |
| Anomaly escalation | A | R | C | I | I | I |
| P3 incident response | R/A | C | I | I | I | I |
| P2 incident response | A | R | C | C | I | I |
| P1 incident response | A | R | R | C | I | I |
| P0 incident response | C | R | R | R | A | I |
| Fleet-wide pause decision | C | C | C | C | A | R |
| Suspension decision | C | A | C | C | R | I |
| Post-incident review | A | R | C | C | I | I |
| Policy update (minor) | C | R | C | A | I | I |
| Policy update (major) | C | C | C | C | A | R |
| Retirement | A | R | I | I | I | I |
| Retirement audit | C | R | I | A | I | I |
Incident Response RACI (Detail)
Incident response deserves its own more detailed RACI because the stress of an incident is precisely when accountability structures fail if they are not explicit.
| Action | On-Call Engineer | Agent Owner | Platform Lead | Security | Communications | Governance |
|---|---|---|---|---|---|---|
| Detect and triage | R/A | I | I | I | I | I |
| Classify severity | R | C | A | C | I | I |
| Notify stakeholders | R | I | A | I | I | C |
| Contain (suspend agent) | A | C | R | C | I | I |
| Assess blast radius | R | C | A | C | I | I |
| Communicate to affected parties | I | C | C | I | R/A | I |
| Root cause analysis | R | R | A | C | I | I |
| Draft post-incident review | A | R | C | I | I | I |
| Approve remediation plan | C | A | C | C | I | R |
| Implement remediation | R | A | C | C | I | I |
| Approve return to service | C | A | C | C | I | R |
| Update pact/policy | A | R | C | C | I | I |
| Publish incident summary | I | C | A | I | I | R |
The most important thing about this table is not the specific assignments; your organization will adjust these based on your structure. What matters is that every cell is filled before you have an incident, every role is staffed, and every person in each role knows they are in it.
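Because a RACI matrix tends to drift as it is edited, it is worth checking mechanically. The sketch below assumes the matrix is held as a mapping from activity to a list of cell values, and flags any row that lacks exactly one A or has no R at all. The sample rows are invented for the example.

```python
def raci_problems(matrix):
    """Return a list of rows that break the one-A rule or have no R."""
    problems = []
    for activity, cells in matrix.items():
        # A cell such as "R/A" carries two codes; split it before counting.
        codes = [code.strip() for cell in cells for code in cell.split("/")]
        accountable = codes.count("A")
        if accountable != 1:
            problems.append(f"{activity}: expected exactly one A, found {accountable}")
        if "R" not in codes:
            problems.append(f"{activity}: no Responsible role assigned")
    return problems

# Hypothetical rows: the first is well formed, the other two are not.
sample = {
    "Deploy to production": ["A", "R", "I", "I"],
    "Run evaluations": ["R", "C", "I", "I"],
    "Notify customers": ["I", "C", "A", "A"],
}

issues = raci_problems(sample)
```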
Component 3: Three-Tier Governance Structure
Agent governance needs to operate at three different timescales with three different types of decisions. Organizations that try to handle everything at one level, usually either too operational (daily firefighting) or too strategic (quarterly board updates), fail to catch problems at the right moment.
Tier 1: Operational Governance (Daily, Team Level)
Purpose: Keep the fleet healthy. Catch problems before they become incidents.
Who is involved: Agent owners, on-call rotation, platform team ops function.
Meetings: Brief daily standup (15 minutes) for teams with high-frequency agents. Async for teams with lower-activity agents.
Decisions made at this tier:
- Anomaly review and initial triage
- Short-term suspension decisions (within defined criteria)
- Evaluation run scheduling
- Routine maintenance operations (config updates, monitoring threshold adjustments)
- First-line incident response for P2 and below
Artifacts produced:
- Daily anomaly log entries
- Incident tickets (as needed)
- Evaluation run results
- Daily cost/usage summary
Escalation criteria to Tier 2:
- Any P1 or P0 incident
- Anomalies exceeding defined thresholds for more than 24 hours
- Evaluation score drop exceeding 20 points
- Any situation requiring policy interpretation
- Budget variance exceeding 20% from baseline
Tier 2: Tactical Governance (Weekly, Program Level)
Purpose: Maintain fleet health trends. Make resourcing and policy decisions. Review the operating picture with enough context to see patterns.
Who is involved: Platform team lead, governance engineers, security representative, compliance representative, domain team leads (rotating).
Meetings: Weekly governance review (60-90 minutes).
Decisions made at this tier:
- New agent approvals (low and medium risk)
- Evaluation cadence adjustments
- Policy clarification and minor updates
- Resource allocation for platform team
- Root cause review for closed incidents
- Fleet health metric review and threshold adjustments
- Agent suspension approvals beyond first-line criteria
Artifacts produced:
- Weekly fleet health report
- New agent approval decisions
- Policy update log
- Incident closure summaries
- Budget variance report
Escalation criteria to Tier 3:
- Policy changes with organization-wide impact
- High-risk agent approvals
- Budget decisions above delegated authority
- Regulatory findings requiring executive attention
- Any pattern suggesting systemic fleet risk
- Fleet-wide pause decisions
Tier 3: Strategic Governance (Monthly + Quarterly, Executive/Board Level)
Purpose: Ensure the agent program remains aligned with organizational strategy, risk appetite, and regulatory requirements. Make decisions that require executive authority.
Who is involved: CISO or equivalent, General Counsel, CFO (for budget decisions), CTO/CPO, board-level audit committee (quarterly).
Meetings: Monthly executive review (30-45 minutes). Quarterly board audit committee presentation.
Decisions made at this tier:
- Risk appetite statements for AI agent program
- High-risk agent approvals
- Fleet-wide pause authorization
- Major policy changes
- Regulatory response strategy
- Budget authority increases
- External audit engagement and response
- Strategic direction for agent program expansion or contraction
Artifacts produced:
- Monthly executive dashboard
- Quarterly board presentation on AI agent risk posture
- Annual AI governance report
- Regulatory correspondence and responses
Component 4: Policy Hierarchy
Without a policy hierarchy, agents operate under either no policy (dangerous) or an inconsistent patchwork of team-specific rules (also dangerous, for different reasons). The policy hierarchy solves both problems.
The Four-Level Structure
Level 1: Platform Policy
Set by: The organization's AI governance board or equivalent.
Scope: All agents, everywhere, always.
Examples:
- No agent may take irreversible financial actions above $[threshold] without human approval
- All agents handling PII must log access to the audit trail
- No agent may modify its own behavioral constraints or access permissions
- All agents must honor kill switch commands within [N] seconds
- Agents may not represent themselves as human to customers
Override authority: None. These are absolute constraints. If a team needs an exception, they escalate to the governance board, which decides whether to modify the platform policy itself rather than grant an individual exception.
Level 2: Organizational Policy
Set by: Business unit, division, or functional group (e.g., Customer Experience, Finance, Engineering).
Scope: All agents operated by that organizational unit.
Examples (for a Customer Experience org):
- Agents may not offer refunds exceeding $250 without supervisor approval
- All customer-facing agents must acknowledge inability to resolve and offer human escalation within 3 turns
- Agents may not retain personally identifiable information beyond the session unless explicitly consented
Override authority: Can be more restrictive than platform policy. Cannot be more permissive than platform policy.
Level 3: Team Policy
Set by: Individual product teams or operational teams.
Scope: All agents operated by that team.
Examples (for a specific product team's customer support agents):
- Agents operating in the returns flow may not initiate shipping label creation
- Agents may reference pricing information from the current catalog only
- Escalation to human review is mandatory for orders placed more than 90 days ago
Override authority: Can be more restrictive than org policy. Cannot be more permissive than org policy.
Level 4: Agent Policy (Behavioral Pact)
Set by: Agent owner, reviewed by governance.
Scope: This specific agent.
Examples:
- This agent's outputs must be within the scope of: refund eligibility assessment. Any request outside this scope receives a standard decline-and-redirect response.
- This agent will not reference competitor products under any circumstances.
- If asked about account balances, this agent will always redirect to the balance inquiry agent rather than attempting to answer from context.
Override authority: Can be more restrictive than team policy. Cannot be more permissive than team policy.
Inheritance and Override Rules
The inheritance model is simple: child levels inherit all constraints from parent levels and may only add additional constraints, never remove them.
This means that if you want to understand what an individual agent can and cannot do, you read its behavioral pact (Level 4), then overlay the team policy (Level 3), then the org policy (Level 2), then the platform policy (Level 1). The effective constraint is the union of all four.
For tooling purposes, this means your registry should resolve the effective policy for any agent on demand: a query that returns the complete merged constraint set, with each rule tagged by the level it came from and whether it is inheritable or terminal.
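A minimal sketch of that resolution step follows. It assumes each rule carries an identifier and the level that set it; the `Rule` type, the rule ids, and the choice to reject a lower-level rule that reuses a parent's id are all assumptions for illustration, and the inheritable/terminal tag is left out for brevity.

```python
from dataclasses import dataclass

LEVEL_NAMES = {1: "platform", 2: "organization", 3: "team", 4: "agent"}

@dataclass(frozen=True)
class Rule:
    rule_id: str
    text: str
    level: int

def effective_policy(*levels):
    """Merge rule lists ordered platform -> agent into one tagged constraint set."""
    merged = {}
    for rules in levels:
        for rule in rules:
            if rule.rule_id in merged:
                # A child level may add constraints but never replace a parent's.
                owner = LEVEL_NAMES[merged[rule.rule_id].level]
                raise ValueError(f"{rule.rule_id} is already set at the {owner} level")
            merged[rule.rule_id] = rule
    return [(LEVEL_NAMES[r.level], r.rule_id, r.text) for r in merged.values()]

def effective_limit(limits_by_level):
    """For numeric caps, the most restrictive (lowest) value across levels wins."""
    return min(limits_by_level.values())

platform = [Rule("no-self-modify", "No agent may modify its own constraints", 1)]
org = [Rule("refund-cap", "Refunds above $250 need supervisor approval", 2)]
team = [Rule("no-labels", "Returns-flow agents may not create shipping labels", 3)]
agent = [Rule("scope", "Outputs limited to refund eligibility assessment", 4)]

policy = effective_policy(platform, org, team, agent)
```

Each tuple in `policy` names the level a rule came from, which is what an auditor usually asks for first.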
Policy Change Management
Policy changes require versioned documentation and impact assessment:
- Propose change: Owner submits policy change request with: current text, proposed text, reason, impacted agents, expected impact on agent behavior.
- Impact analysis: Platform team identifies all agents affected by the change (this requires the registry to be queryable).
- Review: Changes at Level 1-2 require Tier 3 approval. Level 3 requires Tier 2 approval. Level 4 requires Tier 1 approval with periodic Tier 2 review.
- Testing: For changes affecting more than 10 agents, staged rollout with evaluation comparison before and after.
- Communication: All affected agent owners notified before change takes effect.
- Archive: Previous policy version retained indefinitely for audit purposes.
Component 5: Budget Authority and Financial Controls
One of the most frequently neglected aspects of agent governance is financial authority. When an agent can execute transactions, make commitments, or influence purchasing decisions, it is exercising financial authority on behalf of the organization. That authority needs to be explicitly defined and controlled.
Agent Financial Tier Classification
Tier 0: Read-Only Agents
Capabilities: Data retrieval, summarization, analysis, recommendations.
Financial authority: $0. No ability to initiate transactions or commitments.
Examples: Research agents, summarization agents, analytics agents.
Approval requirement: Tier 1 (team level) sufficient.
Tier 1: Low-Value Transactional Agents
Capabilities: Can execute individual transactions up to a defined limit.
Financial authority: $0.01 to $[org-defined threshold, typically $100-$500].
Examples: Agents that can issue small refunds, apply discount codes, process routine small transactions.
Approval requirement: Tier 2 (program level). Financial controls review required.
Monitoring: Daily transaction volume and value monitoring. Alert if aggregate daily spend exceeds 3x baseline.
Tier 2: Medium-Value Transactional Agents
Capabilities: Can execute transactions or make commitments in the medium-value range.
Financial authority: $[Tier 1 limit] to $[org-defined threshold, typically $1,000-$10,000].
Examples: Procurement agents with PO authority, agents managing subscriptions or service contracts.
Approval requirement: Tier 3 (executive level). CFO signoff required.
Monitoring: Real-time transaction monitoring. Mandatory human review for any single transaction above [org-defined threshold, typically $5,000].
Tier 3: High-Value Agents
Capabilities: Can influence or initiate high-value financial decisions.
Financial authority: Above $[Tier 2 limit].
Examples: Strategic sourcing agents, agents managing large contract renewals.
Approval requirement: Executive approval plus legal review.
Monitoring: Every transaction reviewed by human before execution. Escrow requirement for all commitments.
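The tiering above maps naturally onto the registry's `financial_authority_usd` field. In the sketch below the two thresholds are placeholder values picked from the typical ranges mentioned above; each organization sets its own, and the function names are illustrative.

```python
# Placeholder thresholds; the source leaves these org-defined.
TIER_1_LIMIT_USD = 500      # typical range given above: $100-$500
TIER_2_LIMIT_USD = 10_000   # typical range given above: $1,000-$10,000

def financial_tier(financial_authority_usd):
    """Map an agent's financial authority to its financial tier (0-3)."""
    if financial_authority_usd < 0:
        raise ValueError("financial authority cannot be negative")
    if financial_authority_usd == 0:
        return 0  # read-only
    if financial_authority_usd <= TIER_1_LIMIT_USD:
        return 1
    if financial_authority_usd <= TIER_2_LIMIT_USD:
        return 2
    return 3

def daily_spend_alert(today_usd, baseline_usd, multiple=3):
    """Tier 1 monitoring rule: alert when daily spend exceeds 3x baseline."""
    return today_usd > multiple * baseline_usd

tiers = [financial_tier(v) for v in (0, 250, 5_000, 50_000)]
```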
Budget Delegation Matrix
| Decision | Agent Owner | Team Lead | Department Head | CFO | Board |
|---|---|---|---|---|---|
| Approve Tier 0 agent | ✓ | | | | |
| Approve Tier 1 agent | | ✓ | | | |
| Approve Tier 2 agent | | | ✓ | ✓ | |
| Approve Tier 3 agent | | | | ✓ | ✓ |
| Increase Tier 1 limit | | ✓ | | | |
| Increase Tier 2 limit | | | ✓ | ✓ | |
| Authorize fleet-wide spend increase | | | | ✓ | |
| Approve agent program annual budget | | | | ✓ | ✓ |
| Emergency financial freeze | | | | ✓ | |
Agent Cost Accounting
Every agent should have a monthly cost account that tracks:
- Compute costs: Infrastructure cost for running the agent (whether dedicated or shared)
- LLM inference costs: API calls to model providers, attributed per agent via request tagging
- External service call costs: Third-party APIs, data services, tools
- Human escalation costs: Time cost of human reviews triggered by the agent (either actual or estimated)
- Incident costs: Remediation, communication, and review costs attributed to agent incidents
Total monthly cost per agent should be tracked and reviewed at the weekly Tier 2 governance meeting. Agents with rising costs but stable output should trigger an efficiency review. Agents where incident costs are a significant fraction of operation costs warrant a fundamental rethink.
Bond and Escrow Requirements
For high-stakes agents operating in contexts where mistakes have significant financial consequences, consider requiring financial bonds: reserved capital that covers the expected loss from a single agent failure event at the agent's maximum transaction value, including remediation.
The bond calculation:
bond_requirement = max_single_transaction_value × failure_probability × remediation_multiplier
For an agent with $10,000 maximum transaction authority, a 0.1% estimated failure probability, and a 3x remediation cost multiplier:
bond = $10,000 × 0.001 × 3 = $30
This is a modest requirement. But for an agent with $1,000,000 maximum transaction authority and the same failure probability:
bond = $1,000,000 × 0.001 × 3 = $3,000
Bond requirements create a direct incentive for teams to operate high-stakes agents at demonstrated high quality. The better the evaluation score, the lower the estimated failure probability, and the lower the bond requirement.
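The bond formula is simple enough to compute directly. The sketch below reproduces the two worked figures, using `Decimal` so currency amounts are not subject to floating-point rounding. The function name is illustrative, and how a failure probability is derived from an evaluation score is left open here because the text does not specify it.

```python
from decimal import Decimal

def bond_requirement(max_transaction_usd, failure_probability, remediation_multiplier):
    """bond = max single transaction value x failure probability x remediation multiplier."""
    return (
        Decimal(str(max_transaction_usd))
        * Decimal(str(failure_probability))
        * Decimal(str(remediation_multiplier))
    )

small_bond = bond_requirement(10_000, "0.001", 3)      # $10,000 authority
large_bond = bond_requirement(1_000_000, "0.001", 3)   # $1,000,000 authority
```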
Component 6: Incident Management at Fleet Scale
Incidents happen. The question is not whether they will but whether your organization is prepared to respond effectively when they do. At fleet scale, incident management requires explicit structure.
Severity Classification
P0: Active Harm
Definition: The agent is currently causing harm to users, customers, or third parties, or is actively operating outside its authorized scope in a way that creates legal or financial liability.
Examples:
- Agent made a financial commitment it was not authorized to make, and the counterparty is relying on it
- Agent shared confidential information with an unauthorized party
- Agent is producing harmful, illegal, or significantly false content to users
- Agent is executing actions that are modifying data it should not have write access to
Required actions:
- Immediate suspension of the agent (target: within 5 minutes of classification)
- Immediate notification of agent owner, platform lead, security, and legal
- Escalation to Tier 3 within 30 minutes if financial or legal exposure is confirmed
- Customer/user communication plan within 2 hours if users are affected
- No return to service without Tier 3 approval
P1: Policy Violation
Definition: The agent violated a documented policy, but no active harm is currently occurring or has been confirmed.
Examples:
- Agent operated outside its defined capability scope
- Agent accessed data at a higher sensitivity tier than authorized
- Agent bypassed a required human review step
- Agent made a representation that contradicts current policy
Required actions:
- Suspend agent within 1 hour of classification
- Notify agent owner and platform lead within 2 hours
- Security review within 24 hours
- Root cause analysis within 72 hours
- Remediation plan and review before return to service
P2: Performance Degradation
Definition: Agent is operating but at significantly below-expected quality levels, creating risk of user harm or loss of trust without confirmed active harm.
Examples:
- Evaluation score dropped more than 20 points from baseline
- Error rate exceeded threshold for 24+ hours
- Response accuracy fell below acceptable threshold
- Latency exceeded SLA for extended period
Required actions:
- Notify agent owner within 4 hours
- Evaluation re-run within 24 hours
- If root cause not identified within 48 hours, escalate to P1
- Document in weekly governance review
P3: Anomaly Detected
Definition: Something unusual was detected that does not yet meet the threshold for P2 but warrants monitoring and investigation.
Examples:
- Unusual usage patterns (time of day, volume spikes)
- Unexpected input distribution changes
- Minor accuracy fluctuations within acceptable range but trending
- New types of user requests appearing that were not anticipated
Required actions:
- Log in anomaly tracking system
- Assign to agent owner for investigation within 5 business days
- Review at next Tier 2 governance meeting
- Escalate to P2 if pattern continues for 7 days
Escalation Paths
Anomaly Detected (P3)
- If unresolved after 7 days → P2
- If blast radius confirmed → P1
Performance Degraded (P2)
- If root cause requires policy change → P1
- If evidence of data exposure → P0
Policy Violation (P1)
- If active harm confirmed → P0
- If financial/legal exposure → P0
- If widespread (affecting 10+ users) → P0
Active Harm (P0)
- Immediate escalation to governance board and executive sponsor
- Legal and communications teams activated
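The escalation paths above can be encoded as a small rule function so that reclassification is consistent across on-call engineers. In this sketch the condition names passed in `facts` are hypothetical labels for the conditions listed above, and where two conditions apply the more severe outcome is checked first, which is an assumption about precedence.

```python
def escalated_severity(severity, facts):
    """Apply the escalation paths once; `facts` is a set of observed conditions."""
    if severity == "P3":
        if "blast_radius_confirmed" in facts:
            return "P1"
        if "unresolved_7_days" in facts:
            return "P2"
    elif severity == "P2":
        if "data_exposure_evidence" in facts:
            return "P0"
        if "policy_change_required" in facts:
            return "P1"
    elif severity == "P1":
        p0_conditions = {
            "active_harm_confirmed",
            "financial_or_legal_exposure",
            "affects_10_plus_users",
        }
        if facts & p0_conditions:
            return "P0"
    return severity  # P0, or no escalation condition met

new_level = escalated_severity("P2", {"data_exposure_evidence"})
```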
Fleet-Wide Pause Capability
Every agent fleet operating beyond 20 agents needs a tested, functional fleet-wide pause capability. This is not optional. The need will arise (a model provider security incident, a discovered shared vulnerability, a regulatory directive), and the ability to suspend all agents simultaneously with a single authorized command is the difference between a contained response and a chaotic one.
Fleet-wide pause requirements:
- Single command: One command or one API call suspends all agents simultaneously. No agent-by-agent manual process.
- Authorization: Only authorized roles (defined in RACI) can issue the command. The authorization is logged.
- Confirmation: The system confirms how many agents were suspended and at what time.
- Partial pause: Ability to pause by team, by risk tier, by data access level, or by integration (e.g., "pause all agents with access to the payments system").
- Rollback plan: A documented procedure for restoring agents to service, in a defined priority order, with criteria that must be met before each tier of restoration.
- Tested: The pause capability is tested in staging monthly, and in production at least annually (scheduled maintenance window, not a real incident).
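To make the requirements above concrete, here is a hedged sketch of what a pause command might look like in code. Everything named here is an assumption for illustration: the `Agent` fields, the `AUTHORIZED_ROLES` set (which in practice would come from your RACI), and the in-memory `audit_log`. A real implementation would call whatever orchestration layer actually suspends agents.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass
class Agent:
    name: str
    team: str
    risk_tier: str
    integrations: frozenset
    suspended: bool = False


# Assumption: role names are placeholders; take the real list from the RACI.
AUTHORIZED_ROLES = {"governance_board", "executive_sponsor"}


def pause_fleet(
    agents: list,
    issued_by: str,
    role: str,
    selector: Optional[Callable[[Agent], bool]] = None,
    audit_log: Optional[list] = None,
) -> dict:
    """Suspend all agents, or only those matching `selector` (partial pause)."""
    if role not in AUTHORIZED_ROLES:
        raise PermissionError(f"role {role!r} may not pause the fleet")

    targets = [a for a in agents if selector is None or selector(a)]
    for agent in targets:
        agent.suspended = True

    # Confirmation record: who issued the command, how many agents, and when.
    record = {
        "issued_by": issued_by,
        "role": role,
        "suspended_count": len(targets),
        "suspended_at": datetime.now(timezone.utc).isoformat(),
        "agents": [a.name for a in targets],
    }
    if audit_log is not None:
        audit_log.append(record)
    return record
```

A partial pause such as "all agents with access to the payments system" then becomes `pause_fleet(fleet, "alice", "governance_board", selector=lambda a: "payments" in a.integrations)`.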
Post-Incident Review Protocol
Every P1 and P0 incident requires a post-incident review (PIR). The PIR format should be blameless (focused on systemic factors, not individual errors) and should produce actionable outputs.
Standard PIR structure:
1. Incident Summary
- What happened (factual, timeline)
- Who was affected
- Business impact (quantified where possible)
2. Timeline
- First indicator observed: [timestamp]
- First alert triggered: [timestamp]
- Incident classified: [timestamp]
- Agent suspended: [timestamp]
- Root cause identified: [timestamp]
- Incident closed: [timestamp]
- Time to detect (TTD): [calculate]
- Time to contain (TTC): [calculate]
3. Root Cause Analysis
- Five-whys analysis
- Contributing factors
- What mechanisms existed to prevent this, and why did they fail?
4. What Worked
- Detection mechanisms that functioned as intended
- Response actions that were effective
5. What Did Not Work
- Detection gaps
- Response delays or failures
- Communication breakdowns
6. Action Items
- Specific remediation actions, each with: description, owner, due date
- Registry updates required
- Policy updates required
- Pact updates required
- Evaluation updates required
- Monitoring improvements required
7. Return-to-Service Criteria
- Specific measurable conditions that must be met before the agent is returned to service
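The two "[calculate]" fields in the timeline can be derived mechanically from the recorded timestamps. The sketch below assumes ISO 8601 timestamps and hypothetical key names. It measures time to detect from the first observed indicator to classification, and time to contain from classification to suspension, which is one reasonable reading of the definitions used in this post rather than the only one.

```python
from datetime import datetime


def pir_intervals(timeline: dict) -> dict:
    """Compute TTD and TTC from a PIR timeline of ISO 8601 timestamp strings."""
    t = {key: datetime.fromisoformat(value) for key, value in timeline.items()}
    return {
        # Assumption: the first observed indicator approximates incident start.
        "time_to_detect": t["incident_classified"] - t["first_indicator_observed"],
        "time_to_contain": t["agent_suspended"] - t["incident_classified"],
    }


example = pir_intervals({
    "first_indicator_observed": "2025-03-04T09:00:00",
    "incident_classified": "2025-03-04T09:40:00",
    "agent_suspended": "2025-03-04T10:05:00",
})
```

Both values are `timedelta` objects, so they can be averaged across incidents to feed fleet-level time-to-X metrics.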
Component 7: Audit and Compliance Cadence
Compliance is not a moment; it is a continuous process. The audit cadence defines how that continuous process is structured.
Daily Automated Checks
At fleet scale, humans cannot review everything. Automated checks run continuously and surface anomalies for human attention:
Evaluation drift monitoring: Continuous comparison of live agent behavior against the behavioral baseline established at last evaluation. Drift beyond defined thresholds triggers an alert.
Data access anomaly detection: Logs of every data access are compared against expected patterns. Access outside normal hours, access to unusually large volumes of records, and access patterns inconsistent with the agent's stated purpose all trigger alerts.
Cost anomaly detection: Daily API call volumes and infrastructure costs are compared against baseline. Spikes indicate either unusual usage (worth investigating) or possible abuse.
Capability scope enforcement: Where tooling allows (e.g., via fine-grained permission systems), automated enforcement of capability scope. Attempts by agents to access tools or data outside their scope are blocked and logged.
Integration health: Monitoring of all integrations used by agents. Downstream system degradation should be detected before it causes agent behavioral anomalies.
Evaluation freshness alerts: Automated alerts when any agent's last evaluation date exceeds the required review cycle. No agent should be running in production with a stale evaluation.
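As an illustration of how small these checks can be, here is a sketch of the evaluation freshness alert. The review cycle lengths per risk tier and the registry field names are hypothetical; substitute the cycles your own policy defines.

```python
from datetime import date, timedelta

# Assumption: illustrative review cycles, not values prescribed by this post.
REVIEW_CYCLE_DAYS = {"high": 30, "medium": 90, "low": 180}


def stale_agents(registry: list, today: date) -> list:
    """Return names of agents whose last evaluation is older than their cycle."""
    stale = []
    for entry in registry:
        limit = timedelta(days=REVIEW_CYCLE_DAYS[entry["risk_tier"]])
        if today - entry["last_evaluated"] > limit:
            stale.append(entry["name"])
    return stale


registry = [
    {"name": "cx-refund-eligibility-prod", "risk_tier": "high",
     "last_evaluated": date(2025, 4, 1)},
    {"name": "hr-faq-prod", "risk_tier": "low",
     "last_evaluated": date(2025, 1, 1)},
]
alerts = stale_agents(registry, today=date(2025, 6, 1))
```

In this example only the high-risk agent is flagged: 61 days have passed against a 30-day cycle, while the low-risk agent is at 151 days against 180.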
Weekly Governance Review
The weekly Tier 2 governance review should be data-driven, with a consistent dashboard reviewed at every meeting:
| Metric | This Week | Last Week | 4-Week Average | Alert Threshold |
|---|---|---|---|---|
| Total active agents | [n] | [n-1] | [avg] | n/a |
| New agents deployed | [n] | [n-1] | [avg] | n/a |
| Agents with stale evaluations | [n] | [n-1] | [avg] | >5% of fleet |
| Evaluation pass rate | [%] | [%] | [avg] | <90% |
| Mean eval score (fleet) | [score] | [score] | [avg] | <75 |
| P0 incidents | [n] | [n-1] | [avg] | Any |
| P1 incidents | [n] | [n-1] | [avg] | >2 |
| P2 incidents | [n] | [n-1] | [avg] | >5 |
| Open anomalies (>48h) | [n] | [n-1] | [avg] | >10 |
| Agents pending approval | [n] | [n-1] | [avg] | >15 |
| Fleet weekly cost | [$] | [$] | [avg] | >110% avg |
| Agents with active pacts | [%] | [%] | [avg] | <95% |
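The alert thresholds in the table can be evaluated automatically before each meeting, so the review starts from the breaches rather than from a scan of every row. This sketch uses hypothetical dictionary keys for the weekly figures, expresses rates as fractions (0.93 rather than 93%), and assumes a non-empty fleet.

```python
def dashboard_alerts(week: dict, four_week_avg_cost: float) -> list:
    """Return the names of dashboard metrics that breach their alert threshold."""
    checks = [
        ("stale evaluations", week["stale_evals"] / week["active_agents"] > 0.05),
        ("evaluation pass rate", week["eval_pass_rate"] < 0.90),
        ("mean eval score", week["mean_eval_score"] < 75),
        ("P0 incidents", week["p0"] > 0),
        ("P1 incidents", week["p1"] > 2),
        ("P2 incidents", week["p2"] > 5),
        ("open anomalies (>48h)", week["open_anomalies"] > 10),
        ("agents pending approval", week["pending_approval"] > 15),
        ("fleet weekly cost", week["weekly_cost"] > 1.10 * four_week_avg_cost),
        ("agents with active pacts", week["pact_coverage"] < 0.95),
    ]
    return [name for name, breached in checks if breached]


this_week = {
    "active_agents": 100, "stale_evals": 7, "eval_pass_rate": 0.93,
    "mean_eval_score": 81, "p0": 0, "p1": 3, "p2": 1, "open_anomalies": 4,
    "pending_approval": 2, "weekly_cost": 12000, "pact_coverage": 0.97,
}
breaches = dashboard_alerts(this_week, four_week_avg_cost=10000)
```

Here three metrics breach: stale evaluations (7% of the fleet), P1 incidents (3), and weekly cost (120% of the four-week average).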
Monthly Fleet Audit
Once per month, a systematic audit of the full fleet:
Registry completeness check: Every active agent in production has a registry entry. Every registry entry has all required fields populated. Owners are confirmed current. Escalation contacts are confirmed reachable.
Evaluation currency check: Every active agent has been evaluated within its required review cycle. Stale evaluations are scheduled immediately.
Policy compliance check: Every active agent has a behavioral pact that is current and consistent with the current policy hierarchy.
Access rights review: Every agent's access permissions are reviewed against its current capability scope. Permissions that are no longer needed are revoked. Permissions that have been granted but not documented are investigated.
Integration inventory review: Every integration used by any agent is confirmed as active, current, and operating as expected. Stale integrations are decommissioned.
Budget review: Monthly agent cost accounting is reviewed. Agents with unexpected cost profiles are investigated.
Open items review: All open action items from previous PIRs are reviewed. Overdue items are escalated.
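Several of these audit steps reduce to set comparisons. As one example, the access rights review can be sketched as follows, assuming each agent's permissions are available as three sets: what is actually granted, what the registry documents, and what the current capability scope needs. The permission strings are made up for illustration.

```python
def review_access(granted: set, documented: set, needed: set) -> dict:
    """Split an agent's granted permissions into audit follow-up buckets."""
    return {
        # Granted but never documented: investigate how they were obtained.
        "investigate": sorted(granted - documented),
        # Documented and granted, but no longer within scope: revoke.
        "revoke": sorted((granted & documented) - needed),
    }


result = review_access(
    granted={"crm:read", "payments:write", "hr:read"},
    documented={"crm:read", "payments:write"},
    needed={"crm:read"},
)
```

In this example `hr:read` is flagged for investigation and `payments:write` for revocation.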
Quarterly External Audit Readiness
Every quarter, the governance team should run through a mock external audit exercise:
Regulatory mapping update: Which regulations apply to the current fleet? Has the regulatory landscape changed in the past quarter? Are there new requirements that affect current operations?
Documentation review: Is all governance documentation current, consistent, and accurate? Could an external auditor understand the governance structure from the documentation alone?
Evidence package preparation: Could you produce, within 48 hours, the full documentation package required by a regulatory request? This includes: registry exports, evaluation records, policy history, incident log, RACI documentation, budget authority records.
Control effectiveness assessment: For each documented control, is there evidence that the control is actually operating? A control that exists on paper but not in practice is a liability, not an asset.
EU AI Act Compliance Checkpoints
For organizations deploying agents that qualify as high-risk under the EU AI Act:
Article 9: Risk Management System
Required evidence:
- Documented risk identification process
- Risk assessment records for each high-risk agent
- Residual risk acceptance records
- Evidence that risk management is continuous (not one-time)
Article 13: Transparency and Provision of Information
Required evidence:
- Instructions for use documentation for each high-risk agent
- Records of what information was provided to users/operators
- Evidence that intended purpose, limitations, and accuracy levels were disclosed
Article 14: Human Oversight
Required evidence:
- Documented human oversight mechanisms
- Evidence that humans can monitor system behavior
- Records of human override/intervention capability
- Evidence that halt capability exists and is tested
Article 17: Quality Management System
Required evidence:
- Documented QMS covering design, development, and deployment
- Records of conformity assessment
- Evidence of post-market monitoring
- Incident reporting records
Component 8: Center of Excellence Structure
At fleet scale, governance does not happen by itself. It requires dedicated roles, funded teams, and clear charters. The Center of Excellence (CoE) is the organizational structure that makes sustainable governance possible.
CoE Organization
Platform Team (centralized function)
Charter: Own the registry, tooling, evaluation infrastructure, shared evaluation frameworks, and platform-level policy. Enable domain teams to operate effectively without duplicating infrastructure.
Core responsibilities:
- Agent registry design and maintenance
- Evaluation tooling and shared evaluation frameworks
- Fleet monitoring infrastructure
- Platform policy drafting and enforcement
- Security review process ownership
- Developer tools and self-service capabilities for domain teams
- Governance reporting and dashboards
Domain Teams (distributed function)
Charter: Own the agents in their domain. Run evaluations, respond to incidents, maintain behavioral pacts, and contribute to policy development.
Core responsibilities:
- Agent specification and registry maintenance for their agents
- Regular evaluation execution using platform-provided tooling
- First-line incident response
- Policy interpretation and agent-level policy (behavioral pacts)
- Contributing domain expertise to platform policy development
Governance Board (cross-functional function)
Charter: Set direction, approve high-stakes decisions, review fleet health, ensure regulatory alignment.
Membership: Platform team lead, security representative, legal/compliance representative, representative domain team leads (rotating), executive sponsor.
Meeting cadence: Weekly Tier 2 review. Monthly strategic review.
Core responsibilities:
- Policy change approval
- High-risk agent approval
- Cross-domain incident response coordination
- Regulatory alignment and response
- Fleet-wide pause authorization
- External audit preparation
Staffing Ratios
These ratios are based on observed practice at organizations with mature agent governance programs and should be adjusted based on the risk profile of the fleet:
Governance Engineers (within Platform Team)
Function: Registry management, policy documentation, compliance monitoring, audit trail review, reporting.
Ratio: 1 governance engineer per 20 actively governed agents.
For a 100-agent fleet: 5 governance engineers.
Platform Engineers (within Platform Team)
Function: Registry tooling, evaluation infrastructure, monitoring systems, fleet pause capability, integration health monitoring.
Ratio: 1 platform engineer per 50 agents.
For a 100-agent fleet: 2 platform engineers (or equivalent fractional capacity).
Security Engineer (within or supporting Platform Team)
Function: Access control review, security assessment for new agents, vulnerability monitoring, integration security review.
Ratio: 1 dedicated security engineer for fleets of 50+ agents; fractional engagement for smaller fleets.
Domain Agent Owners (distributed)
Function: Registry maintenance, evaluation execution, behavioral pact maintenance, first-line incident response.
Ratio: Each agent must have a named owner. An agent owner can own 5-10 low-risk agents or 2-3 high-risk agents while maintaining quality oversight.
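The ratios above translate into a simple headcount estimate. The sketch below rounds up to whole people, which is an assumption (the text allows fractional capacity), and computes the minimum number of agent owners at the upper end of the stated spans (10 low-risk or 3 high-risk agents per owner), assuming owners do not mix risk tiers.

```python
import math


def staffing_plan(total_agents: int, high_risk_agents: int) -> dict:
    """Estimate governance headcount from the staffing ratios."""
    low_risk_agents = total_agents - high_risk_agents
    return {
        "governance_engineers": math.ceil(total_agents / 20),
        "platform_engineers": math.ceil(total_agents / 50),
        "dedicated_security_engineer": total_agents >= 50,
        "min_agent_owners": (
            math.ceil(low_risk_agents / 10) + math.ceil(high_risk_agents / 3)
        ),
    }


plan = staffing_plan(total_agents=100, high_risk_agents=15)
```

For a 100-agent fleet with 15 high-risk agents this gives 5 governance engineers, 2 platform engineers, a dedicated security engineer, and at least 14 agent owners.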
The Agent Owner Role Description
Agent ownership is a real accountability, not a name in a field. Organizations that treat it as a formality produce registry entries that are accurate at deployment and wrong six months later.
An agent owner is responsible for:
- Specification accuracy: The registry entry is accurate and current. When anything changes, the registry is updated within 24 hours.
- Evaluation currency: Evaluations are run on the required cadence. When scores drop, the owner investigates. When required out-of-band reviews are triggered, the owner runs them.
- Pact currency: The behavioral pact accurately reflects the agent's intended behavior and is consistent with current policy. The owner reviews the pact at every major evaluation.
- First-line incident response: When an anomaly or incident involving this agent is detected, the owner is the first call. The owner knows the agent well enough to contribute meaningfully to triage.
- Retirement: When an agent is no longer needed, the owner initiates the retirement process, including access revocation and audit documentation.
- Business continuity: The owner has designated a backup owner who can perform these functions in the event of the owner's unavailability. The backup owner has been briefed.
Agent ownership should be factored into performance reviews and team OKRs. If ownership has no consequences, positive or negative, it will not be taken seriously.
Component 9: Metrics and KPIs for the Operating Model
You cannot improve what you do not measure. The governance operating model needs its own metrics: not just metrics about agents, but metrics about the governance function itself.
Fleet Health Score
The fleet health score is a composite metric that gives a single number representing the current health of the agent fleet from a governance perspective. It should be visible in every governance dashboard.
Calculation:
fleet_health_score = (
    eval_pass_rate × 0.30 +
    pact_coverage × 0.20 +
    eval_freshness × 0.20 +
    incident_rate_score × 0.15 +
    anomaly_resolution_rate × 0.10 +
    compliance_coverage × 0.05
)
Where:
- eval_pass_rate: Percentage of agents whose most recent evaluation met or exceeded the acceptance threshold
- pact_coverage: Percentage of active agents with a current, approved behavioral pact
- eval_freshness: Percentage of active agents with an evaluation completed within the required review cycle
- incident_rate_score: Score that decreases based on frequency and severity of recent incidents (P0 = -20 points; P1 = -10 points; P2 = -3 points; P3 = -1 point; resets toward 100 over 30 days with no incidents)
- anomaly_resolution_rate: Percentage of open anomalies resolved within their target resolution window
- compliance_coverage: Percentage of agents with current regulatory mapping and required compliance documentation
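A runnable version of the calculation might look like the sketch below. All components are assumed to be on a 0 to 100 scale. The recovery rule is only loosely specified above ("resets toward 100 over 30 days"), so this sketch makes one specific assumption: each incident's penalty decays linearly to zero over 30 days. Treat that as a placeholder for whatever recovery rule your organization adopts.

```python
WEIGHTS = {
    "eval_pass_rate": 0.30,
    "pact_coverage": 0.20,
    "eval_freshness": 0.20,
    "incident_rate_score": 0.15,
    "anomaly_resolution_rate": 0.10,
    "compliance_coverage": 0.05,
}
PENALTY_POINTS = {"P0": 20, "P1": 10, "P2": 3, "P3": 1}
RECOVERY_DAYS = 30  # assumption: linear decay of each penalty over 30 days


def incident_rate_score(incidents: list) -> float:
    """incidents: list of (severity, age_in_days) tuples. Returns 0 to 100."""
    penalty = 0.0
    for severity, age_days in incidents:
        remaining = max(0.0, 1.0 - age_days / RECOVERY_DAYS)
        penalty += PENALTY_POINTS[severity] * remaining
    return max(0.0, 100.0 - penalty)


def fleet_health_score(components: dict) -> float:
    """Weighted sum of the six components, each on a 0 to 100 scale."""
    missing = set(WEIGHTS) - set(components)
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    return sum(components[name] * weight for name, weight in WEIGHTS.items())


score = fleet_health_score({
    "eval_pass_rate": 92,
    "pact_coverage": 96,
    "eval_freshness": 90,
    "incident_rate_score": incident_rate_score([("P1", 0), ("P2", 15)]),
    "anomaly_resolution_rate": 80,
    "compliance_coverage": 100,
})
```

The weights sum to 1.0, so the composite stays on the same 0 to 100 scale as its inputs.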
Time-to-X Metrics
Mean Time to Detect (MTTD): Average time from when an agent incident occurs to when it is classified as P2 or higher.
Target: <4 hours for P2; <30 minutes for P1; <15 minutes for P0.
Measurement: Incident log timestamp comparison.
Mean Time to Contain (MTTC): Average time from incident classification to containment (agent suspended or issue mitigated).
Target: <2 hours for P2; <1 hour for P1; <15 minutes for P0.
Measurement: Incident log timestamp comparison.
Mean Time to Resolution (MTTR): Average time from incident classification to confirmed resolution and return to service.
Target: <5 business days for P2; <2 business days for P1; case-by-case for P0.
Measurement: Incident ticket timestamps.
Mean Time to Approve: Average time from a new agent submission to approval decision.
Target: <5 business days for low risk; <10 business days for medium risk; <20 business days for high risk.
Measurement: Registry timestamps.
Coverage Metrics
Agent Registry Coverage: Percentage of agents known to be in production that have a current registry entry.
Target: 100%. No exceptions.
Measurement: Cross-reference registry against infrastructure deployment records monthly.
Evaluation Coverage: Percentage of active agents with a passing evaluation from within the required review cycle.
Target: ≥95%.
Measurement: Registry query against evaluation records.
Pact Coverage: Percentage of active agents with a current, approved behavioral pact.
Target: 100% for high-risk agents; ≥95% for medium-risk; ≥90% for low-risk.
Measurement: Registry query against pact records.
Owner Coverage: Percentage of active agents with a named, current, reachable owner.
Target: 100%.
Measurement: Registry query with owner email verification.
Policy Compliance Coverage: Percentage of active agents for which there is documented evidence that they are operating within current policy.
Target: 100%.
Measurement: Registry query against compliance documentation.
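Registry coverage is the metric most easily computed directly, because it is a comparison of two inventories. A minimal sketch, assuming you can export agent identifiers from both your infrastructure deployment records and your registry as sets:

```python
def registry_coverage(deployed: set, registered: set) -> tuple:
    """Return (coverage %, unregistered agents, registry entries not deployed)."""
    unregistered = sorted(deployed - registered)   # in production, not in registry
    not_deployed = sorted(registered - deployed)   # candidates for cleanup
    if not deployed:
        return 100.0, unregistered, not_deployed
    coverage = 100.0 * len(deployed & registered) / len(deployed)
    return coverage, unregistered, not_deployed


coverage, unregistered, not_deployed = registry_coverage(
    deployed={"a", "b", "c", "d"},
    registered={"a", "b", "c", "e"},
)
```

Anything in `unregistered` is a shadow deployment and breaks the 100% target; anything in `not_deployed` is a registry entry that may be stale or belong to a retired agent.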
Efficiency Metrics
Agent Utilization Rate: Ratio of active agents (actually used in the past 30 days) to total registered agents.
Target: ≥80%.
Rationale: Low utilization suggests agents are being registered and forgotten, which is a governance risk. Unused agents still need to be maintained, evaluated, and monitored. Agents with utilization near zero should be reviewed for retirement.
Governance Process Cycle Time: Average time for each governance process (new agent approval, evaluation review, policy update) from initiation to completion.
Tracking: Month-over-month trend.
Rationale: If governance processes are getting slower, the governance function is not scaling with the fleet. Slowdown is a leading indicator of governance bottleneck that, if not addressed, results in teams finding ways to route around the governance process.
Cost per Governed Agent: Total cost of the governance function (staff, tooling, process overhead) divided by the number of actively governed agents.
Target: Declining over time as the fleet grows (governance function should exhibit scale economies).
Rationale: If governance cost per agent is flat or rising as the fleet grows, the governance model is not scaling appropriately.
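The utilization and cost-per-agent metrics are simple ratios, but the cost target is about the trend rather than any single value. The sketch below uses made-up monthly figures purely for illustration and defines "declining" strictly (every month lower than the previous one), which is a stricter test than you may want in practice.

```python
def utilization_rate(active_last_30_days: int, total_registered: int) -> float:
    """Percentage of registered agents actually used in the past 30 days."""
    if total_registered == 0:
        return 0.0
    return 100.0 * active_last_30_days / total_registered


def cost_per_governed_agent(monthly: list) -> list:
    """monthly: list of (governance_cost, governed_agents) tuples."""
    return [cost / agents for cost, agents in monthly]


def is_declining(series: list) -> bool:
    """True if every value is strictly lower than the one before it."""
    return all(later < earlier for earlier, later in zip(series, series[1:]))


# Illustrative figures only: total governance cost and fleet size per month.
costs = cost_per_governed_agent([(50000, 40), (60000, 60), (70000, 100)])
```

In this example total governance cost rises each month, but cost per agent falls from 1250 to 1000 to 700, which is the scale-economy pattern the target describes.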
Component 10: Implementation Roadmap
Building all of this at once is not realistic. The roadmap below sequences the work in a way that delivers value incrementally while building toward a mature governance function.
Month 1: Foundation
Week 1-2: Registry Bootstrap
- Define the registry schema (adapt the schema from this post to your context)
- Implement the registry in whatever system you have access to (a spreadsheet, a Notion database, or a purpose-built tool like Armalo all work; the discipline matters more than the tool initially)
- Conduct an inventory of all currently deployed agents: who deployed them, what do they do, who owns them
- Populate the registry with existing agents
- Identify gaps (agents with no clear owner, no policy documentation, no evaluation records)
Week 3-4: RACI and Naming
- Define the RACI matrix for your organization (adapt from this post)
- Assign owners for all registered agents
- Establish naming convention:
[team]-[purpose]-[environment] (e.g., cx-refund-eligibility-prod)
- Document escalation contacts and channels for each agent
- Brief all agent owners on their responsibilities
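A naming convention only holds if it is checked at registration time. The following sketch validates the `[team]-[purpose]-[environment]` pattern with a regular expression. The allowed environment names and the lowercase-alphanumeric restriction are assumptions; the purpose segment is allowed to contain hyphens, as in the `cx-refund-eligibility-prod` example.

```python
import re
from typing import Optional

# Assumption: environment names are illustrative; use your own list.
ENVIRONMENTS = ("dev", "staging", "prod")

NAME_PATTERN = re.compile(
    r"^(?P<team>[a-z0-9]+)"
    r"-(?P<purpose>[a-z0-9]+(?:-[a-z0-9]+)*)"
    r"-(?P<environment>" + "|".join(ENVIRONMENTS) + r")$"
)


def parse_agent_name(name: str) -> Optional[dict]:
    """Return the name's parts, or None if it violates the convention."""
    match = NAME_PATTERN.match(name)
    return match.groupdict() if match else None


parts = parse_agent_name("cx-refund-eligibility-prod")
```

Rejecting non-conforming names at registry entry time is cheaper than renaming agents after dashboards and alerts already reference them.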
Deliverables by end of Month 1:
- Registry containing all known agents with owners assigned
- RACI documentation reviewed and accepted by all role holders
- Naming convention documented and applied
- List of governance gaps to address in subsequent months
Month 2: Policy and Financial Controls
Policy hierarchy definition:
- Draft platform-level policy (the absolute constraints that apply to all agents)
- For each major org/division, draft organizational-level policy
- Define the process for agent-level behavioral pacts
- Conduct legal and compliance review of draft policies
Budget authority matrix:
- Define agent financial tier classifications for your context
- Build budget delegation matrix
- Document approval process for each tier
- Retroactively classify all existing agents by financial tier
- Identify any agents operating outside their authorized financial tier and remediate
Deliverables by end of Month 2:
- Policy hierarchy documented and approved
- Budget delegation matrix documented and accepted by finance
- All agents classified by financial tier
- Any financial control gaps identified and remediation in progress
Month 3: Evaluation Cadence and Incident Playbook
Evaluation infrastructure:
- Define evaluation criteria and scoring methodology for each agent category
- Establish evaluation tooling (this is where purpose-built tooling earns its cost)
- Define acceptance thresholds by risk tier
- Run baseline evaluations for all registered agents
- Set review cycle for each agent based on risk tier
Incident playbook:
- Define severity classification criteria
- Document escalation paths
- Write incident response runbooks for each severity level
- Assign on-call rotation
- Test incident response with a tabletop exercise
- Document fleet-wide pause capability and test it in staging
Deliverables by end of Month 3:
- All active agents have baseline evaluation scores on record
- Evaluation cadence scheduled
- Incident playbook documented and reviewed
- On-call rotation established
- Fleet-wide pause tested
Month 4: CoE Structure and Tooling
CoE structure:
- Formally charter the platform team
- Hire or assign governance engineers (target: 1 per 20 agents in the fleet)
- Charter the governance board
- Schedule recurring governance meetings (Tier 1 daily, Tier 2 weekly, Tier 3 monthly)
Tooling:
- Select and implement registry tooling (or migrate from initial spreadsheet to purpose-built system)
- Implement governance dashboards (fleet health score, time-to-X metrics, coverage metrics)
- Automate evaluation freshness alerts
- Implement cost monitoring per agent
- Configure anomaly detection
Deliverables by end of Month 4:
- CoE formally constituted and operational
- Governance meetings running on schedule
- Registry in purpose-built tooling
- Dashboards live and reviewed at each meeting
Month 5: First Fleet Audit
Full fleet audit:
- Registry completeness audit: every agent in production has a complete, current registry entry
- Evaluation currency audit: every agent has a current evaluation
- Policy compliance audit: every agent has a current behavioral pact consistent with policy hierarchy
- Financial controls audit: every agent's financial tier is documented and consistent with authorization records
- Incident preparedness audit: incident response was tested for each risk tier
Gap closure:
- Any agents without owners: assign owners and brief them
- Any agents without evaluations: run evaluations and establish cadence
- Any agents without behavioral pacts: draft and approve pacts
- Any agents whose financial authority is undocumented: document and get required approvals
Deliverables by end of Month 5:
- First fleet audit report
- Documented gap closure plan with owners and dates
- Governance function operating on standard cadence
Month 6: Board-Level Reporting
Executive dashboard:
- Build the quarterly board presentation template
- Define which metrics are reported at executive level vs. governance board level vs. operational level
- Establish data collection and reporting process for executive dashboard
- Present first quarterly report to executive sponsor
Continuous improvement:
- Review governance function efficiency metrics
- Identify process improvements based on first six months of operation
- Document lessons learned
- Update registry schema, RACI, and policy based on experience
Deliverables by end of Month 6:
- Quarterly board reporting template and process
- First executive dashboard presented
- Governance function improvement plan based on six-month review
- Updated documentation package
The 100-Agent Transition: What Actually Changes
It is worth being concrete about the transition moment: the period when a fleet crosses from manageable without formal structure to requiring it.
With fewer than about 20 agents, informal governance often works adequately:
- The original builders are still involved and know the system
- Communication is direct: everyone who needs to know something can be reached in Slack
- Incidents are visible to everyone and response is instinctive
- The registry might be a shared doc that someone actually reads
- Budget is small enough that cost overruns are immediately noticed
The risk at this stage is complacency. Things are working, so there is no urgency to build structure. This is precisely when the structure should be built: while the fleet is small enough that you can get everything documented without it being overwhelming.
The Dangerous Middle (20-50 agents)
This is where most organizations get into trouble. The fleet is large enough that informal governance is failing, but it has not failed dramatically enough that anyone demands formal governance. The visible symptoms:
- "Wait, who owns that agent?" becomes a question with no clean answer
- A new team member deploys an agent that does something similar to an existing agent they didn't know about
- An agent evaluation runs late because the owner is on vacation and there was no backup
- A minor incident takes longer to resolve than it should because the escalation path was not documented
- Two teams build similar tooling independently because there was no central platform
These are all symptoms of an informal governance model that has been outgrown. They are also not individually catastrophic, which is why they do not generate the urgency to fix them.
At 100 Agents: The Breaking Point
By the time a fleet reaches 100 agents, the informal model is definitively broken. The symptoms become undeniable:
- Nobody can name all the agents and what they do
- When a new regulation comes in, nobody can quickly identify which agents are affected
- An incident involving one agent triggers a full-day investigation because the relationships between agents were never documented
- A model update from a provider causes unexpected behavior changes across dozens of agents before anyone realizes they share the same model
- The annual audit becomes a weeks-long fire drill to assemble documentation that should have been kept current
- Budget review reveals agents nobody recognized spending money that nobody authorized
At 100 agents without formal governance, you are flying blind. You are also, increasingly, accountable. Regulators, customers, and boards are asking questions about AI governance, and "we're still figuring out the structure" is no longer an acceptable answer.
Common Governance Anti-Patterns
Building a governance function also requires knowing what not to do. These are the most common failure modes:
The Dashboard Fallacy
Organizations invest in a beautiful governance dashboard that shows all the right metrics in real time. Nobody reviews it regularly. Nobody acts on what it shows. The dashboard becomes proof that governance exists without providing any of the benefit.
Governance is a process, not a tool. The dashboard is useful only if someone is looking at it, someone has the authority to act on what they see, and the process around the dashboard is as well-defined as the dashboard itself.
The Approval Bottleneck
A well-intentioned governance board becomes a bottleneck that teams route around. New agents are deployed in development environments that somehow stay running. Agents are described as "low risk" regardless of their actual risk profile to avoid the slower approval process. Shadow deployments proliferate.
The solution is tiered approval with appropriate cycle times at each tier. Low-risk agents should be approvable in days, not weeks. Governance should be designed to be the path of least resistance for doing the right thing.
Evaluation Theater
Evaluations are run, scores are recorded, reports are generated, but the evaluations do not actually measure the things that matter for this agent's risk profile. A customer service agent gets evaluated on response coherence but not on whether it makes representations consistent with current policy. A financial agent gets evaluated on math accuracy but not on whether it correctly identifies when human review is required.
Evaluations need to be tailored to risk. Generic evaluations are better than nothing but are not adequate for high-stakes agents.
The Stale Registry
The registry is comprehensive at deployment time and then immediately starts decaying. Model versions change. Owners change. Data access levels change. Integrations are added or removed. Nobody updates the registry because updating the registry was never explicitly anyone's job.
The solution is to make registry updates mandatory at specific trigger points (the mandatory review triggers listed in the registry section) and to measure registry accuracy as a governance KPI.
Single Owner Single Point of Failure
An agent's owner leaves the company. The agent is effectively ungoverned until someone notices, which might be at the next scheduled evaluation, or during an incident, or during an audit. By then, the agent may have been running without accountable oversight for months.
Every agent needs a backup owner. Backup owner assignment is part of the initial registration process, not an afterthought.
The Governance Exception That Becomes the Rule
A team needs to deploy an agent quickly and gets a governance exception for expedited review. This works once. Then a second team requests the same exception. Then expedited review becomes the standard process for anyone who asks for it, and the governance timeline expectations are never calibrated correctly.
Exceptions must be exceptional. Every exception should require explicit justification, explicit time limits, and a commitment to complete the standard review within a defined window. Exception rates should be tracked as a governance metric.
Armalo's Role in the Governance Stack
The operating model described in this post is architecture โ the organizational design, the process definitions, the role assignments, the decision frameworks. Implementing it requires tooling that supports the operational needs: a registry that is queryable and auditable, evaluation infrastructure that produces structured, comparable results, behavioral pacts that serve as the documented contract between agent owners and the governance function, and a trust oracle that other systems can query to determine whether a given agent has demonstrated the behavioral standards required for a given context.
This is precisely the problem Armalo was built to solve. The registry is native. Behavioral pacts are a first-class concept: not a document in Confluence, but a structured, versioned, auditable contract that captures what an agent is committed to doing and not doing, evaluated against adversarial tests, and scored on twelve behavioral dimensions. The evaluation infrastructure runs automatically against the pact, producing consistent scores that feed the trust oracle.
When a governance board needs to answer the question "which of our agents have demonstrated behavioral reliability above a threshold that meets our risk requirements," that answer is a single API call to the trust oracle. When an incident happens and the question is "what behavioral commitments did this agent make and how well was it meeting them," the audit trail is in the pact record.
For organizations building the governance infrastructure described in this post, Armalo provides the technical layer that makes the process tractable at scale. The operating model decisions (the RACI, the policy hierarchy, the incident severity definitions) are organizational design choices that tools cannot make for you. But once those decisions are made, tools can make them dramatically less expensive to operate.
Getting Started: The Minimum Viable Governance Package
If your fleet is already at 50 agents and you have not yet built formal governance, here is the minimum viable governance package that addresses the highest-priority gaps:
In the next two weeks:
1. Build a registry. Even a spreadsheet. Every agent, every owner, every purpose. This is the foundation of everything else.
2. Assign a named owner to every agent. Do not move on until every row has an owner email.
3. Define platform policy: the three to five absolute constraints that apply to every agent, everywhere. Write them down. Get them approved by whoever needs to approve them.
4. Identify all Tier 2 and Tier 3 agents (those with financial authority). These are your highest-priority governance items.
In the next month:
5. Run evaluations for all Tier 2 and Tier 3 agents. Even one evaluation is better than zero.
6. Write behavioral pacts for all Tier 2 and Tier 3 agents.
7. Define your incident severity classification and the two or three most critical escalation paths.
8. Identify who has fleet-wide pause authority and ensure they can exercise it.
In the next quarter:
9. Expand the evaluation program to all agents.
10. Run your first fleet audit.
11. Establish the regular governance cadence.
12. Build or acquire the tooling to make the registry queryable and the evaluations comparable.
Perfect is the enemy of good in governance infrastructure, just as in most engineering work. A registry with some gaps is better than no registry. A governance process with some rough edges is better than no governance process. Start now, with what you have, and improve iteratively.
The alternative, waiting until the governance infrastructure is fully designed before starting, means waiting until after the first serious incident. Organizations that reach 100 agents without governance infrastructure do not usually get to implement it thoughtfully. They implement it reactively, under pressure, in response to something that should not have happened.
The Organizational Conversation You Need to Have
Governance infrastructure does not emerge from technical decisions alone. It requires organizational consensus on some questions that are genuinely difficult and that many organizations prefer to defer:
What is our risk appetite for agent autonomy? How much autonomous action, at what stakes levels, with what frequency, are we comfortable authorizing? This is a board-level question. The answer determines the entire structure of budget authority, evaluation thresholds, and human oversight requirements.
Who is ultimately accountable when an agent causes harm? Not "who is responsible for fixing it"; that is process. Who is accountable, in the leadership accountability sense, for the consequences? The answer determines where in the org hierarchy agent ownership lives and how seriously the ownership role is taken.
What does "sufficient evaluation" mean? What evaluation score, on what dimensions, with what recency, is sufficient for an agent to operate without enhanced oversight? This is a technical question with an organizational answer: it depends on the organization's risk tolerance and the consequences of agent failures in the relevant context.
How do we balance governance overhead against deployment velocity? Every governance control adds friction. Some friction is appropriate. Too much friction and teams route around the governance process, which produces worse outcomes than the friction it was trying to prevent. The right balance is organization-specific.
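Once the organization has answered the "sufficient evaluation" question, the answer can be written down as an explicit rule over score, dimensions, and recency. The sketch below assumes a per-tier policy table; the thresholds, the dimension names, and the `needs_enhanced_oversight` helper are hypothetical placeholders, not recommended values.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical policy: minimum score on every dimension and maximum
# evaluation age, per tier. The numbers are placeholders.
POLICY = {
    1: {"min_score": 0.70, "max_age_days": 180},
    2: {"min_score": 0.85, "max_age_days": 90},
    3: {"min_score": 0.95, "max_age_days": 30},
}


@dataclass
class Evaluation:
    scores: dict   # dimension name -> score in [0, 1]
    run_on: date   # date the evaluation was run


def needs_enhanced_oversight(tier, evaluation, today):
    """True if the agent lacks a sufficient, recent evaluation for its tier."""
    rule = POLICY[tier]
    # No evaluation, or one with no scored dimensions, is never sufficient.
    if evaluation is None or not evaluation.scores:
        return True
    # Stale evaluations do not count, whatever their scores.
    if today - evaluation.run_on > timedelta(days=rule["max_age_days"]):
        return True
    # Every dimension must clear the tier's minimum score.
    return any(s < rule["min_score"] for s in evaluation.scores.values())


today = date(2025, 6, 1)
recent = Evaluation({"accuracy": 0.97, "policy_compliance": 0.96}, date(2025, 5, 20))
print(needs_enhanced_oversight(3, recent, today))  # False
```

The useful property of encoding the rule this way is that changing risk appetite becomes a change to the policy table, reviewed and approved like any other policy, not a case-by-case judgment.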
These conversations are uncomfortable partly because they force explicit acknowledgment that agents are taking actions on behalf of the organization, with organizational consequences, and that the organization is therefore accountable for those actions. Most organizations would rather believe that their AI tools are just helpful software that happens to do things. The governance model forces the acknowledgment that agents are organizational actors, operating under delegated authority, and that authority needs to be explicitly designed.
The organizations that have this conversation now, before a P0 incident makes it unavoidable, have a structural advantage. They build governance infrastructure when they can afford to do it thoughtfully. They make risk appetite decisions when they have the luxury of principled deliberation. They assign authority based on structure rather than on who happened to respond to an emergency.
Fleet governance is not bureaucracy. It is the organizational infrastructure that allows agent programs to scale without either the risk accumulating silently or the oversight costs growing linearly with the fleet. Done right, it enables speed, because teams can deploy quickly when the governance process is well-defined and trusted, rather than slowly because every deployment is ad hoc and everyone is nervous about what might go wrong.
Build the structure before you need it. That is the defining characteristic of organizations that operate large agent fleets successfully.