An AI agent is not like a deterministic SaaS integration. It is a policy-making actor that makes decisions, takes actions, and produces outputs that downstream humans and systems treat as authoritative. From a risk management perspective, deploying an AI agent is closer to hiring an employee than licensing a tool. Enterprises have decades of infrastructure — background checks, professional licensure, bonding, audit processes, termination procedures — for governing the behavior of hired actors. They have almost nothing for governing the behavior of AI actors.
The CISO, the CRO, and the compliance lead are not being obstructionist when they ask the three questions. They are translating their existing risk frameworks into a new technology class. If your answers do not slot into those frameworks, the deal stalls — not because anyone hates your product, but because nobody can sign off.
The three questions decompose cleanly into three existing enterprise risk frameworks:
- Pre-deployment assurance — analogous to professional licensure and vendor certification
- Runtime risk management — analogous to operational risk controls and incident response
- Post-hoc auditability — analogous to internal audit and regulatory examination
Every enterprise already has processes for each. They are just waiting for an AI-agent-shaped answer to plug in.
Question One: "How Do We Know It Will Behave Correctly?"
The typical AI vendor response: "We test it extensively. Our evaluations show 94% accuracy on our benchmark suite."
The enterprise hears: "We tested it ourselves, on our benchmark, which we designed, and we're telling you it passed."
That's the vendor grading their own homework. What enterprise buyers actually need: a behavioral standard defined in machine-readable form, evaluated by an independent third party, with a scored track record over multiple evaluations — not a single benchmark run.
The analogy that lands: financial auditing. A company can produce its own financial statements. But enterprise counterparties require an independent audit. AI agent behavioral reliability needs the same independent audit layer.
What "grading your own homework" looks like to a CISO
Put yourself in the CISO's chair during a due-diligence review. You are looking at a vendor PDF that says:
- "94% accuracy on our benchmark suite"
- "Red-teamed by our internal team"
- "Hardened against jailbreaks"
- "Built on a frontier model"
Every one of those claims is unverifiable by you. You do not know what is in the benchmark suite, how it was constructed, whether it was adversarial, whether it was re-run after the last weight update, whether the vendor's internal red team has the incentive structure of a real adversary. You cannot reproduce any of it. You cannot cite any of it in a risk memo to the board. You cannot hand any of it to the regulator who will show up in two years and ask how this decision was made.
A CISO who signs off on that stack is assuming personal career risk for claims they cannot corroborate. Senior CISOs do not do that; they ask for independent verification, and when it is not available, the deal stalls.
What a compliant pre-deployment assurance answer looks like
A "How do we know it will behave correctly?" answer that lands inside a risk framework has four structural properties:
- A written behavioral standard. Not a marketing description. A machine-readable specification — a pact — with enumerated conditions, measurable thresholds, refusal categories, and explicit non-goals (sketched below).
- An evaluator the vendor does not control. Independent verification requires evaluators whose incentives are not aligned with the vendor. Multi-provider LLM juries, third-party red teams, and regulatory auditors all qualify.
- A track record across time, not a snapshot. A single benchmark run on a static corpus is not a track record. A pact with forty-seven independent evaluations over nine months, each with preserved evidence and signed verdicts, is.
- Comparability across vendors. The CISO is not evaluating one agent in isolation. They want to know how this agent compares to alternatives on the same pact. That requires pacts that are portable rubrics, not bespoke vendor-authored tests.
Armalo's behavioral pacts, the multi-LLM jury, and the composite trust score are designed to produce exactly this artifact. The output of the system is not a marketing PDF; it is a signed evaluation history that a CISO can paste directly into a risk register.
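To make "machine-readable" concrete, here is a minimal sketch of what such a pact could look like, written as a Python dict for readability. Every field name is illustrative rather than Armalo's actual schema; the point is that each commitment is enumerated and measurable, and that the whole document can be content-hashed at registration (the "registered pact hash" in the list that follows).

```python
import hashlib
import json

# Illustrative only: these field names belong to this sketch, not to
# Armalo's schema. What matters is that every commitment is enumerated
# and measurable, rather than described in marketing prose.
EXAMPLE_PACT = {
    "pact_id": "support-agent-refunds",
    "version": "1.3.0",
    "conditions": [
        {"id": "C1",
         "description": "Never issue a refund above the approval limit",
         "threshold": {"metric": "violation_rate", "max": 0.0}},
        {"id": "C2",
         "description": "Cite the governing policy document in every refund decision",
         "threshold": {"metric": "citation_rate", "min": 0.98}},
    ],
    "refusal_categories": ["legal_advice", "account_closure"],
    "non_goals": ["chargeback disputes", "fraud adjudication"],
}

def pact_hash(pact: dict) -> str:
    """Content-address the pact: canonical JSON in, SHA-256 hex digest out.

    Any change to any condition produces a new hash, which is what makes
    pact versions inspectable and tamper-evident.
    """
    canonical = json.dumps(pact, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```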
The procurement artifacts that replace "we tested it ourselves"
When the pre-deployment assurance layer exists, the CISO walks away with the following artifacts:
- A registered pact hash and URL, which the CISO's own team can inspect.
- An evaluation history feed, with per-evaluation signed verdicts, timestamps, and jury composition.
- A composite trust score with explicit time decay — old evidence does not carry forever (see the decay sketch below).
- Comparable scores for alternative vendors running the same pact.
- An audit trail of pact versions (and of re-evaluations triggered by any pact change).
Those are compliant artifacts. They slot into a third-party risk management process the way a SOC 2 report does.
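The time-decay mechanic deserves one concrete illustration. The 0–1000 scale comes from the glossary at the end of this post; the exponential half-life below is an assumption of this sketch, not Armalo's published formula.

```python
import math
from datetime import datetime, timezone

def composite_trust_score(
    evaluations: list[tuple[datetime, float]],  # (timestamp, verdict in [0, 1])
    half_life_days: float = 90.0,               # assumed decay rate, not Armalo's
    now: datetime | None = None,
) -> float:
    """Exponentially time-decayed weighted average, scaled to 0-1000.

    Recent verdicts dominate: a verdict half_life_days old carries half
    the weight of one issued today, so stale evidence fades rather than
    carrying forever.
    """
    now = now or datetime.now(timezone.utc)
    num = den = 0.0
    for ts, verdict in evaluations:
        age_days = (now - ts).total_seconds() / 86400.0
        weight = 0.5 ** (age_days / half_life_days)
        num += weight * verdict
        den += weight
    return 1000.0 * num / den if den else 0.0
```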
Question Two: "What Happens When It Makes a Mistake?"
This question is actually two questions: Is there a process for catching mistakes before they cause harm? And if harm occurs, is there a documented record of what the agent did and why?
On catching mistakes: most production agent deployments catch errors by waiting for humans to notice downstream effects. There's no behavioral baseline to compare against, no automated detection of behavioral drift.
On documentation: most organizations have logs. Very few have a structured behavioral record — a timestamped history of what the agent committed to doing, measured against an independent standard, with scores that are comparable across time.
The three flavors of "mistake" enterprises care about
Enterprise risk teams do not treat all mistakes the same. They segment them, and they want different controls for each.
- Loud, recoverable mistakes. The agent refuses, asks for clarification, returns a structured error. These are cheap; you just want evidence they are the default failure mode.
- Silent, unrecoverable mistakes. The agent produces a plausible-sounding but wrong output and downstream systems treat it as authoritative. This is the category that wakes CISOs up at night. Detection must be automated; human-in-the-loop "noticing something looks off" is not a control.
- Cascading, systemic mistakes. One bad output contaminates later steps, other agents, or shared state. These are rare but career-ending. Containment requires behavioral drift detection, circuit breakers, and a rollback path.
An acceptable "What happens when it makes a mistake?" answer must address all three segments. It must describe the detection mechanism, the containment mechanism, and the reporting mechanism for each. "We log errors" covers none of them.
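As a sketch of what "different controls for each" can mean in practice, here is one way to encode the segmentation. The control names are illustrative, not a standard taxonomy; the structural point is that every failure class gets an explicit detection, containment, and reporting entry.

```python
from enum import Enum, auto

class FailureClass(Enum):
    LOUD_RECOVERABLE = auto()      # refusal, clarification request, structured error
    SILENT_UNRECOVERABLE = auto()  # plausible-but-wrong output accepted downstream
    CASCADING_SYSTEMIC = auto()    # contamination of later steps or shared state

# Illustrative control routing: the mechanism names are this sketch's,
# not a standard. Each class needs detection, containment, and reporting.
CONTROLS = {
    FailureClass.LOUD_RECOVERABLE: {
        "detect": "structured error codes in the response envelope",
        "contain": "retry or human handoff",
        "report": "aggregate counts in the evaluation history",
    },
    FailureClass.SILENT_UNRECOVERABLE: {
        "detect": "automated verdicts against the pact, not human spot checks",
        "contain": "quarantine the output until a verdict passes",
        "report": "signed incident record tied to the active pact version",
    },
    FailureClass.CASCADING_SYSTEMIC: {
        "detect": "behavioral drift monitors on shared state",
        "contain": "circuit breaker plus rollback to last verified checkpoint",
        "report": "incident record plus re-evaluation of downstream agents",
    },
}
```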
What runtime risk management for AI agents actually looks like
A compliant runtime risk management answer has five structural pieces:
- Continuous behavioral evaluation. The agent is re-evaluated against its pact on a scheduled cadence — not just at deployment. Deviations from baseline performance trigger alerts.
- Real-time behavioral drift detection. Output distributions are monitored for changes in refusal rate, tool-call patterns, latency, confidence calibration, and failure taxonomy. Drift is a leading indicator of trouble (a minimal detection sketch follows below).
- Automated escalation. A drift signal does not wait for a human to notice; it opens an incident, pauses high-risk operations, and pages the on-call.
- Incident records structured as behavioral evidence. An incident is not just a ticket. It is a timestamped record of what the agent was supposed to do, what it actually did, how the deviation was detected, and what the remediation was — signed and archived.
- Economic consequence. High-risk operations are gated by escrow. If the agent fails behavioral delivery, funds are retained and a settlement record is written. This is what transforms behavioral reliability from a principle into a priced signal.
Armalo implements all five. The composite trust score decays in real time as new evidence arrives. The Shield monitoring subsystem watches behavioral drift. Incidents are recorded against the pact that was active at the time. Escrow on Base L2 settles according to pact-defined milestones.
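Shield's internals are not spelled out in this post, so the following is a minimal sketch of just one of the drift signals named above: refusal rate. The rolling window and the fixed tolerance are assumptions of the sketch, not Shield's actual parameters.

```python
from collections import deque

class RefusalRateDriftMonitor:
    """Minimal sketch of a single drift signal: refusal rate vs. baseline.

    A production system would track many distributions at once
    (tool-call patterns, latency, confidence calibration, failure
    taxonomy); this does not claim to reproduce Shield's internals.
    """

    def __init__(self, baseline_rate: float, window: int = 500,
                 tolerance: float = 0.05):
        self.baseline = baseline_rate       # refusal rate measured at certification
        self.tolerance = tolerance          # allowed absolute deviation (assumed)
        self.recent = deque(maxlen=window)  # rolling window of 0/1 refusal flags

    def observe(self, was_refusal: bool) -> bool:
        """Record one agent response; return True if a drift alert should fire."""
        self.recent.append(1 if was_refusal else 0)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough evidence in the window yet
        current = sum(self.recent) / len(self.recent)
        return abs(current - self.baseline) > self.tolerance

# On a True return, the escalation layer opens an incident and pauses
# high-risk paths rather than waiting for a human to notice.
```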
The documentation gap "we have logs" does not close
The CISO's second half-question — is there a documented record of what the agent did and why — is usually where the vendor conversation gets quietly awkward. The vendor says "we have logs." The CISO asks to see a sample. The sample is a stream of unstructured, timestamped JSON events: prompts, tool calls, outputs, occasionally a trace ID.
This is not a behavioral record. It is a debug stream.
A behavioral record has three properties logs lack:
- Structured against a behavioral standard. Each event is tagged with the pact condition it relates to, so a reviewer can trace from "the agent did X" to "X was evaluated as passing against condition Y of pact Z."
- Content-addressed and tamper-evident. Each event's payload has a content hash, and the sequence of hashes is itself signed. Altering the record is detectable (see the hash-chain sketch below).
- Joined to verdicts and settlement. An action that triggered an escrow milestone carries the verdict record and settlement transaction hash. A regulator can walk the chain without trusting the vendor.
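Hash-chaining is a standard construction, and it is short enough to show in full. The sketch below demonstrates the tamper-evidence property; in production the final chain hash would additionally be signed and anchored externally, which is deployment-specific and omitted here.

```python
import hashlib
import json

def chain_events(events: list[dict]) -> list[dict]:
    """Content-address each event and chain it to its predecessor.

    Altering or deleting any event changes every subsequent chain_hash,
    so tampering is detectable by recomputing the chain. Assumes event
    dicts do not already contain the two reserved hash keys.
    """
    prev = "0" * 64  # genesis value
    out = []
    for event in events:
        payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
        content_hash = hashlib.sha256(payload.encode()).hexdigest()
        chain_hash = hashlib.sha256((prev + content_hash).encode()).hexdigest()
        out.append({**event, "content_hash": content_hash, "chain_hash": chain_hash})
        prev = chain_hash
    return out

def verify_chain(chained: list[dict]) -> bool:
    """Recompute the chain; any edit to any payload breaks verification."""
    prev = "0" * 64
    for rec in chained:
        event = {k: v for k, v in rec.items()
                 if k not in ("content_hash", "chain_hash")}
        payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
        content_hash = hashlib.sha256(payload.encode()).hexdigest()
        chain_hash = hashlib.sha256((prev + content_hash).encode()).hexdigest()
        if rec["content_hash"] != content_hash or rec["chain_hash"] != chain_hash:
            return False
        prev = chain_hash
    return True
```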
If your answer to "What happens when it makes a mistake?" ends with "we have logs," you are two product meetings away from being replaced by a competitor whose answer ends with "here is the signed behavioral record and the settlement receipts."
Question Three: "Can We Audit What It Did?"
The short answer from most vendors: "Yes, we have logs." But those logs are not structured audit trails organized around behavioral commitments.
Enterprise compliance teams need: a record of what the agent was committed to doing, a record of what it actually did, a record of any behavioral deviations and how they were flagged, and on-chain settlement records for any financial actions.
This isn't a logging problem. It's a behavioral accountability infrastructure problem. You can't produce an audit trail for behavior you never specified.
Who actually asks this question
In every deal we have seen, the question "Can we audit what it did?" comes from one of five specific roles:
- Internal audit. They need to sample the agent's actions and verify compliance with internal policy.
- Regulatory examiners. In financial services, healthcare, and public-sector deals, examiners show up and ask for evidence. "We trust our logs" is not an answer.
- External counsel in a dispute. When a counterparty claims the agent made a commitment it did not deliver, counsel needs the evidence. Logs are not evidence; they are raw material.
- Incident responders after a public failure. A post-mortem that begins with "we cannot reconstruct exactly what the agent did" is the beginning of a career-defining week for the CISO.
- Board risk committees. Board-level risk oversight requires structured reporting. Unstructured logs do not roll up into board-level reporting without a lot of manual work.
The structural point: none of these audiences want raw logs. They want a behavioral audit trail. The raw logs are an input to the audit trail, not the audit trail itself.
The four artifacts a compliant audit trail produces
A post-hoc auditability answer that survives regulatory and dispute scrutiny has to produce four artifacts on demand:
- The behavioral specification that was active at time T. The pact version, content-hashed, with effective dates.
- The evidence the agent produced at time T. Inputs, outputs, tool calls, reasoning traces, content-hashed.
- The verdicts on that evidence against that specification. Jury composition, per-judge verdicts, aggregated score, consensus, signed.
- The financial settlement, if any. On-chain transaction hash, escrow condition, resolution, counterparty.
If any of the four is missing from a vendor's answer, the audit stalls. With all four present, the audit is a thirty-minute conversation rather than a six-month discovery exercise.
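To show how mechanical this becomes when the artifacts exist, here is a sketch of an audit request as a lookup. The store interface is hypothetical, a stand-in for wherever each record lives; the shape of the bundle mirrors the four artifacts above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Optional

@dataclass
class AuditBundle:
    """The four artifacts named above, for a single point in time."""
    pact_version: dict          # content-hashed spec active at time T
    evidence: list[dict]        # hashed inputs, outputs, tool calls at time T
    verdicts: list[dict]        # signed per-judge verdicts plus aggregate
    settlement: Optional[dict]  # on-chain tx reference, or None if no financial action

def assemble_audit_bundle(agent_id: str, t: datetime, store: Any) -> AuditBundle:
    """An audit request becomes a lookup, not a discovery exercise.

    The store methods below are placeholders for this sketch, not
    Armalo's API: each one fetches a record the system already wrote
    as a byproduct of normal operation.
    """
    return AuditBundle(
        pact_version=store.pact_active_at(agent_id, t),
        evidence=store.evidence_at(agent_id, t),
        verdicts=store.verdicts_for(agent_id, t),
        settlement=store.settlement_for(agent_id, t),
    )
```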
How Armalo produces each artifact automatically
Armalo's architecture is designed so the audit artifacts are byproducts of normal operation, not things anyone has to compile after the fact:
- Pacts are versioned and content-hashed at registration.
- Evidence is captured and hashed at evaluation time.
- Jury verdicts are signed and stored with full composition.
- Escrow settlement happens on Base L2 with on-chain records referencing the verdict that triggered it.
When a CISO, an examiner, or a counsel asks "Can we audit what it did?" the answer is a URL and a date range. Everything underneath is already in place.
What Changes When the Infrastructure Exists
Question One gets answered with evidence: "Here's our agent's behavioral pact. Here's the independent evaluation history — 47 evaluations over 9 months. Here's the certification tier our agent currently holds. Here are comparable scores for the two alternatives on the same pact."
Question Two gets answered with process: "Our agent is continuously evaluated against its behavioral pact. Deviations trigger automated alerts through Shield. Any financial action is settled on-chain with an immutable record. High-risk paths are gated by escrow that references pact conditions."
Question Three gets answered with auditability: "Here is the agent's behavioral specification, its full evaluation history, its verdict records, and its on-chain settlement history. Structured for regulatory review. Reproducible by any third party."
These answers close deals. They satisfy CISOs. They give enterprise boards something to sign off on. They slot into existing third-party risk management processes. They produce artifacts counsel can cite.
The deal velocity effect
When we started tracking enterprise deals, we noticed something we did not expect. Deals that had the three questions answered with Armalo artifacts closed roughly four times faster than deals where the answer was still in the vendor's backlog. That is not because the enterprise relaxed its standards; it is because the procurement workflow is built around slotting compliant artifacts into existing processes. When the artifacts are missing, the workflow stalls. When they exist, the workflow proceeds.
A Procurement Walkthrough
Here is what a deal conversation looks like with versus without compliant answers to the three questions.
| Stage | Without Infrastructure | With Armalo Infrastructure |
|---|---|---|
| Technical evaluation | Passes. Vendor demo is great. | Passes. Vendor demo is great. |
| Champion buy-in | Strong. Champion loves the roadmap. | Strong. Champion loves the roadmap. |
| Security review | CISO asks Q1. Vendor sends benchmark PDF. CISO asks for independent evidence. Vendor says "our internal red team." Review stalls. | CISO asks Q1. Vendor sends pact URL + evaluation history. CISO's team inspects pact, runs their own red-team additions, requests a pact revision. Review proceeds. |
| Risk review | CRO asks Q2. Vendor describes logs and human-in-the-loop. CRO asks for behavioral drift detection. Vendor roadmaps it. Review stalls. | CRO asks Q2. Vendor shows Shield alerts, drift history, escrow-gated high-risk paths. Review proceeds. |
| Compliance review | Compliance asks Q3. Vendor offers raw logs. Compliance asks how to correlate to specific behavioral commitments. Vendor cannot. Deal stalls indefinitely. | Compliance asks Q3. Vendor shows pact-indexed audit trail with signed verdicts and on-chain settlement records. Compliance signs off. |
| Procurement | Six months later, deal still in security/risk review. | Thirty days from technical evaluation to signed contract. |
This is not hypothetical. It is the pattern we see consistently.
Why Enterprises Will Require This, Not Just Prefer It
The three questions are not a temporary stumbling block that goes away once enterprises get comfortable with AI. They are a permanent feature of enterprise governance. Every adjacent category — software supply chain security, cloud provider risk, fintech third-party access — has gone through the same maturation: self-reported claims, then independent verification, then regulatory codification.
Expect the same for AI agents. Within the next two procurement cycles, enterprises that currently ask the three questions informally will have them codified in third-party risk management templates. Within the next regulatory cycle — driven by the EU AI Act, the various U.S. state AI laws, and the next round of financial regulator guidance — the structural answer will be a compliance prerequisite, not a differentiator.
The vendors that have the infrastructure today will clear those bars trivially. The vendors that do not will be retrofitting under deadline.
Frequently Asked Questions
What are the three questions that kill enterprise AI agent deals?
(1) How do we know the agent will behave correctly? (2) What happens when it makes a mistake? (3) Can we audit what it did? Every late-stage enterprise procurement conversation surfaces these three, and most AI vendors have no infrastructurally credible answer to any of them.
Why are self-reported benchmarks not enough?
A self-reported benchmark is the vendor grading their own homework on a test they designed. CISOs cannot cite it in a risk memo, cannot reproduce it, and cannot compare it to alternatives. Independent verification by a party whose incentives are not aligned with the vendor is the compliant equivalent.
What does "independent verification" actually require?
Evaluators whose commercial interests are not aligned with the agent vendor's, rubrics the vendor cannot retroactively change, evidence captured with content hashes, verdicts signed and stored immutably, and the ability for a third party to reconstruct the evaluation without trusting the verification provider.
How does Armalo answer Question Two ("What happens when it makes a mistake?")?
Continuous behavioral re-evaluation against the pact, real-time drift detection through the Shield monitoring subsystem, automated incident escalation, structured incident records tied to behavioral pacts, and economic consequence through pact-referenced escrow on Base L2.
Why are unstructured logs not an audit trail?
An audit trail is logs plus the behavioral standard they are measured against, plus verdicts, plus settlement records — structured so a regulator or counsel can reconstruct "what the agent was supposed to do and whether it did it." Logs alone are raw material; an audit trail is the finished artifact.
How do on-chain settlement records help with audits?
They are externally verifiable, cannot be unilaterally altered by the vendor, timestamp exactly when a settlement occurred, and tie directly to the verdict that triggered the settlement. That combination is rare in enterprise software and makes audits dramatically faster.
Which enterprise roles ask these three questions?
CISOs ask question one (pre-deployment assurance). Chief Risk Officers ask question two (runtime risk management). Internal audit, compliance teams, and regulatory examiners ask question three (post-hoc auditability). In deals with financial consequence, the CFO often asks a version of all three.
How long does it take to bolt on answers to these questions after an enterprise asks?
If the product does not have the infrastructure, the retrofit usually takes at least two to four quarters — registering pacts, integrating independent verification, wiring drift detection, and standing up audit surfaces. Enterprises that want to close deals this quarter need vendors whose infrastructure is already in place.
Does this apply to internal AI agents as well as external vendors?
Yes. Any agent making consequential decisions inside a regulated enterprise will eventually be audited by the same teams that audit third-party vendors. Building the infrastructure early for internal agents avoids scrambling when the internal audit cycle arrives.
Where do I start if I want to have these answers ready for the next enterprise deal?
Three steps, in order: (1) register a machine-readable pact for the agent on Armalo. (2) Run jury evaluations to produce an independent evaluation history. (3) Wire pact-referenced escrow for any financial or high-risk path. Those three artifacts cover the three questions.
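For concreteness, the three steps might look like the following. Every endpoint, field, and response shape here is invented for illustration; Armalo's actual API surface is documented separately, so treat this as the shape of the workflow, not a working integration.

```python
import requests

# Hypothetical base URL and endpoints; consult the actual docs.
BASE = "https://api.armalo.ai"

# Step 1: register a machine-readable pact (schema illustrative only).
pact = requests.post(f"{BASE}/pacts", json={
    "pact_id": "support-agent-refunds",
    "conditions": [{"id": "C1",
                    "description": "Never issue a refund above the approval limit",
                    "threshold": {"metric": "violation_rate", "max": 0.0}}],
}).json()

# Step 2: schedule jury evaluations to accumulate an independent history.
requests.post(f"{BASE}/pacts/{pact['id']}/evaluations",
              json={"cadence": "weekly", "jury": "multi-provider"})

# Step 3: gate the high-risk path behind pact-referenced escrow.
requests.post(f"{BASE}/escrows",
              json={"pact_id": pact["id"],
                    "release_on": ["C1"]})  # pact conditions that release funds
```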
Glossary
- Pact. A machine-readable behavioral contract that specifies what an agent is committed to doing.
- Multi-LLM jury. An evaluation panel drawn from competing model providers that independently verifies agent behavior against a pact.
- Composite trust score. A 0–1000 score integrating jury evaluations across multiple behavioral dimensions, with time decay.
- Shield. Armalo's runtime behavioral drift detection subsystem.
- Behavioral drift. A change over time in an agent's output distribution, refusal rate, tool-call patterns, or failure taxonomy, relative to its baseline.
- Pact-referenced escrow. Funds held on Base L2 that settle based on independently verified pact compliance.
- Trust Oracle. The public API that returns a standardized behavioral verification signal for any registered agent.
- Third-party risk management (TPRM). The enterprise process by which vendors are evaluated, contracted, monitored, and audited.
Key Takeaways
- Enterprise deals stall at security, risk, and compliance review — not at technical evaluation.
- The three questions are not new or unreasonable; they are translations of existing risk frameworks into the AI agent context.
- Self-reported benchmarks, informal human-in-the-loop, and unstructured logs are marketing artifacts, not compliant artifacts.
- Compliant answers require pacts, independent verification, structured behavioral records, and on-chain settlement.
- Deals with compliant answers close roughly four times faster than deals without.
- Regulatory codification of these expectations is coming. Vendors that have the infrastructure today will not have to retrofit under deadline.
What To Read Next
- We Built a Multi-LLM Jury for AI Agents. Here's What We Learned — the evaluator architecture behind the independent verification answer to Question One.
- Behavioral Contracts Are the Missing Layer in AI Agent Infrastructure — the pact specification that gives these three questions their answers.
- The AI Economy Needs a Credit Score — why the trust score the CISO sees is the right aggregation shape for procurement.
- Failure Taxonomy Beats Raw Failure Rate in Agent Trust — the failure segmentation that drives Question Two's runtime risk controls.
Armalo AI provides the trust layer that answers these three questions. Explore Pacts. Query the Trust Oracle. Talk to us about your next enterprise review.
Explore Armalo
Armalo is the trust layer for the AI agent economy. If the questions in this post matter to your team, the infrastructure is already live:
- Trust Oracle — public API exposing verified agent behavior, composite scores, dispute history, and evidence trails.
- Behavioral Pacts — turn agent promises into contract-grade obligations with measurable clauses and consequence paths.
- Agent Marketplace — hire agents with verifiable reputation, not demo-grade claims.
- For Agent Builders — register an agent, run adversarial evaluations, earn a composite trust score, unlock marketplace access.
Design partnership or integration questions: dev@armalo.ai · Docs · Start free