What Every CTO Should Ask Before Deploying an AI Agent
The standard due diligence checklist for AI agents is capability-focused and insufficient. The questions that actually predict deployment success are behavioral, not technical β and most organizations aren't asking them.
Continue the reading path
Topic hub
Behavioral ContractsThis page is routed through Armalo's metadata-defined behavioral contracts hub rather than a loose category bucket.
Next Read
Behavioral Pacts: The Legal Contract Layer the Agent Economy Is Missing
Contracts govern every consequential economic relationship. The agent economy is conducting consequential economic relationships without contracts. Behavioral pacts are the missing primitive β and formalizing what an agent will and will not do before deployment changes the enterprise risk calculus entirely.
Turn this trust model into a scored agent.
Start with a 14-day Pro trial, register a starter agent, and get a measurable score before you wire a production endpoint.
The Due Diligence Framework Is Wrong
When a CTO evaluates an AI agent for enterprise deployment, the typical due diligence process looks something like this: review the capability benchmark scores, run the vendor's demo, conduct a pilot test on representative tasks, review the security documentation, check the SOC 2 status, negotiate the SLA.
This process is not wrong. It is insufficient. Capability benchmarks measure what an agent can do on curated test sets. Demos are performed under favorable conditions. Pilot tests are too short to surface the failure modes that matter most β the long-tail behaviors that appear under edge conditions that no test suite covers deliberately.
The questions that predict deployment success are not about capability. They are about behavioral boundaries, failure modes, and accountability mechanisms. Most vendors are not asked these questions. Most CTOs do not know they should be asking them. The result is a systematic mismatch between what is evaluated in due diligence and what actually determines whether the deployment succeeds.
Question 1: What Is This Agent Explicitly Prohibited From Doing?
Most agent documentation describes what an agent can do. Very few describe what it will never do, regardless of instruction. The distinction matters enormously.
Want a verified trust score on your own agent? $10 to start β $5 goes straight into platform credits, $2.50 seeds your agent's bond. Armalo runs the same 12-dimension audit you just read about.
Get started β $10 βAn agent with no explicit prohibitions will, under the right inputs, do almost anything. The absence of a hard boundary is not the same as the presence of a soft one. When you ask a vendor "what are the hard prohibitions on this agent's behavior?" and the answer involves vague principles rather than specific actions, that is a meaningful signal.
The answer you want: a specific list of actions the agent will refuse regardless of instruction, stated in machine-readable form with evaluation methodology for how prohibition compliance is verified. The answer you will usually get: a capability description that does not distinguish between authorized and prohibited actions.
Question 2: How Was the Behavioral Scope Evaluated, Not Demonstrated?
Every vendor demo shows the agent doing the right thing in scenarios the vendor selected. That tells you what the agent was designed to do. It does not tell you what the agent actually does across the distribution of real inputs.
The question to ask is not "can you show me the agent performing correctly?" but "how did you evaluate its behavior across the full input distribution, including adversarial cases?" Evaluation is different from demonstration. Demonstration shows best-case behavior. Evaluation measures behavior across the distribution, including cases specifically designed to probe for failure.
An agent that has been through adversarial evaluation β with red-team scenarios designed to probe scope boundaries, escalation avoidance, and confabulation under uncertainty β has a very different risk profile than one that passed a curated demo set. The evaluation methodology, not the demo results, is what matters.
Question 3: What Happens When the Agent Is Uncertain?
Agent behavior under uncertainty is the most predictive single variable for enterprise deployment reliability. An agent that escalates appropriately when it is uncertain β even if that means lower task completion rates β will have fewer serious incidents than one that optimizes completion rate by producing confident-sounding outputs regardless of actual certainty.
Ask specifically: what happens when the agent encounters a task outside its training distribution? What happens when a user's request is ambiguous and could be interpreted in multiple ways? What happens when the agent has low confidence in its output? The answers to these questions predict failure mode frequency more accurately than any capability benchmark.
The answer you want: a defined escalation protocol that fires reliably on specific uncertainty conditions, with metrics demonstrating escalation rate matches expected frequency given the task distribution. The answer you will usually get: a vague claim that the agent "knows its limits."
Question 4: Can You Show Me an Audit Trail From a Real Deployment?
Security and compliance teams routinely ask for audit capabilities before deployment. The AI governance equivalent β an audit trail of agent decisions with sufficient detail to reconstruct why the agent took each action β is less commonly requested but equally important.
The audit trail requirement is not just about compliance. It is about incident response. When something goes wrong, the time-to-resolution depends heavily on how quickly you can answer "what did the agent do, in what context, and on what basis?" An agent without an auditable decision trail can create incidents that are impossible to fully diagnose after the fact.
Ask for an actual audit trail from a real deployment, with the data fields that are captured for each agent decision. The absence of this capability in a vendor's demonstration is a significant gap.
Question 5: What Are the Escalation Triggers, and How Were They Set?
Escalation triggers β the conditions under which the agent pauses and requests human review β are the safety valve for behavioral uncertainty. Every enterprise deployment should have them. The critical questions are: what are they specifically, and how were they determined?
Escalation triggers set too broadly produce an agent that escalates constantly, creating operational overhead that defeats the productivity purpose of deployment. Set too narrowly, they miss the cases that actually need human review. The calibration of escalation triggers is an engineering judgment that should be made by someone who has studied the failure mode distribution for similar agents in similar contexts.
Ask to see the escalation trigger specification and the methodology used to calibrate it. If the vendor cannot articulate the methodology, the triggers are probably not well-calibrated.
Question 6: What Is the Economic Consequence of a Behavioral Failure?
This question surfaces whether the vendor has genuine skin in the game on behavioral reliability. A vendor who is confident in their agent's behavioral consistency will be willing to structure commercial terms that reflect that confidence β escrow mechanisms that withhold payment when pact violations are detected, performance bonds that the vendor forfeits for defined behavioral failures, SLAs that specify behavioral metrics not just availability metrics.
A vendor who insists on pure capability-based pricing with no behavioral accountability mechanisms is telling you something. Either they do not believe their agent is reliable enough to bet on, or they have not thought carefully about what behavioral reliability means for their product.
You do not need to demand a punitive penalty structure. But you should require that the commercial terms include at least some alignment between the vendor's economic incentives and the agent's actual behavioral reliability in production.
Question 7: How Does the Agent Behave When Prompted to Violate Its Scope?
This is a test you can run yourself, but you should also ask the vendor how they have tested it. Give the agent an instruction that falls outside its defined scope, framed in a way that makes the out-of-scope action seem locally reasonable. What happens?
Agents with well-tested boundary enforcement will decline and explain why, consistently, even when the framing makes the prohibited action seem harmless or beneficial. Agents with shallow boundary enforcement will comply under the right framing, which is precisely the condition that characterizes real-world adversarial use.
The vendor's answer to "how have you tested boundary enforcement under adversarial instruction?" tells you more about their investment in behavioral reliability than any benchmark score.
Question 8: What Does the Behavioral Record Look Like Over Time?
A new agent with impressive capability scores and no behavioral history has a risk profile that is fundamentally different from an agent with 18 months of verified deployment data. The behavioral record β task history, pact compliance data, escalation pattern data, incident log β is the most predictive indicator of future behavior available.
Ask vendors how long their agent has been deployed in production, what the behavioral track record looks like across that period, and whether that record is verifiable by an independent party. The unwillingness to provide this data β or the inability to produce it because the agent has no behavioral record infrastructure β is itself a significant indicator.
The Uncomfortable Conclusion
Asking these eight questions will make some vendor relationships awkward. Most vendors are not prepared to answer them in the depth they deserve. Some will deflect to capability claims. Some will provide vague assurances that do not address the specific governance questions.
That discomfort is useful information. An agent vendor that has not thought carefully about behavioral boundaries, adversarial evaluation, audit trails, and escalation calibration is a vendor that has not thought carefully about enterprise deployment reliability. The gaps in their answers predict the gaps in their product.
The goal of this due diligence is not to fail vendors. It is to identify the specific governance gaps that will require compensating controls on the enterprise side β additional monitoring, tighter escalation thresholds, more frequent behavioral review β and to make that compensating investment explicitly rather than discovering it after the first incident. The questions are not gotchas. They are the prerequisite for a deployment decision that is made with open eyes.
The Trust Score Readiness Checklist
A 30-point checklist for getting an agent from prototype to a defensible trust score. No fluff.
- 12-dimension scoring readiness β what you need before evals run
- Common reasons agents score under 70 (and how to fix them)
- A reusable pact template you can fork
- Pre-launch audit sheet you can hand to your security team
Turn this trust model into a scored agent.
Start with a 14-day Pro trial, register a starter agent, and get a measurable score before you wire a production endpoint.
Put the trust layer to work
Explore the docs, register an agent, or start shaping a pact that turns these trust ideas into production evidence.
Comments
Loading commentsβ¦