From Prototype to Production: The AI Agent Deployment Checklist
A practical, opinionated 12-step checklist for deploying AI agents to production. Not generic best practices — specific to autonomous agents with real-world authority and real failure modes.
Most AI agent deployment guides read like they were written for demos. They cover model selection, prompt engineering, basic error handling, and tool integration — the things that matter for getting an agent working in a controlled environment. They don't cover the things that matter for keeping an agent working reliably in production, maintaining accountability when something goes wrong, and scaling confidently to more agents and more complex tasks.
This checklist is different. It's written for engineers and product leaders who are deploying agents that have real-world authority — that can take actions, make decisions, and produce outputs that affect real people and real money. It's opinionated where opinions are warranted. It assumes you've already done the development work and you're ready to transition to production.
Twelve steps. For each: what to do, why it matters, and what goes wrong if you skip it.
TL;DR
- Identity before everything: A production agent without cryptographic identity has no accountability foundation — everything else is built on sand.
- Behavioral contracts must precede deployment: Defining what the agent should do is a prerequisite for detecting when it isn't.
- Evaluation baseline is pre-deployment, not post-incident: You need a quantitative baseline before you can measure drift — not after drift has caused damage.
- Financial guardrails prevent asymmetric risk: The worst-case scenario for an agent with financial access and no guardrails is catastrophic and potentially irreversible.
- Monitoring without evaluation is blind: Standard monitoring tells you the agent is running, not whether it's running correctly.
- Human escalation paths must be technical, not procedural: Policies that say "humans should review X" don't work at scale. Technical enforcement does.
Step 1: Register Cryptographic Agent Identity
What to do: Register the agent with a DID (Decentralized Identifier) that's linked to the organization's ownership. Generate a unique API key scoped to the agent's declared permissions. Configure all API calls to carry the agent's identity in request signatures.
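A minimal sketch of identity-carrying request headers. HMAC-SHA256 stands in for the asymmetric signature a real DID key pair would produce, and the header names (`X-Agent-DID`, etc.) are illustrative, not a defined standard:

```python
import hashlib
import hmac
import json

def sign_request(agent_did: str, agent_version: str, payload: dict, secret_key: bytes) -> dict:
    """Attach agent identity and a request signature to an outgoing API call.

    Illustrative only: HMAC stands in for a DID-keyed asymmetric signature,
    and the header names are hypothetical.
    """
    body = json.dumps(payload, sort_keys=True)
    message = f"{agent_did}|{agent_version}|{body}".encode()
    signature = hmac.new(secret_key, message, hashlib.sha256).hexdigest()
    return {
        "X-Agent-DID": agent_did,          # which agent acted
        "X-Agent-Version": agent_version,  # which version of it
        "X-Agent-Signature": signature,    # proof the identity key signed this call
    }

headers = sign_request("did:example:agent-7", "2.4.1", {"action": "quote"}, b"demo-key")
```

The point is that identity and version travel with every call, so the receiving side can attribute the action without consulting shared credentials.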
Why it matters: Without cryptographic identity, every action the agent takes is attributed to the human user under whose credentials it operates. When an incident occurs, the investigation becomes a manual archaeology project through logs and timestamps. With cryptographic identity, every action is unambiguously attributed to a specific agent version, enabling automated audit trail generation and precise incident scoping.
What goes wrong if you skip it: Shared credentials make incident attribution difficult and regulatory compliance nearly impossible. You can't answer "which agent did this?" with confidence if agents share identities.
Step 2: Define Behavioral Contracts (Pacts)
What to do: Write behavioral pacts for each declared capability. Each pact condition must be specific (not "high accuracy" but "accuracy above 94% on the test case set, measured by deterministic comparison to reference outputs"), measurable (a specific metric with a threshold), and verifiable (automated evaluation is possible). Include failure mode conditions — not just happy path.
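One way to make "specific, measurable, verifiable" concrete is to express each pact condition as data plus a check, rather than prose. The structure below is a sketch, not a prescribed pact schema:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PactCondition:
    """One verifiable pact condition: a named metric, a threshold, and a comparator."""
    name: str
    metric: str
    threshold: float
    comparator: Callable[[float, float], bool]  # (observed, threshold) -> holds?

    def holds(self, observed: float) -> bool:
        return self.comparator(observed, self.threshold)

# Specific (named metric), measurable (numeric threshold), verifiable (automated check).
accuracy = PactCondition(
    name="test-set accuracy",
    metric="exact_match_rate",
    threshold=0.94,
    comparator=lambda observed, limit: observed > limit,
)
```

A condition written this way can be evaluated by a machine, which is what makes it usable as monitoring ground truth and as dispute evidence.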
Why it matters: Pact conditions are the ground truth that your evaluation system monitors against. Without them, you're monitoring against an undefined standard, which is the same as not monitoring. Pacts also serve as the contract in disputes: if a buyer claims the agent underperformed, the pact defines what "underperformed" means.
What goes wrong if you skip it: Your monitoring system has no baseline to detect deviations from. Disputes with buyers become subjective arguments rather than objective evaluations against defined criteria. Your evaluation results are uninterpretable because you haven't defined what you're evaluating.
Step 3: Establish Evaluation Baseline
What to do: Before the agent goes live, run a comprehensive evaluation pass against all declared pact conditions. Record the results as your baseline: per-condition scores, aggregate composite score, dimensional breakdown. Store these results with version identifiers for both the agent and the evaluation harness.
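The baseline record can be as simple as a dictionary keyed to both version identifiers. A sketch (the field names and equal-weight composite are assumptions, not a defined format):

```python
def record_baseline(agent_version: str, harness_version: str,
                    condition_scores: dict) -> dict:
    """Store per-condition scores plus a composite, keyed to agent AND harness versions.

    Equal-weight averaging is an illustrative choice; a real composite
    may weight conditions differently.
    """
    composite = sum(condition_scores.values()) / len(condition_scores)
    return {
        "agent_version": agent_version,
        "harness_version": harness_version,
        "per_condition": condition_scores,
        "composite": round(composite, 4),
    }

baseline = record_baseline("2.4.1", "harness-1.0",
                           {"accuracy": 0.87, "latency_slo": 0.99})
```

Recording the harness version matters as much as the agent version: a score change is only interpretable if you know whether the test harness changed too.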
Why it matters: You need a quantitative baseline to measure drift against. If the agent's accuracy score starts at 87% and drops to 78%, that's a 9-point decline that may or may not cross your alerting threshold. If you don't have a baseline, you don't know the score started at 87% — you only know it's at 78% now, which is meaningless without context.
What goes wrong if you skip it: Behavioral drift goes undetected until it's severe enough to trigger a monitoring alert or cause a user-visible failure. By that point, the damage may have been accumulating for weeks.
Step 4: Configure Financial Guardrails
What to do: Define the agent's authorized financial limits: maximum single-transaction amount, maximum daily spend, maximum escrow authorization, and actions that require human approval regardless of amount. Implement these as technical enforcement, not policy documentation. For agents with financial access, configure USDC escrow for outcome-based payment and a credibility bond sized relative to maximum single-transaction exposure.
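"Technical enforcement, not policy documentation" means the limit check sits in the execution path. A minimal sketch of the gate, with illustrative limit names and amounts:

```python
def authorize_transaction(amount: float, daily_spent: float, limits: dict):
    """Return (approved, reason). When a cap is hit, the action is blocked, not logged-and-allowed.

    Limit names and values are illustrative.
    """
    if amount > limits["max_single"]:
        return False, "exceeds single-transaction cap: escalate to human"
    if daily_spent + amount > limits["max_daily"]:
        return False, "would exceed daily cap: escalate to human"
    if amount >= limits["human_approval_floor"]:
        return False, "above human-approval floor: route for sign-off"
    return True, "within limits"

LIMITS = {"max_single": 500.0, "max_daily": 2_000.0, "human_approval_floor": 250.0}
```

Note that the gate fails closed: anything it cannot affirmatively approve is routed to a human rather than executed.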
Why it matters: AI agent failures that involve financial actions can be immediate and irreversible. An agent with unrestricted financial access and no guardrails creates asymmetric risk: the upside is operational efficiency, the downside is potentially catastrophic loss. Guardrails cap the downside without eliminating the upside.
What goes wrong if you skip it: A financial error by an unguarded agent can propagate before detection. The typical detection delay for AI agent financial errors (when monitoring is standard, not purpose-built) is 2-24 hours. The maximum financial damage in that window is determined by your authorization limits — or by the lack of them.
Step 5: Establish Continuous Evaluation Schedule
What to do: Configure automated evaluation runs to execute on a regular schedule (daily at minimum for high-stakes agents) and on production traffic samples (10-20% sample rate recommended as a starting point). Configure alerting: specific thresholds for each pact condition, aggregate composite score alerting, and behavioral drift alerting when any dimension moves more than X% in a 7-day window.
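The drift check itself is simple once the baseline exists: compare each dimension's current score against its score at the start of the window. A sketch, with a 5% relative-change threshold chosen for illustration:

```python
def drift_alerts(scores_7d_ago: dict, scores_now: dict,
                 max_relative_change: float = 0.05) -> list:
    """Flag any dimension that moved more than max_relative_change over the window.

    The 5% default is illustrative; tune per pact condition.
    """
    alerts = []
    for dim, old in scores_7d_ago.items():
        new = scores_now[dim]
        if old and abs(new - old) / old > max_relative_change:
            alerts.append((dim, old, new))
    return alerts
```

With the baseline from Step 3 in hand, an accuracy slide from 0.87 to 0.78 fires immediately instead of surfacing as a user-visible failure weeks later.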
Why it matters: Pre-deployment evaluation tells you the agent is capable under ideal conditions. Continuous evaluation tells you it's remaining capable under production conditions. Model provider updates, distribution shifts in incoming requests, and gradual prompt degradation are all sources of behavioral drift that pre-deployment testing won't catch.
What goes wrong if you skip it: See the forensic analysis in "Anatomy of an AI Agent Failure": silent behavioral drift accumulates over days before accidental discovery, by which time the damage has propagated through dependent decisions.
Step 6: Configure Human Escalation Triggers
What to do: Define the specific conditions that route agent decisions to human review rather than autonomous execution. These must be technically enforced: if a trigger condition is met, the agent cannot take the action without human approval. Common triggers: tasks above a materiality threshold, confidence scores below a configurable minimum, task types where the agent's pact condition scores are weakest, first-encounter patterns (tasks the agent has never handled before), and any output that the agent's self-assessment flags as uncertain.
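The gate can be one boolean function in the execution path that fires if any trigger condition holds. A sketch covering the trigger list above (field and config names are illustrative):

```python
def requires_human_review(task: dict, seen_task_types: set, config: dict) -> bool:
    """Technically enforced escalation gate: True means the agent may not proceed alone.

    Trigger names mirror the list above; field names are illustrative.
    """
    return any([
        task["value"] >= config["materiality_threshold"],   # materiality
        task["confidence"] < config["min_confidence"],      # low confidence
        task["type"] not in seen_task_types,                # first-encounter pattern
        task.get("self_flagged_uncertain", False),          # self-assessment flag
    ])

CONFIG = {"materiality_threshold": 1000.0, "min_confidence": 0.7}
```

Because the gate runs in code, escalation does not depend on the agent "knowing" the policy or on a human remembering to check.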
Why it matters: "Humans should review high-stakes decisions" is a policy statement. Without technical enforcement, it relies on the agent respecting the policy (it has no awareness of organizational policy) and on humans being available and motivated to review. Technical enforcement makes escalation unconditional.
What goes wrong if you skip it: Agents will handle tasks they're not qualified for because there's no mechanism to stop them. This is how scope creep failures develop.
Step 7: Implement Rollback Capability
What to do: Maintain a certified rollback state for the agent — a version snapshot with known evaluation results that you can restore to immediately. Establish the procedure for triggering rollback: who has authority, what conditions trigger it, what the rollback procedure is technically, and how long the rollback state needs to be maintained before being replaced by a new certified baseline.
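At minimum the rollback state is a named target plus a one-step restore. A sketch of that shape (the class and method names are illustrative, not a deployment API):

```python
class RollbackRegistry:
    """Track the certified known-good version and restore to it in one step.

    Illustrative sketch; a real implementation would restore the full
    version snapshot (model, prompts, config), not just a version string.
    """
    def __init__(self):
        self.certified = None  # (version, baseline_scores)
        self.active = None

    def certify(self, version: str, baseline_scores: dict):
        """Record a version with known evaluation results as the rollback target."""
        self.certified = (version, baseline_scores)

    def deploy(self, version: str):
        self.active = version

    def rollback(self) -> str:
        """Restore to the certified state; fail loudly if no target exists."""
        if self.certified is None:
            raise RuntimeError("no certified rollback target")
        self.active = self.certified[0]
        return self.active
```

The key property is that "previous version" is answered before the incident, not during it.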
Why it matters: When something goes wrong in production, the fastest path to stopping the damage is restoring to a known-good state. Without a certified rollback target and a tested rollback procedure, "restore to previous version" becomes an hours-long investigation to determine what "previous version" means and whether it was actually better.
What goes wrong if you skip it: Incident response becomes significantly slower. Instead of executing a practiced rollback procedure, the team is making real-time decisions about what to roll back to while the issue continues.
Step 8: Establish Audit Logging
What to do: Implement immutable audit logging that captures for every agent action: agent identity (DID + version), timestamp, inputs, outputs, tool calls (with pre- and post-validation results), confidence scores, and any escalation events. Log storage must be write-once (append-only, no deletion without audit event). Configure log retention appropriate to your regulatory requirements (typically 3-7 years for financial and healthcare applications).
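One common way to make an application-level log tamper-evident is hash chaining: each entry includes the previous entry's hash, so any retroactive edit breaks verification. A sketch (real deployments would pair this with write-once storage):

```python
import hashlib
import json

class AuditLog:
    """Append-only log; each entry hashes the previous one, so tampering is detectable.

    Sketch only: durability and deletion controls belong to the storage layer.
    """
    def __init__(self):
        self.entries = []

    def append(self, agent_did: str, version: str, action: str,
               inputs: dict, outputs: dict) -> str:
        prev_hash = self.entries[-1]["hash"] if self.entries else "genesis"
        record = {
            "agent": f"{agent_did}@{version}",  # identity + version, per the step
            "action": action,
            "inputs": inputs,
            "outputs": outputs,
            "prev": prev_hash,
        }
        record["hash"] = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()).hexdigest()
        self.entries.append(record)
        return record["hash"]

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry fails the check."""
        prev = "genesis"
        for e in self.entries:
            body = {k: v for k, v in e.items() if k != "hash"}
            expected = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()).hexdigest()
            if e["prev"] != prev or e["hash"] != expected:
                return False
            prev = e["hash"]
        return True
```

This is what "can be trusted" means operationally: an investigator can verify the chain instead of taking the log's integrity on faith.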
Why it matters: Audit logs are the evidence base for everything: incident investigations, regulatory inquiries, dispute resolution, and insurance claims. Logs that don't exist or can't be trusted (because they're mutable) don't serve any of these purposes.
What goes wrong if you skip it: Incident investigations become speculative. Regulatory inquiries generate discovery requests that can't be fulfilled. Disputes that should resolve in hours take weeks.
Step 9: Configure Certification and Renewal Schedule
What to do: Establish a certification schedule — a regular (typically quarterly) comprehensive evaluation pass that produces a signed certification artifact documenting the agent's current behavioral state. The certification artifact includes: evaluated pact conditions with results, composite score with dimensional breakdown, test harness version, and a human sign-off from whoever is operationally responsible for the agent.
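The certification artifact is just the evaluation record plus an explicit human name attached. A sketch of that bundle (field names and equal-weight composite are assumptions):

```python
from datetime import date

def certification_artifact(agent_version: str, harness_version: str,
                           condition_results: list, signer: str) -> dict:
    """Bundle a certification pass into one record with explicit human sign-off.

    Field names are illustrative; the composite uses equal weights for simplicity.
    """
    composite = sum(r["score"] for r in condition_results) / len(condition_results)
    return {
        "date": date.today().isoformat(),
        "agent_version": agent_version,
        "harness_version": harness_version,
        "conditions": condition_results,
        "composite": round(composite, 4),
        "signed_off_by": signer,  # the human operationally responsible
    }
```

The `signed_off_by` field is the point: it turns "the agent was evaluated" into "this person attested, on this date, that the agent meets its declared standards."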
Why it matters: Certification creates accountability checkpoints. Whoever signs off on the quarterly certification is attesting that the agent has been evaluated and meets its declared standards. This creates clear human accountability at regular intervals, not just at initial deployment.
What goes wrong if you skip it: "Nobody is responsible" becomes the default when things go wrong. Without explicit accountability checkpoints, responsibility diffuses. Certification creates a clear record of who attested to what and when.
Step 10: Define Incident Response Procedures
What to do: Document and test the incident response procedure before you need it. The procedure should include: detection triggers and who gets notified, severity classification criteria, immediate response actions (rollback vs. suspend vs. investigate), backward audit procedure for quantifying scope of an identified issue, communication templates for stakeholder notification, and post-incident review schedule.
Why it matters: Incident response quality degrades dramatically under pressure. Procedures that haven't been tested will be executed slowly and incorrectly when the actual incident occurs. Testing the procedures — tabletop exercises, controlled drills — is what makes them reliable.
What goes wrong if you skip it: Incidents take longer to contain, scope quantification is delayed, and stakeholder communication is inconsistent. Each of these compounds the damage.
Step 11: Register with Trust Infrastructure
What to do: Register the agent's identity and pact declarations with Armalo's trust layer. Stake a credibility bond sized relative to the agent's declared capability scope and financial authority. Configure the agent's public AgentCard with accurate capability declarations and current certification status. Enable cross-platform identity verification so counterparties can verify the agent's credentials independently.
Why it matters: Trust infrastructure registration enables your agent to participate in the agent economy — to be discovered, evaluated by counterparties, and engaged for work with financial accountability. An unregistered agent has no portable identity and no mechanism for counterparties to verify claims.
What goes wrong if you skip it: The agent exists only within your own ecosystem. It can't participate in agent marketplaces, can't engage in escrow-protected commerce, and can't build the portable behavioral track record that creates long-term competitive advantage.
Step 12: Establish Ongoing Performance Review
What to do: Schedule regular performance reviews (monthly for high-stakes agents, quarterly for lower-stakes) that go beyond the automated evaluation results. Reviews should examine: trends in evaluation scores across dimensions, patterns in escalation triggers (are certain task types generating disproportionate escalations?), anomalies in transaction reputation scores, and any near-misses or low-confidence incidents that didn't generate formal incidents but should be reviewed.
Why it matters: Automated evaluation catches metric-level issues. Performance reviews catch systemic issues that are visible in patterns but not in individual alerts: a gradual drift in a specific task type, an escalation pattern that suggests the agent's scope is too broad, a counterparty rating trend that precedes a reputation score drop.
What goes wrong if you skip it: Systemic problems accumulate without surfacing until they're severe. The agent equivalent of a process that's been degrading for a year before the year-end audit catches it.
Full Deployment Checklist
| Step | Deliverable | Owner | Risk If Skipped |
|---|---|---|---|
| 1. Identity registration | DID + scoped API key + attribution config | Engineering | Attribution impossible, audit fails |
| 2. Behavioral contracts | Signed pact document with 3+ verified conditions | Product + Engineering | No evaluation ground truth |
| 3. Evaluation baseline | Baseline report with scores per condition | Engineering | Drift undetectable |
| 4. Financial guardrails | Limit config + escrow setup + bond staked | Finance + Engineering | Asymmetric financial risk |
| 5. Continuous eval | Evaluation schedule + alert thresholds | Engineering | Silent degradation |
| 6. Escalation triggers | Technically enforced trigger conditions | Engineering | Scope creep failures |
| 7. Rollback capability | Certified rollback state + tested procedure | Engineering | Slow incident response |
| 8. Audit logging | Immutable append-only log + retention config | Engineering | Investigation impossible |
| 9. Certification schedule | Quarterly certification process with sign-off | Operations + Legal | No accountability checkpoints |
| 10. Incident response | Tested procedure + communication templates | Operations | Damage compounds during incidents |
| 11. Trust registration | Armalo registration + bond + AgentCard | Product | No portable identity or marketplace access |
| 12. Performance review | Monthly/quarterly review schedule | Operations | Systemic issues accumulate |
Frequently Asked Questions
How long does going through this checklist take for a new agent? For a well-documented agent with clear capability scope, steps 1-12 typically take 2-3 weeks of engineering and process work. The bottleneck is usually behavioral contract definition (step 2) — getting precise about what the agent should do is harder than it sounds. For organizations deploying multiple agents, later agents benefit from templates and infrastructure built for the first.
Can steps be done in parallel? Steps 1-3 need to be sequential (identity before contracts, contracts before baseline). Steps 4-8 can largely be done in parallel with each other. Steps 9-12 are ongoing processes that are configured before launch and run continuously after. The critical path is 1 → 2 → 3 → (4,5,6,7,8 in parallel) → launch → (9,10,11,12 ongoing).
What's the minimum viable checklist for a low-stakes agent? For genuinely low-stakes agents (no financial authority, narrow scope, human review on all outputs), you can start with steps 1, 2, 3, 5, and 8. The others are still important but the urgency is lower. As the agent's scope or authority increases, the remaining steps become mandatory.
How do you handle the checklist for an agent that's already in production? Start with the audit (step 8) — get logging working immediately because you need it for everything else. Then step 3 (establish a baseline from current performance). Then step 2 (formalize the behavioral contracts to match what the agent is actually doing). Then close the remaining gaps in priority order, starting with financial guardrails (step 4) and rollback capability (step 7).
Does every agent need its own credibility bond, or can a portfolio of agents share one? Every agent needs its own bond. A shared bond doesn't provide agent-level accountability — you can't determine which specific agent's performance failed to meet the bonded claim. The bond amount per agent is the variable; smaller-scope agents warrant smaller bonds.
Key Takeaways
- Identity registration is the non-negotiable foundation: without cryptographic attribution, every other step's accountability value is undermined.
- Behavioral contracts must be defined before deployment, not after an incident forces definition. "We don't know exactly what it should do" is a reason to delay deployment, not to deploy and learn later.
- Financial guardrails create a hard cap on asymmetric risk — they don't prevent the agent from operating effectively, they just limit the worst-case downside of an agent failure.
- Continuous evaluation is architecturally different from pre-deployment testing — it provides ongoing verification of production behavior, not just a one-time capability demonstration.
- Human escalation paths must be technically enforced. Policy-based escalation fails because agents don't read policies and humans are subject to cognitive load and availability constraints.
- Audit logging must be immutable and complete from day one. Retroactive logging is expensive, incomplete, and often legally insufficient for regulatory inquiries.
- The 12 steps are not a checklist to complete once — they're an ongoing operational framework. Steps 9-12 in particular are continuous processes that sustain the accountability infrastructure over time.
Armalo Team is the engineering and research team behind Armalo AI, the trust layer for the AI agent economy. Armalo provides behavioral pacts, multi-LLM evaluation, composite trust scoring, and USDC escrow for AI agents. Learn more at armalo.ai.
Put the trust layer to work
Explore the docs, register an agent, or start shaping a pact that turns these trust ideas into production evidence.