How to Evaluate an AI Agent Before Deploying It in Production: A Framework
Deploying an AI agent without a systematic pre-production evaluation is how organizations get their first high-profile AI failure. This seven-step framework covers everything from defining behavioral pacts to canary deployment to drift monitoring — giving teams a structured approach to knowing what they're deploying before it's in production.
Most AI agent production failures are not surprising in retrospect. They're predictable given the evaluation methodology — or lack thereof — that preceded deployment. The agent was evaluated informally, or evaluated on the wrong criteria, or evaluated thoroughly once and then deployed without ongoing verification. The failure mode was present before deployment; nobody was looking for it.
This post presents a seven-step framework for pre-production AI agent evaluation that systematically surfaces the failure modes that matter before they surface in production. It's not theoretical — each step has specific, executable methods, pass thresholds where appropriate, and clear outputs that determine whether to proceed to the next step.
The framework is opinionated. Some steps can be abbreviated for low-stakes deployments; none of them should be skipped for agents operating in consequential environments.
TL;DR
- Step 1 — Behavioral pact definition: Define what the agent will and won't do before evaluating whether it does those things — evaluation without a behavioral baseline is meaningless.
- Step 2 — Adversarial evaluation: Test against inputs specifically designed to surface failure modes, not just typical inputs — normal-case performance is insufficient.
- Step 3 — Multi-LLM jury: Independent evaluation with four or more LLM providers establishes behavioral baselines without single-vendor bias.
- Step 4 — Score baseline establishment: 100-evaluation minimum to establish statistical reliability before deployment authorization.
- Step 5 — Canary deployment: Limited production exposure with automatic rollback enables learning from real traffic without full blast radius.
- Step 6 — Drift monitoring: Continuous post-deployment evaluation catches behavioral changes before they accumulate into incidents.
- Step 7 — Certification: Formal tier certification creates a documented baseline for future evaluation comparisons.
Evaluation Framework: Stage by Stage
| Stage | What to Measure | Pass Threshold | Output |
|---|---|---|---|
| 1. Behavioral pact definition | Completeness of behavioral specification | All consequential behaviors specified | Signed pact document |
| 2. Adversarial evaluation | Performance under adversarial inputs | <5% failure rate on adversarial set | Adversarial eval report |
| 3. Multi-LLM jury | Quality of outputs across providers | 75%+ jury agreement, avg score >7/10 | Jury verdict records |
| 4. Score baseline | Composite score across 12 dimensions | Score >600 (Bronze floor); >700 recommended for production | Initial trust score |
| 5. Canary deployment | Real-traffic behavior vs. pact baseline | <10% deviation from eval-environment behavior | Canary run report |
| 6. Drift monitoring setup | Continuous evaluation infrastructure | Evaluation running, alerts configured | Monitoring dashboard |
| 7. Certification | Tier requirements met | Bronze minimum for any production deployment | Certification record |
Step 1: Define Behavioral Pacts Before Evaluating Anything
This is the step most teams skip or rush, and it's the one that determines whether every subsequent step is measuring something meaningful.
You cannot evaluate an AI agent without first specifying what you're evaluating. "Is this agent good?" is not an evaluable question. "Does this agent produce structured JSON conforming to schema v2.1 with field X within 2 seconds, 95% of the time, without including any personally identifiable information outside the authorized fields?" is an evaluable question.
Behavioral pact definition requires answering three categories of questions for every consequential behavior of the agent.
Scope questions: What is this agent authorized to do? What is it explicitly prohibited from doing? What should happen when a request falls outside its authorized scope?
Quality questions: What does good output look like, specifically? What verification method establishes this? What is the minimum acceptable quality threshold?
Behavioral constraint questions: What behavioral rules apply regardless of input? What should the agent do when it doesn't know? What formatting, length, or structure requirements apply?
The output of this step is a signed pact document. Every subsequent step tests the agent against this document. If you can't write the document, you can't evaluate the agent — and you shouldn't deploy it.
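For teams that want the pact to be machine-checkable from day one, it can help to encode each clause as data. A minimal sketch in Python; the field names and clause structure are illustrative, not Armalo's pact schema:

```python
from dataclasses import dataclass

@dataclass
class PactClause:
    """One evaluable behavioral commitment. Field names are illustrative,
    not Armalo's pact schema."""
    behavior: str             # what the agent commits to (or must never do)
    verification: str         # how compliance is measured
    threshold: float          # minimum pass rate across evaluations
    prohibited: bool = False  # True for explicit "never do X" clauses

# Hypothetical pact for the data-analysis agent described above.
pact = [
    PactClause(
        behavior="Return structured JSON conforming to schema v2.1 within 2 seconds",
        verification="schema validation plus per-request latency measurement",
        threshold=0.95,
    ),
    PactClause(
        behavior="Include personally identifiable information outside authorized fields",
        verification="PII scanner run over every response",
        threshold=0.0,  # zero tolerance: any occurrence is a violation
        prohibited=True,
    ),
    PactClause(
        behavior="Return a structured uncertainty response for out-of-scope requests",
        verification="scope classifier plus response-format check",
        threshold=0.99,
    ),
]
```

If a behavior can't be written as a clause like this, with a verification method and a threshold, it isn't evaluable yet.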
Step 2: Adversarial Evaluation
Normal-case evaluation — testing the agent on typical, expected inputs — is necessary but not sufficient. Adversarial evaluation tests the agent on inputs specifically designed to surface failure modes.
Adversarial inputs for AI agents fall into several categories:
Edge cases: Inputs at the extreme boundaries of the agent's declared scope. A data analysis agent evaluated on well-formatted clean datasets will perform differently on malformed data, ambiguous date formats, and missing required fields. These edge cases are predictable; they should be in the adversarial set.
Prompt injection attempts: Inputs containing embedded instructions designed to redirect agent behavior. Even agents with good prompt injection resistance should be tested against a library of known injection patterns.
Out-of-scope requests: Requests that fall outside the agent's behavioral pact. Does the agent correctly return a structured uncertainty response, or does it attempt to serve the request with low confidence and undisclosed limitations?
Adversarial scoring inputs: Inputs designed to produce outputs that look good on naive metrics while being substantively poor. A summarization agent might produce perfect-length summaries that omit critical information; testing specifically for this failure mode requires adversarial inputs that create the temptation.
The eval-engine adversarial check framework provides a library of adversarial patterns for common agent categories. The typical pass threshold is a failure rate below 5% on the adversarial set; a higher rate indicates failure modes that will manifest in production.
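The gate itself is simple once the adversarial set exists. A sketch, assuming you supply your own `run_agent` call and a `violates_pact` judge; both are placeholders, not a real harness API:

```python
from collections import Counter

ADVERSARIAL_FAILURE_THRESHOLD = 0.05  # pass requires <5% failure rate

def adversarial_gate(adversarial_set, run_agent, violates_pact):
    """Run every adversarial input and report failure rates per category.
    `run_agent` and `violates_pact` are placeholders for your own harness."""
    failures, totals = Counter(), Counter()
    for case in adversarial_set:  # each case: {"category": ..., "input": ...}
        totals[case["category"]] += 1
        output = run_agent(case["input"])
        if violates_pact(case, output):
            failures[case["category"]] += 1

    overall = sum(failures.values()) / max(sum(totals.values()), 1)
    per_category = {cat: failures[cat] / totals[cat] for cat in totals}
    return overall < ADVERSARIAL_FAILURE_THRESHOLD, overall, per_category
```

The per-category breakdown matters as much as the overall rate: a 4% overall failure rate concentrated entirely in prompt injection is a different remediation problem than 4% spread evenly across edge cases.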
Step 3: Multi-LLM Jury Baseline
After adversarial evaluation has confirmed that the agent handles failure mode categories within acceptable parameters, the multi-LLM jury establishes a quality baseline for normal-case behavior.
The jury evaluation process:
Panel composition: Four to six LLM providers, selected for diverse capability profiles and training approaches. Each provider's model is configured with the same evaluation rubric, covering the specific behavioral dimensions relevant to the agent's function.
Input sampling: A representative sample of normal-case inputs, stratified by input category if the agent handles multiple types of requests. Typically 50-200 inputs for an initial baseline evaluation.
Evaluation execution: Each juror evaluates each output independently against the rubric. Results are collected without jurors seeing each other's evaluations (preventing anchoring effects).
Aggregation: Outlier trimming (top/bottom 20%), then trimmed mean computation per evaluation criterion. Criterion scores aggregate to the composite jury score.
Threshold assessment: Pass if: trimmed mean score above 7.0/10.0 overall, no individual criterion below 5.0/10.0, and jury agreement (proportion of jurors within ±1 point of trimmed mean) above 75%.
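The aggregation and threshold logic above reduces to a few dozen lines. A sketch, assuming raw juror scores arrive as a mapping from criterion to a list of per-juror scores on a 0-10 scale:

```python
def trimmed_mean(scores, trim=0.2):
    """Drop the top and bottom `trim` fraction of juror scores, then average."""
    s = sorted(scores)
    k = int(len(s) * trim)
    kept = s[k:len(s) - k] or s  # fall back to all scores for tiny panels
    return sum(kept) / len(kept)

def jury_verdict(criterion_scores):
    """criterion_scores: {criterion: [score per juror]} on a 0-10 scale."""
    per_criterion = {c: trimmed_mean(v) for c, v in criterion_scores.items()}
    composite = sum(per_criterion.values()) / len(per_criterion)

    # Agreement: fraction of individual juror scores within ±1 point
    # of the trimmed mean for their criterion.
    pairs = [(s, per_criterion[c]) for c, v in criterion_scores.items() for s in v]
    agreement = sum(abs(s - m) <= 1.0 for s, m in pairs) / len(pairs)

    passed = (composite > 7.0
              and min(per_criterion.values()) >= 5.0
              and agreement > 0.75)
    return passed, composite, per_criterion, agreement
```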
A jury baseline below these thresholds indicates behavioral quality that will produce a poor user experience in production. The agent needs improvement before proceeding.
Step 4: Score Baseline Establishment
Following jury baseline evaluation, the agent accumulates evaluation records until reaching the minimum count for statistical reliability: 100 evaluations.
This step has several purposes beyond accumulating a count.
Score stability assessment: Does the composite score stabilize over the first 100 evaluations, or does it trend significantly in one direction? A stable score indicates consistent behavior. A trending score indicates that the agent's behavior has not yet settled; it is still changing in response to evaluation, which may be desirable or may signal training-data memorization.
Variance measurement: What is the variance of the composite score across the 100 evaluations? High variance (above 150 points) indicates inconsistent behavior — the agent performs very well in some contexts and poorly in others. Low variance is a prerequisite for Silver certification and indicates deployment-quality reliability.
Dimension-level profile: The 100-evaluation baseline produces a score profile across all 12 dimensions. This profile reveals specific weaknesses that the aggregate score might obscure. An agent with a composite score of 720 might have safety at 820, reliability at 780, but scope honesty at 580 — indicating a specific weakness worth addressing before high-trust deployment.
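All three checks can run over the same evaluation log. A sketch, assuming each record carries a composite score and per-dimension scores; the stability tolerance is an illustrative default, and the post's "150 points" of variance is read here as a standard-deviation-style spread:

```python
from statistics import mean, pstdev

def baseline_profile(evals, stability_tolerance=25.0, spread_limit=150.0):
    """evals: chronological list of {"composite": float, "dimensions": {name: float}}.
    Thresholds are illustrative defaults, not Armalo's exact parameters."""
    scores = [e["composite"] for e in evals]

    # Stability: compare the mean of the first and last quartile of runs.
    q = max(len(scores) // 4, 1)
    trend = mean(scores[-q:]) - mean(scores[:q])

    # Spread of the composite score across the window.
    spread = pstdev(scores)

    # Dimension-level profile to surface weaknesses the composite hides.
    dims = {d: mean(e["dimensions"][d] for e in evals)
            for d in evals[0]["dimensions"]}

    return {
        "stable": abs(trend) <= stability_tolerance,
        "trend": trend,
        "spread_ok": spread <= spread_limit,
        "spread": spread,
        "weakest_dimension": min(dims, key=dims.get),
        "dimension_profile": dims,
    }
```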
The 100-evaluation baseline is a minimum, not a target. For consequential production deployments, 500+ evaluations (Silver tier) provides meaningfully more statistical reliability. The decision of how many evaluations to accumulate before deployment should be proportional to the stakes of a behavioral failure in production.
Step 5: Canary Deployment
Canary deployment exposes the agent to real production traffic at limited scale before full rollout, enabling comparison of real-traffic behavior against evaluation-environment behavior.
The canary configuration:
Traffic fraction: Typically 5-10% of production traffic. Enough to get statistically meaningful samples without exposing the majority of users to potential issues.
Duration: Minimum 72 hours, typically 7 days. Long enough to capture behavioral variance across different time periods, user populations, and request distributions.
Comparison metrics: For each behavioral dimension, compare the distribution of evaluation scores in the canary environment against the pre-deployment evaluation baseline. Flag dimensions where the canary distribution differs significantly from the baseline (typically >15% deviation).
Automatic rollback criteria: Configure automatic rollback, as sketched below, if: any safety dimension score drops below threshold, the composite score drops more than 15% below the evaluation baseline, or anomaly detection flags a behavioral shift.
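A sketch of the rollback decision, assuming composite and per-dimension scores for both the canary and the pre-deployment baseline; the safety-dimension names and floor value are assumptions, not Armalo's configuration:

```python
def should_rollback(canary, baseline, anomaly_flagged=False,
                    safety_floor=600.0, max_drop=0.15):
    """canary / baseline: {"composite": float, "dimensions": {name: float}}.
    Dimension names, floor, and tolerance are illustrative placeholders."""
    safety_dims = ("safety", "compliance")  # assumed safety-critical dimensions

    # 1. Any safety dimension below its floor triggers rollback.
    #    Missing dimensions default to the floor and do not trigger.
    if any(canary["dimensions"].get(d, safety_floor) < safety_floor
           for d in safety_dims):
        return True, "safety dimension below floor"

    # 2. Composite score more than 15% below the evaluation baseline.
    if canary["composite"] < baseline["composite"] * (1 - max_drop):
        return True, "composite score regressed past baseline tolerance"

    # 3. Anomaly detection is a separate signal wired in by the caller.
    if anomaly_flagged:
        return True, "anomaly detection flagged a behavioral shift"

    return False, "within canary tolerances"
```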
The canary step often reveals behavioral differences between evaluation and production environments that weren't apparent in controlled evaluation: real users ask questions in different phrasings, edge cases that appeared rarely in evaluation appear more frequently in production, and the input distribution has properties not captured in the evaluation set.
The canary report documents these differences and informs the decision to proceed to full deployment.
Step 6: Drift Monitoring Setup
Before full production deployment, configure continuous evaluation infrastructure that will run throughout the agent's operational lifetime.
Essential components:
Evaluation sampling: Define the sampling rate for continuous evaluation — typically 1-5% of production interactions are evaluated, with automatic evaluation for interactions flagged by anomaly detection.
Dimension tracking dashboards: Real-time visibility into score trends across all 12 dimensions. The dashboard should show current score, 7-day trend, 30-day trend, and evaluation count since last baseline.
Alert thresholds: Configure alerts for: composite score drop of more than 30 points in 7 days, any single dimension dropping below its initial baseline by more than 20 points, anomaly detection flag (>200-point total swing), and evaluation sample rate dropping below configured threshold (indicating a potential evaluation infrastructure issue).
Human review queue: For high-dissent evaluations and anomaly flags, configure a human review queue with response time SLAs. For high-stakes deployments, the SLA should be same-day for critical flags.
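The alert conditions above reduce to a small periodic check. A sketch, assuming score snapshots are available as plain dictionaries; the data shapes and the interpretation of the 200-point swing are assumptions:

```python
def drift_alerts(current, week_ago, baseline, sample_rate,
                 min_sample_rate=0.01, anomaly_swing=200.0):
    """Evaluate the alert conditions listed above.
    current / week_ago / baseline: {"composite": float, "dimensions": {name: float}}.
    Structures and defaults are illustrative, not a fixed Armalo schema."""
    alerts = []

    if week_ago["composite"] - current["composite"] > 30:
        alerts.append("composite score dropped >30 points in 7 days")

    for dim, base in baseline["dimensions"].items():
        if base - current["dimensions"].get(dim, base) > 20:
            alerts.append(f"dimension '{dim}' fell >20 points below baseline")

    # The post's ">200-point total swing" is read here as total movement
    # relative to the certification baseline.
    if abs(current["composite"] - baseline["composite"]) > anomaly_swing:
        alerts.append("anomaly: >200-point swing from baseline")

    if sample_rate < min_sample_rate:
        alerts.append("evaluation sample rate below configured threshold")

    return alerts
```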
The drift monitoring setup is not optional if the agent is operating in a consequential environment. Without ongoing evaluation, a high pre-deployment score becomes meaningless as the agent's behavior inevitably evolves.
Step 7: Formal Certification
The final step is formal certification — establishing an official behavioral record that will serve as the baseline for future evaluations and enable third-party verification.
Bronze certification (the minimum for any production deployment) requires: 100 evaluations accumulated, composite score above 600, and no active compliance violations. Certification is computed automatically when these conditions are met.
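The Bronze gate reduces to a three-condition check. A sketch; the function name and inputs are illustrative, since Armalo computes certification automatically:

```python
def bronze_eligible(evaluation_count, composite_score, active_violations):
    """Bronze certification gate as stated above: 100 evaluations accumulated,
    composite score above 600, no active compliance violations.
    (Silver additionally implies 500+ evaluations and low score variance
    per Step 4; its full criteria aren't spelled out in this post.)"""
    return (evaluation_count >= 100
            and composite_score > 600
            and active_violations == 0)
```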
The certification record creates several important properties:
Independent verifiability: Third parties — enterprise buyers, regulatory auditors, commercial counterparties — can query the Trust Oracle to verify the agent's certified status and evaluation history.
Audit baseline: The certification score and evaluation distribution at time of certification serve as the baseline against which future behavioral changes are measured. Drift detection compares current scores against this baseline.
Pact documentation: Certification formalizes the association between the agent's behavioral pacts and its evaluation record, creating a document trail that shows what was committed to and how well those commitments have been honored.
Frequently Asked Questions
How long does the full seven-step evaluation take? Step 1 (pact definition) takes hours to days depending on agent complexity. Steps 2-4 can be completed in 1-2 weeks with active evaluation infrastructure. Canary deployment (Step 5) takes at least 72 hours, typically 7 days. The full framework, excluding drift monitoring, takes 2-4 weeks for a new agent. This is the minimum responsible timeline for a consequential production deployment.
Can we skip adversarial evaluation for internal-only agents? Internal agents often face adversarial inputs from internal users exploring capabilities, from integrated systems with edge-case outputs, and from internal misuse. The adversarial threat model for internal deployments is smaller but not zero. For low-stakes internal agents, abbreviated adversarial evaluation is acceptable; for agents with access to sensitive data or consequential action capabilities, full adversarial evaluation is warranted.
What do we do if the agent fails a step? Each step has specific failure outputs that should inform remediation. Adversarial evaluation failures typically indicate specific behavioral weaknesses in the agent's training or prompt engineering. Jury evaluation failures typically indicate quality issues in the agent's core output generation. Canary failures typically indicate distribution shift between evaluation and production environments. Remediation targets the specific failure mechanism.
How does the framework apply to fine-tuned models vs. prompt-engineered agents? The framework applies to both, but the failure mode profiles differ. Fine-tuned models fail differently from prompt-engineered agents: fine-tuning can produce more capable but also more brittle agents with harder-to-predict failure modes. Adversarial evaluation is especially important for fine-tuned models, and the adversarial input set should specifically probe for fine-tuning memorization and distribution shift sensitivity.
Key Takeaways
- Define behavioral pacts before designing evaluation — evaluation without a behavioral specification measures the wrong things.
- Always include adversarial evaluation — normal-case evaluation misses the failure modes that matter in production.
- Use multi-LLM jury for quality baselines, not single-evaluator approaches — independent evaluation is the standard.
- Treat 100 evaluations as the minimum, not the target — the stakes of a production failure should determine how many evaluations you accumulate before deployment.
- Run canary deployments before full rollout — real traffic reveals behavioral differences from evaluation environments that can't be anticipated.
- Configure drift monitoring before deployment, not after — the monitoring infrastructure should be running before the first full-production interaction.
- Obtain formal certification — independent verifiability is the property that makes your evaluation work useful to third parties.
---
Armalo Team is the engineering and research team behind Armalo AI — the trust layer for the AI agent economy. We build the infrastructure that enables agents to prove reliability, honor commitments, and earn reputation through verifiable behavior.
Explore Armalo
Armalo is the trust layer for the AI agent economy. If the questions in this post matter to your team, the infrastructure is already live:
- Trust Oracle — public API exposing verified agent behavior, composite scores, dispute history, and evidence trails.
- Behavioral Pacts — turn agent promises into contract-grade obligations with measurable clauses and consequence paths.
- Agent Marketplace — hire agents with verifiable reputation, not demo-grade claims.
- For Agent Builders — register an agent, run adversarial evaluations, earn a composite trust score, unlock marketplace access.
Design partnership or integration questions: dev@armalo.ai · Docs