The 73% Cold-Start Failure Rate: Why New Agents Fail on First Contact (and What Fixes It)
Seventy-three percent of newly deployed AI agents fail their first production-quality evaluation. This is not a model quality problem; it is a structural problem with how agents are designed, tested, and deployed. Here is the complete breakdown: six root causes, the pass^k compounding effect that turns 70% task pass rates into 5.8% workflow success rates, and the eight-step protocol the 27% who pass on first contact follow consistently.
The Number That Should Stop You Before You Deploy
Seventy-three percent. That is the fraction of newly registered AI agents that fail their first production-quality evaluation on Armalo's platform before being cleared for deployment. Not a toy benchmark. Not a cherry-picked demo set. A representative production eval: 100+ task types, three independent seeds, adversarial cases included.
Roughly three out of four agents, built by capable teams, tested by their developers, and approved internally, do not pass on their first attempt when measured rigorously against production conditions.
If you are about to deploy a new agent, that number is the most important thing you will read today.
This post explains exactly where that 73% comes from, why cold-start failure is structurally predictable (not random), the six specific root causes that account for most failures, and the protocol that the 27% who pass on first contact consistently follow. By the end, you will have a concrete checklist, a diagnosis framework for identifying which failure mode is affecting your specific agent, and a five-stage remediation path if you are already in production and seeing degraded performance.
We will also cover the mathematics that make this situation far more dangerous than it first appears: the pass^k compounding effect, which transforms a seemingly acceptable 70% single-task pass rate into a roughly 5.8% end-to-end success rate in an eight-step agentic workflow.
Part 1: Where the 73% Comes From
Methodology and Sourcing
The 73% figure is not from a single study. It triangulates across several independent data sources, all pointing in the same direction.
The McKinsey lens. McKinsey's 2024 Global Survey on AI reports that 72% of AI deployments fail to meet initial performance expectations when moved from proof of concept to production. This is the closest external datapoint to cold-start failure at the deployment level. The mechanism McKinsey identifies: the gap between development environment performance and production environment performance, driven by distribution shift, integration complexity, and the absence of robust pre-production evaluation.
The Gartner abandonment signal. Gartner's 2024 research states that through 2025, at least 30% of generative AI projects will be abandoned after proof of concept. This understates the cold-start problem because it only captures outright abandonment — not the far more common situation of a failing agent that is kept running while teams scramble to fix it, burning operational budget and credibility in the process. Gartner's 30% abandonment rate is the visible tip; the hidden costs of agents that struggle through cold-start rather than being cleanly abandoned are larger.
The a16z time-to-performance signal. Andreessen Horowitz's State of AI 2023 found that most teams take 6–18 months to get production AI deployments to acceptable performance levels. That time-to-performance gap is a direct expression of cold-start failure playing out slowly: initial deployment is below acceptable performance, and remediation takes months. The implication is that most initial deployments are underperforming; they just aren't all killed immediately.
The Armalo empirical datapoint. Across agents registered on Armalo's platform and run through baseline production-quality evaluation before receiving a trust score, approximately 73% fail to pass on first attempt and require at least one remediation cycle before deployment approval. This is the most precise and most directly relevant number because it measures exactly what we mean by "cold-start failure": the first time the agent is put against a real production eval set, does it pass?
What "Cold-Start Failure" Means Precisely
Definition matters here because sloppy use of this term leads to sloppy fixes.
Cold-start failure is not:
- A model being fundamentally incapable of a task
- A system error or infrastructure problem
- A deliberate limitation (like a free-tier capability restriction)
- Failure on an adversarial jailbreak attempt
Cold-start failure IS:
- An agent that performs acceptably on developer-controlled test cases failing when exposed to production-representative task distributions for the first time
- The gap between "it worked in our testing" and "it works in production"
- Specifically: failing a rigorous evaluation (100+ tasks, 3 independent seeds, adversarial cases) before first production deployment
The word "cold" is intentional. It references the cold-start problem from recommendation systems and distributed databases: a system that has no history, no accumulated context, no real-world calibration. Every agent starts cold. The question is whether the team has done the work before deployment to simulate what production will demand.
Why 73% Is Not Random — It Is Structurally Predictable
If cold-start failure were random — if some agents failed due to bad luck — you would expect the failure distribution to look like noise: some kinds of agents failing, some succeeding, no clear pattern.
That is not what the data shows. Cold-start failure clusters heavily around six specific failure patterns. Agents built in ways that create these patterns fail. Agents built to avoid them pass. This is the key insight: cold-start failure is not a quality lottery. It is an engineering problem with known causes and known fixes.
Let us go through each one.
Part 2: The Six Structural Reasons New Agents Fail
Reason 1: The Distribution Mismatch
What it is. Every agent is designed and tested against a specific task distribution — the developer's mental model of what the agent will be asked to do. Production is always a superset of that mental model. It includes edge cases the developer did not anticipate, adversarial users probing boundaries, unusual phrasings of common requests, domain-specific vocabulary the developer has not encountered, and entire task sub-categories that exist in the real world but not in the developer's imagination.
The technical framing: the KL divergence between the design distribution and the production distribution is zero only when the two are identical, and in practice they never are. The question is not whether they differ. The question is how large the gap is and whether the agent was built to handle it.
The numbers. Typical gap: a developer's test set covers 20–50 distinct task types within a nominal category. Production usage within the same nominal category involves 200–500 distinct task types. A customer service agent for a SaaS product, for example, might be designed against a 30-scenario test set. Production users will generate 300+ distinct question patterns within the first month.
An agent that performs at 90% on the 30-scenario test set may perform at 60–65% across the full 300-scenario production distribution. The developer sees 90%; the first production week reveals 60%. That gap is the distribution mismatch.
Why this happens. Developers write test sets from memory and experience. They test scenarios they have personally encountered or anticipated. This is cognitively efficient but statistically biased: it systematically underrepresents the long tail of production usage. The long tail is exactly where agents tend to fail because it is precisely what was not designed for.
There is also a subtler version of this problem: the test set contains the right scenarios but in the wrong proportions. A developer who expects 70% of traffic to be straightforward queries and 30% to be complex ones tests against that ratio. Production turns out to be 40% straightforward, 40% complex, and 20% completely novel query types. The agent trained for the 70/30 distribution performs differently on the 40/40/20 distribution even if every individual scenario type was represented.
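A minimal sketch of the proportion effect described above. The traffic mixes are the ones from the paragraph; the per-type pass rates are hypothetical numbers chosen only to make the arithmetic visible, and `blended_pass_rate` is an illustrative helper, not part of any platform.

```python
def blended_pass_rate(mix, pass_rate_by_type):
    """Traffic-weighted pass rate: sum over task types of share * pass rate."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * pass_rate_by_type[task_type] for task_type, share in mix.items())

# Hypothetical per-type pass rates (assumption, not measured data).
pass_rate_by_type = {"straightforward": 0.95, "complex": 0.75, "novel": 0.40}

# Mixes taken from the text: what the developer expected vs. what production delivered.
expected_mix = {"straightforward": 0.70, "complex": 0.30, "novel": 0.00}
production_mix = {"straightforward": 0.40, "complex": 0.40, "novel": 0.20}

expected = blended_pass_rate(expected_mix, pass_rate_by_type)
production = blended_pass_rate(production_mix, pass_rate_by_type)
print(f"expected mix: {expected:.1%}, production mix: {production:.1%}")
```

With these assumed per-type rates, the same agent scores 89% under the expected mix and 76% under the production mix, even though its behavior on each task type is identical.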
The diagnosis. If an agent is failing a production eval at high rates on tasks that "seem like" the agent should handle them, distribution mismatch is the likely cause. Concretely: look at the tasks where the agent fails and ask whether they are genuinely out-of-scope or whether they are within-scope but outside the test distribution. If it is the latter, distribution mismatch.
The fix: production-representative eval sets.
Stop using synthetic benchmarks created from imagination. Use production-representative eval sets created from real usage data.
The practical method:
- If you have an existing deployment (even limited), sample 500 real production queries. Cluster them into semantic groups. Sample proportionally from those groups for your eval set.
- If you have no existing deployment, find the closest proxy: an existing agent doing a similar task, usage logs from a manual workflow the agent is replacing, or a public dataset from the relevant domain.
- Minimum eval set size: 100 tasks. Preferred: 500+ tasks. Below 100 tasks, variance is too high for meaningful measurement.
- Three independent seeds: run the eval three times with different random seeds. If performance varies by more than 10 points across seeds, the agent is not reliable — it is getting lucky on specific eval instances.
- Include the long tail explicitly: oversample edge cases and unusual phrasings relative to their real-world frequency. The agent needs to handle them; you need to know if it can.
Production-representative eval sets are the single highest-leverage investment in cold-start prevention. Teams that build them before deployment pass at significantly higher rates.
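The sampling method above can be sketched roughly as follows, assuming the production queries have already been clustered into semantic groups by some other step. The function names, the `min_per_cluster` parameter, and the example clusters are all hypothetical.

```python
import random

def sample_eval_set(queries_by_cluster, total, seed, min_per_cluster=1):
    """Draw about `total` queries, giving each cluster slots in proportion to its traffic."""
    rng = random.Random(seed)
    population = sum(len(queries) for queries in queries_by_cluster.values())
    sample = []
    for queries in queries_by_cluster.values():
        # The floor keeps rare clusters (the long tail) from being dropped entirely.
        quota = max(min_per_cluster, round(total * len(queries) / population))
        sample.extend(rng.sample(queries, min(quota, len(queries))))
    return sample

def seed_spread_points(pass_rates):
    """Gap between the best and worst seed, in percentage points."""
    return (max(pass_rates) - min(pass_rates)) * 100

# Hypothetical clustered production queries.
clusters = {
    "billing": [f"billing query {i}" for i in range(300)],
    "login": [f"login query {i}" for i in range(150)],
    "rare edge cases": [f"edge query {i}" for i in range(50)],
}
eval_set = sample_eval_set(clusters, total=100, seed=0)
print(len(eval_set))  # 60 + 30 + 10 slots for these cluster sizes
print(seed_spread_points([0.82, 0.79, 0.91]) > 10)  # True: this agent is not reliable yet
```

Raising `min_per_cluster` is one simple way to oversample the long tail relative to its real-world frequency.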
Reason 2: The Context Collapse Problem
What it is. An agent built for one operational context — specific user type, specific vocabulary, specific interaction norms — gets deployed in a context that is slightly but meaningfully different. The system prompt either does not specify the operational context precisely enough, or it specifies the wrong context.
The mechanism. Large language models are exquisitely sensitive to framing. "The user is a software developer" and "the user is a business analyst" produce measurably different optimal response styles for the same factual question. "You are a customer service agent for a B2B SaaS company" and "You are a customer service agent for a B2C consumer app" produce different optimal behaviors even if the tasks are nominally the same.
When the agent's operational context declaration does not match the actual deployment context, the agent is continuously operating under a false model of who it is talking to and what they need. Every response is calibrated for the wrong audience.
This is called context collapse: the rich context of the actual deployment environment is not represented in the agent's self-model, so it operates as if it exists in a generic context.
Common manifestations:
- Agent uses technical vocabulary appropriate for expert users with a non-technical audience
- Agent assumes formal communication style in an environment where casual communication is expected (or vice versa)
- Agent provides detail levels appropriate for a different user sophistication level
- Agent applies norms from one industry domain (e.g., fintech compliance language) in a different domain (e.g., consumer health)
- Agent assumes prior context ("as we discussed") when users are interacting for the first time
The diagnosis. Context collapse shows up as systematic stylistic or tone failures rather than factual failures. The agent gets facts right but fails on appropriateness, register, or level of detail. If failure patterns show the agent being consistently too technical / too casual / too formal / too brief / too verbose in ways that track with the deployment environment, context collapse is the likely cause.
The fix: behavioral pacts with explicit context declaration.
A behavioral pact is a formal declaration of the agent's operational context, not just its capabilities. It answers:
- Who is the user? (Role, technical sophistication, relationship to the organization)
- What is the operational environment? (Industry, regulatory context, communication norms)
- What are the explicit constraints on appropriate responses? (Length, formality, vocabulary level)
- What is the success criterion for a response in this context? (Not just "correct" but "correct and appropriate for this audience")
The pact is not a system prompt enhancement — it is a separately maintained artifact that makes the operational context inspectable and auditable. When the deployment context changes (e.g., the agent is redeployed from B2B to B2C use cases), the pact must be updated and the agent re-evaluated against the new context.
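One way to keep the operational context inspectable is to hold it as structured data next to the agent rather than inside the prompt. The sketch below is an illustration under that assumption; the class and field names are hypothetical and do not represent Armalo's pact schema.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class OperationalContext:
    user_role: str
    technical_sophistication: str
    industry: str
    communication_norms: str

@dataclass(frozen=True)
class BehavioralPact:
    agent_name: str
    context: OperationalContext
    max_response_words: int
    formality: str
    success_criterion: str

def needs_reevaluation(pact: BehavioralPact, deployed_context: OperationalContext) -> bool:
    """True when the live deployment context no longer matches the declared one."""
    return pact.context != deployed_context

b2b = OperationalContext("IT administrator", "expert", "B2B SaaS", "formal, concise")
pact = BehavioralPact("support-agent", b2b, 200, "formal",
                      "correct and appropriate for an expert B2B audience")

# Redeploying the same agent to a consumer audience changes the context.
b2c = replace(b2b, user_role="consumer", technical_sophistication="novice",
              industry="B2C consumer app", communication_norms="casual, friendly")
print(needs_reevaluation(pact, b2b), needs_reevaluation(pact, b2c))
```

Because the dataclasses are frozen and compared by value, any change to the declared context shows up as a mismatch that can trigger a pact update and a fresh evaluation.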
Armalo's pact framework enforces this: an agent cannot receive a trust score without a registered pact, and pacts require explicit operational context declaration. This is not bureaucracy — it is the mechanism that catches context collapse before production.
Reason 3: The Evaluation Theater Problem
What it is. The developer tests the agent on 10–20 carefully selected examples. The agent achieves 90–100% on those examples. The developer concludes the agent is ready for production. The first real evaluation — 100+ diverse tasks sampled from the production distribution — reveals actual performance of 60–70%.
This is evaluation theater: the appearance of rigorous evaluation without the substance.
Why it happens. Evaluation is cognitively expensive. Writing good test cases requires anticipating failure modes, which is difficult before you have seen them. Developers default to testing the happy path and the cases they care most about. They also, unconsciously, tend to stop adding test cases when the agent starts passing them — which creates a selection bias where the test set is an easy version of the real problem.
There is also an incentive problem: the developer wants the agent to be ready. A small test set that returns high pass rates confirms readiness. A large, representative test set that returns lower pass rates creates work. The unconscious incentive is toward smaller test sets.
The numbers. An agent optimized to a 20-example test set typically achieves 90–100% on that set. The same agent evaluated on a 100-example production-representative set typically achieves 60–75%. The gap — 15 to 40 points — is entirely attributable to test set size and representativeness. The agent's capability did not change; what changed is the measurement's fidelity.
The diagnosis. Evaluation theater is the likely cause when an agent has high developer-reported pass rates but fails the first independent evaluation. The signature: very high performance on developer-controlled tests, significantly lower performance when an independent evaluator runs a fresh eval set.
The fix: evaluation hygiene.
Three rules that eliminate most evaluation theater:
Rule 1: Minimum 100 tasks, three independent seeds. Below 100 tasks, measurement variance is too high. Three independent seeds (three separate evaluation runs with different random sampling) give you a performance range, not just a point estimate. If the range is wide (more than 10 points), the measurement is noisy and cannot be trusted.
Rule 2: The developer never evaluates their own agent. The developer who built the agent should not be the one running the evaluation. This is a direct analogy to software testing: you do not test your own code in isolation. Use an independent evaluator — a colleague, an automated evaluation system, or a formalized third-party eval. On Armalo, baseline evals are run by the platform's evaluation engine, not by the agent's developer. This is by design.
Rule 3: The eval set must not overlap with the training/tuning set. If you used examples to tune the agent's system prompt, those examples cannot appear in the evaluation set. The evaluation must measure generalization, not memorization. The 100 evaluation tasks must be drawn from a distribution the agent has never seen during development.
These three rules are not novel. They are standard practice in machine learning. They are routinely violated in AI agent deployment because the field has not yet converged on rigorous evaluation norms. Until those norms are enforced, evaluation theater will continue to produce the 73% failure rate.
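Rules 1 and 3 lend themselves to an automated check; Rule 2 is organizational and cannot be verified in code. The sketch below is one possible check, with hypothetical function names. It detects only verbatim overlap after whitespace and case normalization, so paraphrased leakage would need a semantic comparison that is not shown here.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences do not hide a leak."""
    return " ".join(text.lower().split())

def hygiene_violations(eval_tasks, tuning_examples, seed_pass_rates,
                       min_tasks=100, min_seeds=3, max_spread_points=10.0):
    """Return a list of human-readable problems; an empty list means Rules 1 and 3 hold."""
    problems = []
    if len(eval_tasks) < min_tasks:
        problems.append(f"eval set has {len(eval_tasks)} tasks, below the minimum of {min_tasks}")
    if len(seed_pass_rates) < min_seeds:
        problems.append(f"only {len(seed_pass_rates)} seed run(s), need {min_seeds}")
    else:
        spread = (max(seed_pass_rates) - min(seed_pass_rates)) * 100
        if spread > max_spread_points:
            problems.append(f"seed spread of {spread:.1f} points exceeds {max_spread_points}")
    leaked = {normalize(t) for t in eval_tasks} & {normalize(t) for t in tuning_examples}
    if leaked:
        problems.append(f"{len(leaked)} eval task(s) also appear in the tuning set")
    return problems

eval_tasks = [f"task {i}" for i in range(100)]
print(hygiene_violations(eval_tasks, ["Task  7"], [0.90, 0.91, 0.89]))
```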
Reason 4: No Adversarial Testing
What it is. The developer tests happy paths: well-formed input, cooperative user, clear task. Production immediately includes users who push boundaries, probe for weaknesses, attempt to use the agent for purposes outside its design, or simply interact with it in unexpected ways.
Without adversarial pre-testing, the first adversarial user is an unplanned penetration test conducted in production.
The impact. Armalo internal data: agents with no adversarial pre-testing have a 3.2× higher first-week incident rate than agents that passed a minimum 20-case adversarial test suite before production. The incidents are not always severe — many are agent refusals that frustrate users, inappropriate responses that generate complaints, or edge-case behaviors that break downstream workflows. But they concentrate in the first week precisely because cold-start is when the agent first encounters the adversarial distribution.
What adversarial testing actually covers. This is not just about security jailbreaks (though those matter). Adversarial test cases include:
- Ambiguous inputs: questions that could be interpreted multiple ways, where the wrong interpretation produces the wrong answer
- Conflicting instructions: user input that contradicts the agent's system prompt, testing which instruction takes precedence
- Out-of-scope requests: tasks clearly outside the agent's remit, testing whether the agent correctly declines or incorrectly attempts
- Adversarial phrasings: the same task expressed in ways designed to confuse (double negatives, unusual syntax, implicit assumptions)
- Stress cases: extreme inputs (very long queries, very short queries, queries with no clear task)
- Behavioral probes: attempts to elicit information the agent should not provide, or behaviors it should not exhibit
The diagnosis. High incident rates in the first two weeks, clustered around edge cases or unusual user behaviors, point to adversarial preparation failure. The specific pattern: agent passes standard evals but generates complaints or incidents on unusual interactions.
The fix: minimum 20 adversarial cases before production.
20 cases is the minimum viable adversarial suite. It is not comprehensive, but it catches the most common adversarial failure modes and forces the developer to think through the agent's behavior boundaries.
The 20 cases should cover:
- 5 out-of-scope refusal cases: can the agent correctly identify and decline tasks outside its scope?
- 5 ambiguity cases: can the agent handle inputs that have multiple valid interpretations?
- 5 behavioral boundary cases: inputs that probe for inappropriate responses (not necessarily security-focused — just behaviors the agent should not exhibit)
- 5 stress cases: extreme inputs that test robustness
For agents in high-consequence environments (financial services, healthcare, legal), the minimum adversarial suite is 50 cases, not 20.
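The suite composition above can be checked mechanically before sign-off. This is a sketch assuming each adversarial case is tagged with one of four category labels; the labels and the `coverage_gaps` helper are hypothetical.

```python
from collections import Counter

# Minimum cases per category, from the 5/5/5/5 breakdown in the text.
REQUIRED_PER_CATEGORY = {
    "out_of_scope": 5,
    "ambiguity": 5,
    "behavioral_boundary": 5,
    "stress": 5,
}

def coverage_gaps(cases, high_consequence=False):
    """Return how many cases are still missing, per category and in total."""
    counts = Counter(case["category"] for case in cases)
    gaps = {category: need - counts[category]
            for category, need in REQUIRED_PER_CATEGORY.items()
            if counts[category] < need}
    minimum_total = 50 if high_consequence else 20
    if len(cases) < minimum_total:
        gaps["total"] = minimum_total - len(cases)
    return gaps

suite = [{"category": category, "prompt": f"{category} case {i}"}
         for category in REQUIRED_PER_CATEGORY for i in range(5)]
print(coverage_gaps(suite))                         # {} for a standard deployment
print(coverage_gaps(suite, high_consequence=True))  # 30 more cases needed
```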
Armalo's eval engine includes an adversarial case generator that produces relevant adversarial cases from the agent's pact definition. This removes the burden of crafting adversarial cases from scratch and ensures systematic coverage of the failure mode space.
Reason 5: The Cold-Start Memory Problem
What it is. New agents have no accumulated context. Every task starts from scratch. But production performance of persistent-memory agents improves significantly as they accumulate operational experience — patterns of user behavior, common failure modes, domain-specific vocabulary corrections, and calibrated response strategies.
The cold-start memory problem is that the agent is evaluated and deployed at its weakest point: before it has accumulated any of this operational intelligence.
The numbers. Persistent-memory agents after the first 100 production interactions perform 15–30% better than they did at task 1. This is not model fine-tuning — the underlying model is unchanged. It is the memory accumulation effect: the agent's operational context becomes progressively richer, enabling more calibrated responses.
The implication: the first 50–100 production tasks are effectively an in-production training phase, conducted on real users, with real consequences.
Why this is a cold-start problem. The evaluation that determines whether an agent is production-ready is conducted on a memory-empty agent. The agent that passes that eval is the weakest version of itself. If the agent clears the pass bar at memory=0, it only improves from there. That is acceptable.
But there is a more common failure mode: the agent does not pass the eval at memory=0, gets remediated, passes after remediation, and then is deployed. The post-remediation agent is calibrated for the eval distribution but not for production distribution. It starts cold in production, and the first 100 tasks are both the memory accumulation phase and the highest-risk operational phase.
The diagnosis. Performance that starts below acceptable and improves over the first two weeks without any code changes points to the cold-start memory problem. The agent is accumulating context and improving. This is expected and benign — but it means the cold-start window is high-risk, and it should not be used for high-consequence tasks.
The fix: memory pre-seeding and warm-up protocols.
Memory pre-seeding. Before deploying an agent to production, seed its memory with representative operational examples. This does not require real production data — synthetic examples constructed to represent the likely operational distribution work nearly as well. The goal is to give the agent a starting memory state that reflects operational reality rather than nothing.
Effective pre-seeding includes:
- 20–50 example interactions covering the most common task types (drawn from the production-representative eval set)
- Known failure modes and their correct handling (so the agent starts with a map of where it tends to fail)
- Domain-specific vocabulary and entity references the agent will encounter
- Calibration examples for tone, detail level, and response length appropriate to the operational context
Warm-up protocols. Formally designate the first 100 production tasks as the warm-up phase. Communicate to operators that the agent is in warm-up and should not handle the highest-consequence tasks until warm-up is complete. Increase monitoring density during warm-up. Reduce automation scope (higher human review rate) until warm-up completes.
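A warm-up protocol like this can be enforced with a small routing gate. The sketch below assumes tasks arrive already labeled as high-consequence or not; the class name and the two review rates are hypothetical placeholders, not recommended values.

```python
from dataclasses import dataclass

WARM_UP_TASKS = 100  # the first 100 production tasks, per the protocol above

@dataclass
class WarmUpGate:
    completed_tasks: int = 0
    warm_up_review_rate: float = 0.5    # hypothetical: share of outputs sent to human review
    steady_review_rate: float = 0.05    # hypothetical: review share after warm-up

    @property
    def in_warm_up(self) -> bool:
        return self.completed_tasks < WARM_UP_TASKS

    def route(self, high_consequence: bool) -> str:
        """Keep the highest-consequence tasks away from the agent until warm-up completes."""
        return "human" if self.in_warm_up and high_consequence else "agent"

    def review_rate(self) -> float:
        return self.warm_up_review_rate if self.in_warm_up else self.steady_review_rate

    def record_completion(self) -> None:
        self.completed_tasks += 1

gate = WarmUpGate()
print(gate.route(high_consequence=True), gate.review_rate())
for _ in range(WARM_UP_TASKS):
    gate.record_completion()
print(gate.route(high_consequence=True), gate.review_rate())
```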
Armalo's platform provides optional memory pre-seeding using the agent's pact definition and evaluation results to generate representative pre-seed content. This reduces the effort burden on deployment teams and systematizes the pre-seeding process.
Reason 6: The Model Version Mismatch
What it is. An agent is developed and tested against a specific model version, say gpt-4-turbo-2024-04-09 or claude-3-5-sonnet-20241022. It is deployed using a floating alias (e.g., gpt-4-turbo, claude-3-5-sonnet-latest). The provider updates what that alias points to. The agent's behavior changes without any deployment action.
The evidence. OpenAI has silently changed what gpt-4, gpt-4-turbo, and gpt-4o resolve to multiple times since their introduction. Anthropic similarly rotates what -latest aliases point to as new versions are released. Google's Gemini API does the same. This is explicitly documented by all providers — they reserve the right to update alias targets — but in practice, developers rarely pin specific versions, and providers do not push change notifications to all users when aliases are updated.
The result: an agent that passes evaluation on the version it was developed against may fail after a silent provider alias update. The agent did not change. The model changed. The evaluation was not re-run. The first indication something is wrong is degraded production performance.
What model updates change. Model version updates can change:
- Default verbosity and response length
- Reasoning approaches for ambiguous queries
- Safety filter calibration (different refusal thresholds)
- Instruction-following behavior (how strictly the model adheres to system prompt constraints)
- Formatting tendencies (how often the model uses bullet points, headers, code blocks)
- Domain knowledge (newer versions often have more recent training data)
Any of these changes can push an agent below acceptable performance thresholds if it was tightly calibrated to the previous version's behavior.
The diagnosis. Degraded performance that starts suddenly without any code or configuration change, correlated with known provider model update dates, is a strong indicator of model version mismatch. The pattern: consistent performance for weeks, then a performance cliff without any local change.
The fix: pin exact model versions in pacts, alert on mismatch.
Two complementary mechanisms:
Version pinning. Never use aliases in production agent deployments. Use explicit, dated model versions (gpt-4-turbo-2024-04-09, claude-3-5-sonnet-20241022). Establish a model version review cadence (monthly is typical) where new provider versions are evaluated against the agent's eval set before the pinned version is changed in production.
Pact-level model compliance. Armalo's pact framework includes a model-compliance dimension (5% of the composite trust score) that checks whether the deployed model version matches the model version declared in the pact. An agent using claude-3-5-sonnet-latest in production with the pact declaring claude-3-5-sonnet-20241022 will flag a model compliance warning. This is not a blocking condition by default — operators can waive it — but it ensures the mismatch is visible rather than silent.
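A mismatch check of this kind could look roughly like the sketch below. It is not Armalo's implementation. The date-suffix regex is a heuristic assumption: it recognizes identifiers ending in an 8-digit or YYYY-MM-DD stamp, and some providers use other snapshot naming schemes that it would misclassify as aliases.

```python
import re

# Heuristic (assumption): a pinned model ID ends in a date stamp such as
# -20241022 or -2024-04-09. Other snapshot naming schemes are not covered.
DATED_SUFFIX = re.compile(r"-(\d{8}|\d{4}-\d{2}-\d{2})$")

def is_pinned(model_id: str) -> bool:
    return bool(DATED_SUFFIX.search(model_id))

def model_compliance_warnings(pact_model: str, deployed_model: str) -> list:
    """Warnings only, mirroring the non-blocking behavior described above."""
    warnings = []
    if not is_pinned(deployed_model):
        warnings.append(f"deployed model '{deployed_model}' looks like an alias, not a dated version")
    if deployed_model != pact_model:
        warnings.append(f"deployed model '{deployed_model}' differs from pact model '{pact_model}'")
    return warnings

for warning in model_compliance_warnings("claude-3-5-sonnet-20241022",
                                         "claude-3-5-sonnet-latest"):
    print(warning)
```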
The practical recommendation: providers update aliases slowly enough that checking once per month before any scheduled maintenance window catches version drifts before they become incidents.
Part 3: The Pass^k Compounding Effect
The six failure reasons explain why individual agents fail. The pass^k problem explains why even agents that seem acceptable are actually dangerously inadequate for agentic workflows.
The Mathematics
In an agentic workflow, the agent must complete multiple sequential tasks to achieve an end goal. The end-to-end success probability is approximately:
P(end-to-end success) ≈ P(single-task success)^k
Where k is the number of sequential tasks in the workflow.
This is a simplification (tasks are not perfectly independent, and some failures are recoverable), but it captures the essential dynamic: individual task success probabilities compound multiplicatively. Errors accumulate.
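The approximation is easy to compute directly. The sketch below uses the same simplifying assumptions as the formula (independent steps, no recovery); `end_to_end_success` is just an illustrative name.

```python
def end_to_end_success(single_task_pass_rate: float, steps: int) -> float:
    """P(end-to-end success) under the independence approximation: p ** k."""
    return single_task_pass_rate ** steps

for p in (0.70, 0.90, 0.95, 0.98):
    three = end_to_end_success(p, 3)
    eight = end_to_end_success(p, 8)
    print(f"p={p:.2f}  3-step={three:.1%}  8-step={eight:.1%}")
```

For p = 0.70 this prints 34.3% for three steps and 5.8% for eight (0.0576 before rounding).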
What This Means for "Acceptable" Pass Rates
Consider an agent with a 70% single-task pass rate. This sounds mediocre but plausible for a complex task. In a three-task workflow:
P = 0.70^3 = 0.343 (34.3% end-to-end success)
In an eight-task workflow (a typical complex automated business process):
P = 0.70^8 = 0.0576 (5.76% end-to-end success)
Under six percent. A 70% single-task pass rate, which might seem like "getting it right most of the time," produces a 5.76% end-to-end success rate in an eight-step agentic workflow.
Now consider an agent at 90% single-task pass rate:
3-step: 0.90^3 = 72.9%
8-step: 0.90^8 = 43.0%
Better, but still: 43% end-to-end success means the agentic workflow fails more often than it succeeds. This is why human oversight remains essential for current-generation agents — even well-performing agents fail at the system level in multi-step workflows.
For a 95% single-task pass rate:
3-step: 0.95^3 = 85.7%
8-step: 0.95^8 = 66.3%
At 95%, the eight-step workflow succeeds about two thirds of the time. For critical business processes, 95% single-task pass rate should be the minimum for any workflow longer than 3–4 steps.
For 98%:
3-step: 0.98^3 = 94.1%
8-step: 0.98^8 = 85.1%
At 98%, the eight-step workflow succeeds 85% of the time. This is the operational excellence threshold for agentic workflows.
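The same relationship can be inverted to ask what single-task pass rate a target workflow success rate requires: p = target^(1/k). This is a sketch under the same independence approximation, with a hypothetical helper name.

```python
def required_single_task_rate(target_end_to_end: float, steps: int) -> float:
    """Invert p ** k = target to get the per-step pass rate needed."""
    return target_end_to_end ** (1 / steps)

for target in (0.50, 0.85, 0.95):
    needed = required_single_task_rate(target, steps=8)
    print(f"{target:.0%} end-to-end over 8 steps needs about {needed:.1%} per step")
```

For an 85% end-to-end target over eight steps the required per-step rate comes out at roughly 98%, which is consistent with the 98% row above.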
The Cold-Start Implication
New agents almost never achieve 95%+ on their first production eval. The distribution of first-eval pass rates for agents registered on Armalo's platform:
- 50–60% single-task pass rate: ~20% of new agents (major remediation required)
- 60–70% single-task pass rate: ~30% of new agents (significant remediation required)
- 70–80% single-task pass rate: ~23% of new agents (moderate remediation required)
- 80–90% single-task pass rate: ~15% of new agents (targeted remediation required)
- 90–95% single-task pass rate: ~8% of new agents (minor tuning required)
- 95%+ single-task pass rate: ~4% of new agents (pass — no remediation required)
73% fall below 80%. For agentic workflows, even 80% is inadequate: 0.80^8 = 16.8% end-to-end success.
This is not a commentary on the capability ceiling of AI agents. It is a commentary on the state of pre-production preparation. Agents built with the cold-start prevention protocol (described in Part 4) achieve 95%+ at significantly higher rates on first eval. The capability is there. The preparation is what is missing.
The Volatility Effect
Pass^k compounding is about expected value. The variance effect compounds the problem.
An agent at 70% single-task pass rate with high variance (performance range: 55–85% across seeds) does more than yield a 5.76% expected end-to-end success rate. It produces unpredictable end-to-end failure: sometimes the workflow completes successfully, sometimes it fails at step 2, sometimes at step 7, with no clear pattern. Debugging an agentic workflow that fails probabilistically across different steps is orders of magnitude harder than debugging a workflow that fails consistently.
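Even with a fixed pass rate, failures are spread across steps. Under the same independence approximation used for pass^k, the chance that the first failure lands on step i is p^(i-1) × (1 − p). The sketch below tabulates this for p = 0.70 and eight steps; it does not model seed-to-seed variance, which would blur the picture further.

```python
def first_failure_distribution(p: float, steps: int) -> dict:
    """P(first failure at step i) = p**(i-1) * (1-p); 'completed' is p**steps."""
    distribution = {f"step {i}": p ** (i - 1) * (1 - p) for i in range(1, steps + 1)}
    distribution["completed"] = p ** steps
    return distribution

for outcome, probability in first_failure_distribution(0.70, 8).items():
    print(f"{outcome:>9}: {probability:.1%}")
```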
Low variance is almost as important as high pass rates for agentic workflows. An agent at 88% single-task pass rate with a seed-to-seed spread of ±2 points is operationally superior to an agent at 90% with a spread of ±15 points, because the former is predictable and the latter is not.
Target: 95%+ pass rate with a spread of ±5 points or less across independent seeds before deploying to multi-step agentic workflows.
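The target can be expressed as a simple gate. The sketch assumes one interpretation of the threshold: the mean across seeds must reach 95%, and no single seed may deviate from that mean by more than 5 points. Other readings (for example, standard deviation) are equally plausible; the function name is hypothetical.

```python
from statistics import mean

def ready_for_multistep(seed_pass_rates, min_mean=0.95, max_deviation=0.05, min_seeds=3):
    """Gate on both the average pass rate and the worst deviation from it."""
    if len(seed_pass_rates) < min_seeds:
        return False
    average = mean(seed_pass_rates)
    worst_deviation = max(abs(rate - average) for rate in seed_pass_rates)
    return average >= min_mean and worst_deviation <= max_deviation

print(ready_for_multistep([0.96, 0.95, 0.97]))  # high and stable
print(ready_for_multistep([0.99, 0.99, 0.85]))  # one bad seed drags it down
print(ready_for_multistep([0.90, 0.91, 0.89]))  # stable but too low
```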
Part 4: The 27% Success Protocol
The 27% of agents that pass production evaluation on first attempt consistently follow a recognizable protocol. It is not that they are smarter or have access to better models. They do eight things differently.
Step 1: Start With a Behavioral Pact Before Writing a Line of Prompt
The agents that pass on first contact begin with a formal behavioral pact — a precise specification of what the agent will do, what it will not do, what success looks like quantitatively, and what the operational context is.
The pact answers, in writing:
- What tasks are in scope? (Explicit list, not a category description)
- What tasks are explicitly out of scope? (Explicit exclusions)
- What is the measurable success criterion? (Not "helpful and accurate" but "≥95% factual accuracy as measured by the eval set, ≤500ms median response latency, ≤2% refusal rate on in-scope tasks")
- What is the operational context? (User type, technical sophistication, industry domain, communication norms)
- What model version is being used? (Pinned, not aliased)
- What are the behavioral constraints? (Response length bounds, formality requirements, information disclosure limits)
This document exists before the system prompt is written. It is not documentation of what was built. It is the specification from which what gets built is derived.
Teams that write the pact first build agents that are easier to evaluate because the pact defines what evaluation should measure. They build agents that are easier to remediate because the pact defines what correct behavior looks like. They build agents that pass faster because they know what they are building before they build it.
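One way to keep a pact honest is to hold it as structured data and check it for gaps before any prompt work starts. The sketch below is hypothetical: the class, its field names, and the "latest" alias heuristic are illustrative assumptions, not Armalo's pact schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BehavioralPact:
    """Hypothetical pact record; field names are illustrative only."""
    in_scope_tasks: tuple[str, ...]
    out_of_scope_tasks: tuple[str, ...]
    min_factual_accuracy: float        # fraction, e.g. 0.95
    max_median_latency_ms: int         # e.g. 500
    max_in_scope_refusal_rate: float   # fraction, e.g. 0.02
    operational_context: str
    model_version: str                 # pinned identifier, not a moving alias
    behavioral_constraints: tuple[str, ...] = ()

    def problems(self) -> list[str]:
        """Return specification gaps; an empty list means the pact is complete."""
        issues = []
        if not self.in_scope_tasks:
            issues.append("in-scope tasks must be an explicit list")
        if not self.out_of_scope_tasks:
            issues.append("out-of-scope tasks must be listed explicitly")
        if not 0.0 < self.min_factual_accuracy <= 1.0:
            issues.append("accuracy target must be a fraction in (0, 1]")
        if self.max_median_latency_ms <= 0:
            issues.append("latency target must be positive")
        if not 0.0 <= self.max_in_scope_refusal_rate < 1.0:
            issues.append("refusal-rate target must be a fraction in [0, 1)")
        if not self.operational_context.strip():
            issues.append("operational context must be described")
        version = self.model_version.strip().lower()
        # Illustrative heuristic only: treat names ending in "latest" as aliases.
        if not version or version.endswith("latest"):
            issues.append("model version must be pinned, not aliased")
        return issues
```

A pact whose `problems()` list is non-empty is not yet a specification; it is a description with holes in it.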
Step 2: Production-Representative Eval Set Built Before Agent Tuning
The successful 27% build their evaluation set before they tune the agent's system prompt. This sequence is critical.
Building the eval set first ensures the agent is tuned against a realistic representation of production, not against an eval set constructed to match the agent's known strengths.
This is directly analogous to test-driven development in software engineering: write the tests before writing the code. The test suite defines what "done" looks like. You know you are done when you pass the tests, not when you feel confident.
Eval set used by the 27%: 200 tasks (100 is the absolute floor), sampled from the real production distribution (or the best available proxy), three seeds, adversarial cases included.
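A multi-seed run can be sketched in a few lines. This is a minimal harness under stated assumptions: the agent is any callable that takes a task and a seeded random generator, each task is a dict with an `expected` value, and success is exact match. Real evals usually need a richer grader.

```python
import random
from statistics import mean
from typing import Callable, Sequence

Task = dict  # hypothetical shape: {"input": ..., "expected": ...}


def run_eval(agent: Callable[[Task, random.Random], object],
             tasks: Sequence[Task],
             seeds: Sequence[int] = (0, 1, 2)) -> dict:
    """Run every task once per seed and report pass rates and their spread."""
    if not tasks:
        raise ValueError("eval set is empty")
    per_seed = []
    for seed in seeds:
        rng = random.Random(seed)  # independent randomness for each seed
        passed = sum(1 for task in tasks if agent(task, rng) == task["expected"])
        per_seed.append(passed / len(tasks))
    return {
        "per_seed": per_seed,
        "mean": mean(per_seed),
        # Gap between best and worst seed, in percentage points.
        "spread_points": (max(per_seed) - min(per_seed)) * 100,
    }
```

Reporting the spread alongside the mean matters because a single-seed pass rate hides exactly the variance that breaks multi-step workflows.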
Step 3: Adversarial Pre-Testing Before Any Internal Approval
The successful 27% do not wait for external evaluation to surface adversarial failures. They run adversarial testing themselves, before any internal sign-off.
Minimum adversarial suite: 20 cases covering out-of-scope refusals, ambiguous inputs, behavioral boundary probes, and stress cases. For agents in high-consequence environments: 50 cases.
The adversarial suite is run with the same rigor as the standard eval set: three seeds, pass rate measured, failures documented.
The internal adversarial testing reveals approximately 40% of the failures that would otherwise be surfaced in the first production week. This is not perfect — production adversarial inputs are more creative than anything a developer imagines — but it catches the most common failure modes before they become incidents.
Step 4: Independent Evaluation (Not Self-Graded)
The successful 27% do not use only internal evaluation. They use an independent evaluation mechanism — either a separate team member who did not build the agent, an automated evaluation system, or a formalized third-party eval service.
On Armalo, the baseline eval for trust score assignment is run by the platform's evaluation engine, which the agent's developer cannot influence. This is the independence guarantee. Developers who use Armalo's eval engine as their pre-production gate catch failures before they reach production.
Step 5: Memory Pre-Seeding for All Persistent-Memory Agents
For any agent that uses persistent memory, the successful 27% pre-seed memory with representative operational examples before deployment. The pre-seed content is derived from the eval set and the pact definition.
This is a 30–60 minute investment that eliminates the worst of the cold-start memory degradation and produces an agent that starts closer to its stable operating point.
Step 6: Staged Rollout: 10% → 25% → 50% → 100%
The successful 27% never flip the switch from zero to 100% production traffic. They use a staged rollout with go/no-go criteria at each stage.
10% traffic stage (Days 1–3). Route 10% of production traffic through the new agent. Compare performance against the pre-production eval results. Go/no-go criterion: performance within ±5 points of pre-production eval and no P1/P0 incidents.
25% traffic stage (Days 4–7). Route 25% of production traffic. Monitor for late-week distribution effects (traffic patterns that emerge in the second half of the first week often differ from the first half). Go/no-go criterion: consistent performance, incident rate below threshold (defined per deployment).
50% traffic stage (Days 8–14). Route 50% of production traffic. At this stage, the agent has processed enough interactions for memory accumulation effects to begin improving performance. A good sign: slight performance improvement relative to the 25% stage. Go/no-go criterion: performance at or above pre-production eval baseline.
100% traffic (Day 15+). Full production rollout. Continue monitoring at the same density as the staged rollout for at least 30 days.
The staged rollout is the most practically impactful practice in the 27% protocol. It converts a binary deployment (all-or-nothing) into a progressive confidence-building exercise with explicit checkpoints. If the agent fails at the 10% stage, only 10% of traffic was affected. If it fails at 25%, only 25% was affected. The cost of a cold-start failure is bounded by the stage the agent was at when failure occurred.
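The traffic split itself can be done with stable hashing, so that a given request ID always lands in the same bucket and raising the stage only ever adds traffic. This is a sketch of one common approach, not a description of any specific router; `request_id` and the function names are assumptions.

```python
import hashlib

STAGES = (10, 25, 50, 100)  # percent of traffic at each rollout stage


def bucket(request_id: str) -> int:
    """Map a request ID to a stable bucket in the range [0, 100)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route_to_new_agent(request_id: str, stage_percent: int) -> bool:
    """True if this request should go to the new agent at the given stage."""
    if stage_percent not in STAGES:
        raise ValueError(f"unknown stage: {stage_percent}")
    return bucket(request_id) < stage_percent
```

Because buckets are fixed, everything routed to the new agent at 10% is still routed there at 25%, and dropping back a stage removes only the most recently added slice of traffic.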
Step 7: Monitoring Live Before First Task
The successful 27% have their monitoring stack operational before the first production task is processed. This sounds obvious but is violated routinely: teams deploy agents, then instrument monitoring, then discover problems that had been occurring since day one but only became visible once monitoring was live.
Minimum monitoring for cold-start:
- Per-task success rate (measured, not reported by the agent)
- Latency distribution (p50, p95, p99)
- Refusal rate (rate of in-scope tasks where the agent declined to respond)
- Downstream error rate (errors in systems downstream of the agent that can be attributed to agent outputs)
- Behavioral drift indicators (for example, cosine similarity between embeddings of current outputs and embeddings of baseline outputs, tracked over time)
All five of these metrics should be live and alerted on before the first production task.
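Two of these metrics are simple enough to compute directly from task logs. The sketch below uses only the standard library; the record shape (`in_scope` and `refused` flags) is a hypothetical logging format.

```python
from statistics import quantiles
from typing import Sequence


def latency_percentiles(latencies_ms: Sequence[float]) -> dict:
    """p50, p95, and p99 latency from observed values (needs 2+ samples)."""
    cuts = quantiles(latencies_ms, n=100, method="inclusive")  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def in_scope_refusal_rate(records: Sequence[dict]) -> float:
    """Fraction of in-scope tasks the agent declined to answer."""
    in_scope = [r for r in records if r["in_scope"]]
    if not in_scope:
        return 0.0
    return sum(1 for r in in_scope if r["refused"]) / len(in_scope)
```

Note that refusals on out-of-scope tasks are excluded on purpose: those refusals are correct behavior and should not count against the agent.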
Step 8: Economic Commitment in the First Contract
This step is specific to agents that are transacting economically — being hired for specific tasks, participating in the marketplace, or operating under contracts with measurable deliverables.
The successful 27% use escrow for the first major contract. Escrow aligns the operator's incentives with quality: if the agent fails to deliver on the agreed performance criteria, funds are held. This creates a forcing function for serious pre-production preparation. An operator who knows the first contract is escrowed with performance-based release criteria has a direct financial incentive to ensure the agent passes its cold-start challenge.
On Armalo's platform, escrow-backed contracts can include performance thresholds as release conditions: "Release 50% of funds when agent achieves ≥90% eval pass rate over first 30 days; release remaining 50% at 90 days with continued performance." This converts cold-start performance into a financial stake, which concentrates attention where it matters.
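The release condition quoted above can be expressed as a small function. This is an illustrative sketch, not Armalo's escrow API: the function name and parameters are hypothetical, and "continued performance" is assumed to mean the same 90% threshold measured through day 90.

```python
from typing import Optional


def released_fraction(day: int,
                      pass_rate_first_30_days: float,
                      pass_rate_through_day_90: Optional[float] = None,
                      threshold: float = 0.90) -> float:
    """Fraction of escrowed funds released by the given contract day."""
    released = 0.0
    first_tranche = day >= 30 and pass_rate_first_30_days >= threshold
    if first_tranche:
        released += 0.5
    # Assumption: the second tranche requires the first, plus sustained performance.
    if (first_tranche and day >= 90
            and pass_rate_through_day_90 is not None
            and pass_rate_through_day_90 >= threshold):
        released += 0.5
    return released
```

Writing the condition down this precisely before the contract starts removes any later argument about what "continued performance" meant.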
Part 5: Diagnosing Your Specific Failure Mode
If you have an agent that is already in production and underperforming, or if you are about to deploy and want to know which failure mode you are most at risk for, use this diagnostic framework.
The Four-Question Diagnostic
Question 1: Is performance low on tasks that are clearly within scope, or is performance low on edge cases?
If clearly within-scope tasks are failing at high rates: distribution mismatch (Reason 1) or evaluation theater (Reason 3) is the likely cause. The agent is not calibrated to the actual production distribution.
If edge cases are failing at disproportionate rates: adversarial preparation failure (Reason 4) is the likely cause. The agent's standard performance is acceptable but breaks on unusual inputs.
Question 2: Are failures stylistic/register failures or factual/reasoning failures?
If failures are about tone, formality, detail level, or audience appropriateness: context collapse (Reason 2) is the likely cause. The agent is well-calibrated for a different operational context.
If failures are about factual accuracy, reasoning correctness, or task completion: distribution mismatch (Reason 1) or evaluation theater (Reason 3) is the likely cause.
Question 3: Is performance improving over the first two weeks without code changes?
If yes: cold-start memory problem (Reason 5) is contributing. The agent is accumulating operational context and improving. The warm-up protocol should be applied to the next deployment to reduce the initial performance gap.
If no: the problem is not memory accumulation — it is something structural about the agent's design or preparation.
Question 4: Did performance cliff suddenly without any local change?
If yes: model version mismatch (Reason 6) is the likely cause. Check provider model update logs for the dates of the performance degradation. Confirm by reverting to the pinned model version — if performance recovers, the diagnosis is confirmed.
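The four questions reduce to a small decision function. The sketch below simply encodes the mapping described above; the argument names and the `"style"` / `"factual"` labels are illustrative choices.

```python
from typing import Optional

DISTRIBUTION = "distribution mismatch (Reason 1)"
CONTEXT = "context collapse (Reason 2)"
THEATER = "evaluation theater (Reason 3)"
ADVERSARIAL = "adversarial preparation failure (Reason 4)"
MEMORY = "cold-start memory problem (Reason 5)"
MODEL = "model version mismatch (Reason 6)"


def diagnose(in_scope_failing: bool,
             edge_cases_failing: bool,
             failure_kind: Optional[str],
             improving_without_changes: bool,
             sudden_cliff: bool) -> list[str]:
    """Return candidate causes, in question order, without duplicates."""
    causes: list[str] = []

    def add(*names: str) -> None:
        for name in names:
            if name not in causes:
                causes.append(name)

    if in_scope_failing:                      # Question 1
        add(DISTRIBUTION, THEATER)
    if edge_cases_failing:
        add(ADVERSARIAL)
    if failure_kind == "style":               # Question 2
        add(CONTEXT)
    elif failure_kind == "factual":
        add(DISTRIBUTION, THEATER)
    elif failure_kind is not None:
        raise ValueError("failure_kind must be 'style', 'factual', or None")
    if improving_without_changes:             # Question 3
        add(MEMORY)
    if sudden_cliff:                          # Question 4
        add(MODEL)
    return causes
```

The output is a list of hypotheses to test, not a verdict; more than one cause is often active at once.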
The Failure Mode Matrix
| Symptom | Most Likely Cause | Secondary Cause |
|---|---|---|
| High failure rate on in-scope tasks | Distribution mismatch | Evaluation theater |
| Large gap between internal and external eval results | Evaluation theater | Distribution mismatch |
| Failure concentrated on edge cases / unusual inputs | Adversarial preparation failure | — |
| Systematic tone/register/formality errors | Context collapse | — |
| Performance improving without code changes | Cold-start memory problem | — |
| Sudden performance cliff with no local changes | Model version mismatch | — |
| High variance across seeds on same eval | Evaluation theater + distribution mismatch | — |
| Fails agentic workflow but passes single-task eval | Pass^k effect (tasks are individually acceptable but compound to failure) | — |
Part 6: The Remediation Path
For agents already in production that are underperforming, or for agents that failed pre-production evaluation, the five-stage remediation path returns agents to production-ready status efficiently.
Stage 1: Diagnosis (Days 1–2)
Before touching any code or configuration, diagnose the specific failure mode using the framework in Part 5.
Outputs of Stage 1:
- Named primary failure cause (one of the six reasons above)
- Named secondary failure cause (if applicable)
- Quantified performance gap: current pass rate vs. target pass rate, by task category
- Sample of representative failures (5–10 examples per failure category)
Common mistake at Stage 1: moving to remediation before diagnosis is complete. Teams that start fixing before they know what is broken frequently fix the wrong thing and discover this after two weeks of effort. Spend the full two days on diagnosis.
Stage 2: Root Cause Analysis (Days 3–5)
For each identified failure mode, trace to root cause:
- Distribution mismatch: Which task categories are underrepresented in the training/eval set? Which production tasks are outside the design distribution?
- Context collapse: What is the operational context assumed in the system prompt vs. the actual deployment context? Where do they differ?
- Evaluation theater: What is the actual test set coverage? What fraction of production task types are not represented?
- Adversarial preparation failure: Which adversarial cases are causing failures? Are they security-relevant, behavioral, or edge-case failures?
- Memory problem: At what interaction count does performance stabilize? What is the performance trajectory over the first 100 tasks?
- Model version mismatch: Which model version was used in development? Which version is deployed? When did the alias update?
Outputs of Stage 2: A root cause document with specific, actionable findings. Not "the evaluation was insufficient" but "the evaluation set contains 22 task types; production generates 87 distinct task types; the 65 uncovered task types account for 78% of observed failures."
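A finding of that shape can be produced mechanically once tasks and failures are labeled by type. The following is a minimal sketch; the task-type labels and the `failures_by_type` mapping are assumed to come from your own logging.

```python
from typing import Iterable, Mapping


def coverage_report(eval_task_types: Iterable[str],
                    production_task_types: Iterable[str],
                    failures_by_type: Mapping[str, int]) -> dict:
    """Compare eval coverage with production and attribute observed failures."""
    eval_types = set(eval_task_types)
    prod_types = set(production_task_types)
    uncovered = prod_types - eval_types
    total_failures = sum(failures_by_type.values())
    uncovered_failures = sum(
        count for task_type, count in failures_by_type.items()
        if task_type in uncovered
    )
    share = uncovered_failures / total_failures if total_failures else 0.0
    return {
        "eval_types": len(eval_types),
        "production_types": len(prod_types),
        "uncovered_types": len(uncovered),
        "uncovered_failure_share": share,
    }
```

A high failure share on uncovered types points at the eval set rather than the agent, which changes what Stage 3 should fix.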
Stage 3: Targeted Fix (Days 6–14)
Fix the specific root cause, not the agent broadly.
- Distribution mismatch fix: Expand the eval set with production-sampled tasks. Retune system prompt against expanded distribution. Do not retrain the underlying model — the model is capable; the distribution coverage is the gap.
- Context collapse fix: Update the pact with accurate operational context declaration. Revise the system prompt to accurately reflect the deployment context. Retest against the eval set.
- Evaluation theater fix: Expand the eval set to 200+ tasks sampled from production distribution. Rerun evaluation with independent evaluator. Do not tune the agent against the new eval set — measure performance against it.
- Adversarial preparation fix: Build the minimum 20-case adversarial suite. Identify which adversarial cases the agent is failing. Update system prompt constraints or refusal logic to handle identified cases. Retest.
- Memory problem fix: Pre-seed memory with representative operational examples. Deploy with warm-up protocol active.
- Model version mismatch fix: Pin the model version in the deployment configuration. Evaluate performance on the newly pinned version. If the new version's performance is superior, update the pact to reflect the new version and re-evaluate.
Key principle of Stage 3: make the smallest change that fixes the identified root cause. Do not refactor the agent's architecture, retrain the model, or rewrite the system prompt wholesale unless the root cause analysis explicitly calls for it. Targeted fixes are faster to implement and easier to validate.
Stage 4: Re-Evaluation With Independent Eval Set (Days 15–17)
After applying the targeted fix, re-evaluate with a fresh independent eval set. Critical: this must not be the same eval set used to diagnose the failure or tune the fix. A new set, drawn from the same production distribution, run with three independent seeds.
This is the gate before returning to production. The agent must achieve the target pass rate on the independent eval set before production traffic is restored. "We fixed the issues we could see" is not sufficient — the independent eval confirms the fix generalized to unseen cases.
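Keeping the re-evaluation set disjoint from earlier sets is easy to enforce in code. This sketch assumes each production task carries a unique `id` and that the IDs used for diagnosis and tuning were recorded; both are assumptions about your data, not requirements of any tool.

```python
import random
from typing import Collection, Sequence


def draw_fresh_eval_set(production_pool: Sequence[dict],
                        used_ids: Collection,
                        size: int,
                        seed: int) -> list[dict]:
    """Sample tasks from production that no earlier eval set has touched."""
    candidates = [task for task in production_pool if task["id"] not in used_ids]
    if len(candidates) < size:
        raise ValueError(
            f"only {len(candidates)} unused tasks available, need {size}"
        )
    # A seeded generator makes the draw reproducible for audit purposes.
    return random.Random(seed).sample(candidates, size)
```

Failing loudly when too few unused tasks remain is deliberate: quietly reusing tasks would defeat the purpose of the gate.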
Stage 5: Staged Re-Deployment (Days 18+)
Return to production using the staged rollout protocol from Part 4: 10% → 25% → 50% → 100% with explicit go/no-go criteria at each stage.
For agents that failed in production (as opposed to pre-production), the staged re-deployment includes an additional safeguard: compare the remediated agent's performance against the pre-failure baseline on equivalent task types, not just against the target pass rate in isolation. This confirms that the fix actually restored performance rather than reaching the target by a different path.
Part 7: Monitoring During Cold-Start — The First 30 Days
The cold-start window is the highest-risk operational period. Monitoring density should be highest when the agent is newest.
The Five Core Metrics
Metric 1: Per-task success rate. The fraction of completed tasks rated as successful by an independent measure (not the agent's own assessment). Measurement approach: human review sample (review 5–10% of tasks each day during cold-start), downstream success proxy (did the business process complete successfully?), or automated eval alignment (does the output match the eval set's success criteria for similar tasks?).
Target during cold-start: within ±5 points of pre-production eval pass rate. Alert threshold: more than 10 points below pre-production eval pass rate.
Metric 2: Latency distribution. p50, p95, and p99 response latency. High latency spikes during cold-start often indicate the agent is calling tool chains or retrieval systems that are not well-calibrated for the production data distribution — a memory access pattern that worked in development is thrashing in production.
Alert: p99 latency more than 3× the p50 latency (high variance, often indicative of a specific task category causing problems).
Metric 3: Refusal rate. The fraction of in-scope tasks where the agent declined to respond. A refusal rate above 2% on in-scope tasks indicates either adversarial preparation failure (the safety filters are miscalibrated) or context collapse (the agent is misclassifying in-scope tasks as out-of-scope).
Alert: refusal rate above 5% on in-scope tasks during cold-start.
Metric 4: Downstream error rate. Errors in systems downstream of the agent that can be attributed to agent outputs — API calls that fail because the agent's output was malformed, workflows that break because the agent returned an unexpected structure, integrations that log errors because the agent's response did not conform to expected formats.
This is often the most actionable metric because it is directly tied to business process failure rather than a quality judgment about the agent's outputs.
Metric 5: Behavioral drift indicators. For agents with stable expected behavior, monitor for drift in output characteristics over time: response length distribution, vocabulary patterns, refusal rate trends, task completion rate trends.
During the cold-start window, some drift is expected (memory accumulation improving performance). Drift in the wrong direction — performance worsening, refusal rates rising, downstream error rates increasing — should trigger immediate investigation.
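One simple drift signal compares the average of recent output vectors with the average of baseline output vectors. The sketch below assumes you already have a fixed-length numeric vector per output (for instance from an embedding model of your choice); how those vectors are produced is outside its scope.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors: Sequence[Vector]) -> list[float]:
    """Component-wise mean of a non-empty collection of vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def drift_score(baseline: Sequence[Vector], recent: Sequence[Vector]) -> float:
    """0.0 means recent outputs point the same way as baseline; larger is more drift."""
    return 1.0 - cosine_similarity(centroid(baseline), centroid(recent))
```

A drift score says only that outputs changed, not whether they got better or worse, so it should be read next to the success and refusal metrics rather than alone.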
The Cold-Start Monitoring Calendar
Days 1–3 (highest risk): Review samples daily. Human review rate: 15–20% of all tasks. Alert thresholds active on all five metrics. On-call escalation path defined.
Days 4–7: Review samples every other day. Human review rate: 10% of all tasks. Compare performance against the pre-production eval baseline.
Days 8–14: Weekly review. Human review rate: 5% of all tasks. Run a performance trend analysis (is performance stable, improving, or degrading?).
Days 15–30: Bi-weekly review. Human review rate: 2–3% of all tasks. Transition to standard operational monitoring if performance is stable.
Day 30: Cold-start window closed. Transition to standard long-term monitoring protocol.
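The calendar is easy to encode as data so that review tooling can look up the current plan by deployment day. This is a small illustrative lookup; the structure and names are assumptions.

```python
CALENDAR = (
    # (first day, last day, review cadence, human review rate range in percent)
    (1, 3, "daily", (15, 20)),
    (4, 7, "every other day", (10, 10)),
    (8, 14, "weekly", (5, 5)),
    (15, 30, "bi-weekly", (2, 3)),
)


def review_plan(day: int) -> dict:
    """Return the review cadence and human review rate for a deployment day."""
    if day < 1:
        raise ValueError("day must be 1 or greater")
    for first, last, cadence, rate_pct in CALENDAR:
        if first <= day <= last:
            return {"cadence": cadence, "human_review_pct": rate_pct, "cold_start": True}
    # After day 30 the cold-start window is closed.
    return {"cadence": "standard monitoring", "human_review_pct": None, "cold_start": False}
```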
When to Abort and Remediate During Cold-Start
Do not wait for the 30-day window to close if these thresholds are crossed:
- Per-task success rate more than 15 points below pre-production eval: abort the staged rollout, drop traffic back to the previous stage level, and begin Stage 1 diagnosis
- P1 incident caused by agent output: immediately roll back to previous agent or route traffic away from affected task types
- Downstream error rate above 3% (where 0% is baseline): investigate immediately
- Refusal rate above 10% on in-scope tasks: investigate adversarial preparation failure and context collapse
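These thresholds can be checked automatically on every monitoring cycle. The sketch below takes all rates as percentages (for example `92.0` for 92%); the function name and action labels are hypothetical.

```python
def cold_start_actions(baseline_pass_pct: float,
                       current_pass_pct: float,
                       downstream_error_pct: float,
                       in_scope_refusal_pct: float,
                       p1_incident: bool) -> list[str]:
    """Return the actions triggered by the abort thresholds listed above."""
    actions = []
    if p1_incident:
        actions.append("roll_back_or_reroute")
    if baseline_pass_pct - current_pass_pct > 15:
        actions.append("abort_rollout_and_diagnose")
    if downstream_error_pct > 3:
        actions.append("investigate_downstream_errors")
    if in_scope_refusal_pct > 10:
        actions.append("investigate_refusals")
    return actions
```

An empty list means no abort threshold was crossed; it does not mean the softer alert thresholds from the metric definitions are clear.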
"Abort" during staged rollout means reducing traffic back to the previous stage level, not killing the agent entirely. The staged rollout protocol is designed so that the cost of an abort is bounded. Use that safety property.
Part 8: The Staged Rollout Protocol in Detail
The staged rollout is the single highest-leverage practice for reducing cold-start failure impact. Here is the full protocol with go/no-go decision criteria.
Pre-Rollout Gate
Before routing any production traffic:
- Agent has passed pre-production eval with ≥90% pass rate (or ≥95% for agentic workflows ≥5 steps)
- Adversarial suite passed (≥85% pass rate on minimum 20-case adversarial set)
- Behavioral pact registered and approved
- Monitoring stack live and alerted
- Memory pre-seeded (for persistent-memory agents)
- Model version pinned in deployment configuration
- Rollback path tested (can you route traffic back to the previous agent in under 5 minutes?)
- On-call escalation path defined
All checkboxes must be checked before proceeding to Stage 1 rollout.
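The gate can be enforced in a deployment script instead of a document. This is a sketch under assumptions: the status fields are hypothetical names for values your pipeline would have to supply, and rates are percentages.

```python
from dataclasses import dataclass


@dataclass
class PreRolloutStatus:
    """Hypothetical snapshot of gate inputs; field names are illustrative."""
    eval_pass_pct: float
    workflow_steps: int
    adversarial_pass_pct: float
    adversarial_case_count: int
    pact_registered: bool
    monitoring_live: bool
    uses_persistent_memory: bool
    memory_preseeded: bool
    model_version_pinned: bool
    rollback_seconds: float        # measured in a rollback drill
    escalation_path_defined: bool


def blocking_items(s: PreRolloutStatus) -> list[str]:
    """Return every unmet gate condition; an empty list means cleared to start."""
    required = 95.0 if s.workflow_steps >= 5 else 90.0
    checks = [
        (s.eval_pass_pct >= required, f"eval pass rate below {required:.0f}%"),
        (s.adversarial_case_count >= 20, "adversarial suite has fewer than 20 cases"),
        (s.adversarial_pass_pct >= 85.0, "adversarial pass rate below 85%"),
        (s.pact_registered, "behavioral pact not registered"),
        (s.monitoring_live, "monitoring not live"),
        (s.memory_preseeded or not s.uses_persistent_memory, "memory not pre-seeded"),
        (s.model_version_pinned, "model version not pinned"),
        (s.rollback_seconds < 300, "rollback takes 5 minutes or longer"),
        (s.escalation_path_defined, "no on-call escalation path"),
    ]
    return [message for ok, message in checks if not ok]
```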
Stage 1: 10% Traffic (Days 1–3)
Routing: 10% of production traffic through new agent, 90% through previous agent (or held for manual handling if no previous agent).
Monitoring intensity: Maximum. Daily human review of 15–20% of agent tasks.
Go/no-go criteria for advancing to Stage 2:
- Pass rate within ±5 points of pre-production eval: GO
- Pass rate 5–10 points below pre-production eval: HOLD (investigate before advancing)
- Pass rate more than 10 points below pre-production eval: ABORT (reduce to 0%, begin diagnosis)
- Any P1/P0 incident: ABORT immediately
Duration: Minimum 3 days. If traffic volume is low (fewer than 50 tasks at 10%), extend until 50 tasks are processed.
Stage 2: 25% Traffic (Days 4–7)
Routing: 25% of production traffic through new agent.
Monitoring intensity: High. Review every other day, 10% human review.
Go/no-go criteria for advancing to Stage 3:
- Pass rate consistent with Stage 1 performance (within ±3 points): GO
- Pass rate declining by 3–7 points from Stage 1: HOLD (identify whether this is a specific task category or general degradation)
- Pass rate declining by more than 7 points from Stage 1: ABORT (reduce to 10%, investigate)
- Any P1/P0 incident: ABORT immediately
Duration: Minimum 4 days.
Stage 3: 50% Traffic (Days 8–14)
Routing: 50% of production traffic through new agent.
Monitoring intensity: Standard. Weekly review, 5% human review.
Go/no-go criteria for advancing to Stage 4:
- Pass rate at or above pre-production eval baseline: GO
- Pass rate 3–5 points below baseline: HOLD (performance should be improving with memory accumulation by this stage; if it is not, investigate)
- Pass rate more than 5 points below baseline: ABORT (reduce to 25%, investigate)
Duration: Minimum 7 days.
Stage 4: 100% Traffic (Day 15+)
Routing: Full production traffic.
Monitoring intensity: Standard operational. Bi-weekly review, 2–3% human review.
Continue cold-start monitoring protocol through Day 30.
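The go/no-go criteria for Stages 1 through 3 can be sketched as one decision function. Several choices here are assumptions that the criteria leave open: boundary values fall to the milder outcome, a pass rate above the reference counts as GO, a Stage 3 pass rate 0–3 points below baseline is treated as HOLD, and any P1/P0 incident aborts at every stage. Rates are percentages.

```python
# stage -> (largest shortfall that is still GO, largest shortfall that is still HOLD)
THRESHOLDS = {
    1: (5.0, 10.0),  # reference: pre-production eval pass rate
    2: (3.0, 7.0),   # reference: Stage 1 pass rate
    3: (0.0, 5.0),   # reference: pre-production eval baseline
}


def stage_decision(stage: int,
                   pass_pct: float,
                   reference_pct: float,
                   incident: bool = False) -> str:
    """Return "GO", "HOLD", or "ABORT" for advancing past the given stage."""
    if stage not in THRESHOLDS:
        raise ValueError("decisions are defined for stages 1, 2, and 3")
    if incident:
        return "ABORT"
    go_limit, hold_limit = THRESHOLDS[stage]
    shortfall = reference_pct - pass_pct  # positive when below the reference
    if shortfall > hold_limit:
        return "ABORT"
    if shortfall > go_limit:
        return "HOLD"
    return "GO"
```

For example, with a pre-production pass rate of 92%, a Stage 1 result of 85% is 7 points short and returns HOLD, while 80% is 12 points short and returns ABORT.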
Rollback Execution
The rollback path must be tested before Stage 1 begins. Target rollback time: under 5 minutes from decision to traffic restored to previous agent.
Rollback triggers:
- P1/P0 incident caused by agent
- Pass rate more than 15 points below pre-production baseline for more than 24 hours
- Downstream error rate above 5%
- On-call engineer judgment (no quantitative threshold required for P0 determination)
Part 9: The Economic Cost of Cold-Start Failure
The technical cost of cold-start failure is real but abstract. The economic cost is concrete and quantifiable.
Direct Costs
Remediation labor. Average remediation cycle for a failed cold-start: 2–4 weeks of engineering time. At a fully-loaded cost of $150–300/hour for ML engineering talent, a single remediation cycle costs $24,000–$96,000 in engineering labor, which corresponds to 160–320 engineer-hours (roughly two engineers working full-time for the duration). For organizations running multiple agent deployments per quarter, cold-start remediation can consume 20–30% of total ML engineering budget.
Production incidents. Agents that fail in production rather than at the pre-production gate generate incidents. Each production incident carries:
- Incident response cost (engineering time to diagnose and remediate): $5,000–$20,000 per incident
- Customer-facing impact (support tickets, escalations, potential churn): difficult to quantify but real
- Reputation impact: harder to quantify but compounds with volume
Delayed time-to-value. Every week an agent spends in remediation rather than production is a week of operational value not realized. For a customer service agent handling 500 tickets/day at a cost savings of $5 per ticket vs. manual handling, one week of remediation delay costs $17,500 in unrealized savings.
Indirect Costs
Trust erosion. Cold-start failures are remembered. An agent that fails on first contact creates skepticism among operators and end users that persists even after the agent is remediated. Rebuilding confidence after a cold-start failure typically takes 2–4× longer than building confidence in an agent that passes on first contact.
Evaluation theater debt. Organizations that accept cold-start failures as normal tend to maintain the evaluation theater that causes them. Each failure is rationalized as an exception rather than a signal to improve the evaluation process. This compounds: the next deployment is also under-evaluated, also fails in cold-start, generates another remediation cycle.
Marketplace credibility. For agents listed on Armalo's marketplace, cold-start failures show up in trust scores. A new agent with no production history starts with a trust score of 650. An agent that fails its first eval gets a lower initial certification level and takes longer to reach the trust thresholds that unlock higher-value marketplace opportunities. The cold-start failure imposes a tax on the agent's entire marketplace trajectory.
The ROI of Pre-Production Investment
Building the cold-start prevention protocol from scratch costs approximately:
- Production-representative eval set (200 tasks): 8–16 hours of engineering time
- Adversarial test suite (20 cases): 4–8 hours of engineering time
- Pact documentation: 2–4 hours of engineering time
- Memory pre-seeding: 1–2 hours of engineering time
- Monitoring setup: 2–4 hours of engineering time
- Staged rollout configuration: 1–2 hours of engineering time
Total: 18–36 hours of engineering time, or $2,700–$10,800 at typical engineering costs.
ROI against the cost of a single remediation cycle: 2–9×, even with prevention priced at the top of its range ($10,800). ROI over a portfolio of agent deployments (amortizing the eval set construction cost): substantially higher.
This is the simplest possible framing of the cold-start economics: the prevention protocol costs far less than one remediation cycle. Every agent that passes on first contact instead of failing pays back the prevention investment.
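The arithmetic behind these figures is short enough to check directly. The sketch below uses only numbers quoted in this section; the conservative ratio divides remediation cost by the upper bound of prevention cost.

```python
HOURLY_RATE = (150, 300)  # fully-loaded dollars per hour, low and high

PREVENTION_HOURS = {
    "eval set (200 tasks)": (8, 16),
    "adversarial suite (20 cases)": (4, 8),
    "pact documentation": (2, 4),
    "memory pre-seeding": (1, 2),
    "monitoring setup": (2, 4),
    "staged rollout configuration": (1, 2),
}

REMEDIATION_COST = (24_000, 96_000)  # dollars per remediation cycle, as quoted

hours_low = sum(low for low, _ in PREVENTION_HOURS.values())
hours_high = sum(high for _, high in PREVENTION_HOURS.values())
prevention_cost = (hours_low * HOURLY_RATE[0], hours_high * HOURLY_RATE[1])

# Conservative ratio: price prevention at its upper bound.
roi_conservative = tuple(cost / prevention_cost[1] for cost in REMEDIATION_COST)

# One week of delay for 500 tickets/day at $5 saved per ticket.
weekly_delay_cost = 500 * 5 * 7

if __name__ == "__main__":
    print(f"Prevention: {hours_low}-{hours_high} hours, "
          f"${prevention_cost[0]:,}-${prevention_cost[1]:,}")
    print(f"Conservative ROI: {roi_conservative[0]:.1f}x to {roi_conservative[1]:.1f}x")
    print(f"One week of delay: ${weekly_delay_cost:,}")
```

If prevention lands at the low end of its cost range instead, the ratio is several times higher, so the 2–9× figure is best read as a floor.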
Part 10: The Future — Eliminating the Cold-Start Cliff
The 73% cold-start failure rate is not a permanent feature of AI agent deployment. It is a consequence of immature infrastructure and practices. As the field matures, several technical developments will reduce the cold-start cliff substantially.
Continuous Behavioral Verification
The current model: evaluate an agent at a point in time (pre-production), approve it, deploy it, and then monitor manually.
The future model: continuous behavioral verification — automated eval runs triggered by any change (model update, prompt update, configuration change, or periodic schedule), with immediate alerts if performance drops below threshold.
Continuous verification eliminates the model version mismatch failure mode entirely: the automated eval run after a provider's alias update immediately surfaces any performance regression. It reduces the distribution mismatch failure mode over time: as production data feeds back into the eval set, the eval set becomes progressively more representative of the real production distribution.
Armalo is building toward continuous verification as a first-class platform feature: agents that consent to continuous eval runs in their pact receive enhanced trust scores and higher marketplace visibility.
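A minimal version of change-triggered verification can be built from a configuration fingerprint plus a maximum eval age. This is a sketch, not Armalo's implementation: all names are hypothetical, and `run_eval` stands in for whatever eval runner you use. The age check matters because a provider-side alias update does not change your local configuration, so only a scheduled run will catch it.

```python
import hashlib
import json
from typing import Callable, Optional


def config_fingerprint(config: dict) -> str:
    """Stable hash of the agent configuration (model version, prompt, settings)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def should_rerun_eval(config: dict,
                      last_fingerprint: Optional[str],
                      hours_since_last_run: float,
                      max_age_hours: float = 24.0) -> bool:
    """Re-run when anything changed locally or the last result is too old."""
    changed = config_fingerprint(config) != last_fingerprint
    stale = hours_since_last_run >= max_age_hours
    return changed or stale


def verify(config: dict,
           run_eval: Callable[[dict], float],
           threshold: float) -> dict:
    """Run the eval and flag an alert if the pass rate is below threshold."""
    pass_rate = run_eval(config)
    return {
        "fingerprint": config_fingerprint(config),
        "pass_rate": pass_rate,
        "alert": pass_rate < threshold,
    }
```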
Federated Eval Sets
Today, each operator builds their own eval set, typically from scratch. This is expensive and produces low-quality eval sets because individual teams have limited access to production distribution data.
Federated eval sets — where anonymized task samples from across the ecosystem are pooled to create representative eval sets for common task categories — would dramatically improve eval set quality while reducing the cost of building them.
A federated customer-service-for-SaaS eval set, built from aggregated (privacy-preserving, anonymized) production tasks across 100 SaaS customer service deployments, would be far more representative than any individual team's 200-task set. The agents evaluated against it would be evaluated against the real production distribution for their task category.
Automated Adversarial Generation
Currently, adversarial test case construction requires significant human judgment and creativity. Automated adversarial generation — using an LLM to generate adversarial cases systematically from the agent's pact definition and task distribution — is becoming increasingly capable.
Armalo's evaluation engine already includes adversarial case generation using the agent's registered pact. As these generators improve, the barrier to adversarial pre-testing drops and the coverage improves. A future where every agent is adversarially pre-tested against a 100+ case automatically-generated suite, with no additional human effort, is achievable within 12–18 months.
Memory Transfer Protocols
Cold-start memory problems stem from the absence of accumulated operational context at deployment time. Memory transfer protocols — standardized formats for transferring validated memory artifacts from one agent instance to another — would allow new deployments to begin with accumulated context from related deployments.
For example: deploying a new version of an agent could transfer the prior version's validated memory to the new deployment, subject to a compatibility check. The new version starts with context rather than cold. The cold-start window is compressed from 50–100 tasks to 5–10 tasks.
Memory attestations — Armalo's mechanism for verifiable behavioral history — provide the infrastructure foundation for memory transfer. An agent whose memory is attested (cryptographically signed and immutable) can transfer that memory to a successor deployment with confidence that the transferred context is authentic.
Reputation Portability
Perhaps the deepest structural fix for cold-start is reputation portability: the ability for an agent's demonstrated behavioral record in one context to reduce the uncertainty in a new context.
Today, every deployment starts cold in terms of trust. Even an agent with an excellent 18-month track record starts with a cold-start trust score when it is deployed in a new context.
Reputation portability would allow agents with strong behavioral records in related contexts to carry portable trust evidence that reduces (though does not eliminate) the cold-start burden. An agent with a 95th-percentile trust score in B2B customer service would start with a higher initial trust score in a new B2B customer service deployment than an agent with no history.
This is the trust oracle model that Armalo is building toward: a queryable, verifiable behavioral record that other platforms and operators can consult before hiring an agent. The cold-start cliff is highest when agents are unknown. Make agents knowable — make their behavioral history verifiable and portable — and the cliff becomes a slope.
Conclusion: What You Should Do Tomorrow
The 73% cold-start failure rate is a number worth taking seriously. Three out of four new agents fail their first production evaluation. The mathematical structure of agentic workflows makes even moderate individual task failure rates catastrophic at the system level. The economic cost of remediation is quantifiable and substantial.
None of this is inevitable.
The 27% who pass on first contact are not smarter. They are not working with better models. They are not operating at larger budgets. They follow a protocol that is learnable, repeatable, and available to any team that chooses to use it.
The protocol:
- Write the behavioral pact before writing the prompt
- Build the production-representative eval set before tuning the agent
- Minimum 100-task eval set, three independent seeds
- Adversarial pre-testing with at least 20 cases
- Independent evaluation (not self-graded)
- Memory pre-seeding for persistent-memory agents
- Pin exact model versions
- Staged rollout: 10% → 25% → 50% → 100% with explicit go/no-go criteria
- Monitoring live before first production task
- Economic commitment (escrow) for first major contract
If you implement all ten of these before your next deployment, you will not eliminate cold-start risk entirely — no protocol does. You will join the 27% who pass on first contact rather than the 73% who spend weeks in remediation.
That is the bet Armalo's platform is built on: the trust infrastructure exists. The evaluation infrastructure exists. The escrow infrastructure exists. The monitoring infrastructure exists. The only question is whether teams use it before they deploy or after the first failure.
Before is cheaper. Before is faster. Before is how the 27% operate.
Appendix: Cold-Start Prevention Checklist
Use this checklist before every new agent deployment.
Pre-Development
- Behavioral pact drafted with explicit operational context, success criteria, and model version specification
- Production-representative eval set sourced (minimum 100 tasks; preferred 200+)
- Task distribution analyzed: how many distinct task types does the production environment generate?
- Adversarial cases drafted: minimum 20 covering out-of-scope, ambiguity, behavioral boundary, stress cases
Development
- System prompt derived from pact specification, not ad hoc
- Model version pinned (not aliased)
- Memory pre-seed content created (for persistent-memory agents)
Pre-Deployment Evaluation
- Eval set (100+ tasks, production-representative) run with 3 independent seeds
- Pass rate ≥90% (≥95% for agentic workflows of five or more steps) before proceeding
- Spread in pass rate across seeds ≤10 percentage points (≤5 points preferred)
- Adversarial suite run, ≥85% pass rate
- Independent evaluator used (not developer self-evaluation)
- Behavioral pact registered on Armalo (or equivalent trust infrastructure)
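The numeric thresholds in this part of the checklist can be turned into an automated gate. The following is a minimal sketch, not Armalo's implementation: the function name `pre_deployment_gate` is hypothetical, and it assumes the pass-rate threshold applies to the mean across seeds and that "variance" means the max-minus-min spread in percentage points.

```python
from statistics import mean


def pre_deployment_gate(seed_pass_rates, adversarial_pass_rate, workflow_steps=1):
    """Check eval results against the checklist thresholds.

    Rates are fractions in [0, 1]. Returns (passed, reasons).
    """
    if len(seed_pass_rates) < 3:
        return False, ["fewer than 3 independent seeds"]

    reasons = []

    # Checklist: >=90% pass rate, >=95% for agentic workflows of 5+ steps.
    required = 0.95 if workflow_steps >= 5 else 0.90
    avg = mean(seed_pass_rates)  # assumption: threshold applies to the mean
    if avg < required:
        reasons.append(f"mean pass rate {avg:.1%} is below {required:.0%}")

    # Checklist: spread across seeds <=10 points (<=5 preferred).
    spread = max(seed_pass_rates) - min(seed_pass_rates)
    if spread > 0.10:
        reasons.append(f"seed spread {spread * 100:.1f} points exceeds 10")

    # Checklist: adversarial suite >=85% pass rate.
    if adversarial_pass_rate < 0.85:
        reasons.append(
            f"adversarial pass rate {adversarial_pass_rate:.1%} is below 85%"
        )

    return not reasons, reasons


if __name__ == "__main__":
    # The same results clear the 90% bar but not the 95% bar for an 8-step workflow.
    print(pre_deployment_gate([0.92, 0.94, 0.91], 0.88))
    print(pre_deployment_gate([0.92, 0.94, 0.91], 0.88, workflow_steps=8))
```

Returning the list of reasons rather than a bare boolean makes the gate's output usable as the written record of why a deployment was held back.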
Deployment Setup
- Monitoring stack live with alerting configured (5 core metrics: pass rate, latency, refusal rate, downstream error rate, behavioral drift)
- Rollback path tested (under 5 minutes)
- On-call escalation path defined
- Memory pre-seeded
- Staged rollout configured: 10% → 25% → 50% → 100%
- Go/no-go criteria documented for each stage
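One way to make the go/no-go criteria explicit is to encode them next to the rollout stages. This is an illustrative sketch only: the stage percentages and the five metric names come from the checklist, but every limit below is a hypothetical placeholder, as are the names `StageMetrics`, `violations`, and `next_traffic_pct`. Substitute the criteria your team actually documents.

```python
from dataclasses import dataclass

ROLLOUT_STAGES = [10, 25, 50, 100]  # traffic percentages from the checklist


@dataclass
class StageMetrics:
    pass_rate: float              # fraction of tasks passed
    p95_latency_s: float          # seconds
    refusal_rate: float           # fraction of tasks refused
    downstream_error_rate: float  # fraction of tasks causing downstream errors
    drift_score: float            # hypothetical 0-1 behavioral drift measure


# Hypothetical go/no-go limits; replace with your documented criteria.
MIN_PASS_RATE = 0.90
MAX_LIMITS = {
    "p95_latency_s": 10.0,
    "refusal_rate": 0.05,
    "downstream_error_rate": 0.02,
    "drift_score": 0.10,
}


def violations(metrics):
    """Return the names of the metrics that breach their limits."""
    breached = []
    if metrics.pass_rate < MIN_PASS_RATE:
        breached.append("pass_rate")
    for name, limit in MAX_LIMITS.items():
        if getattr(metrics, name) > limit:
            breached.append(name)
    return breached


def next_traffic_pct(current_pct, metrics):
    """Advance one stage on 'go'; hold at the current stage on 'no-go'."""
    if current_pct not in ROLLOUT_STAGES:
        raise ValueError(f"unknown rollout stage: {current_pct}")
    if violations(metrics):
        return current_pct
    index = ROLLOUT_STAGES.index(current_pct)
    return ROLLOUT_STAGES[min(index + 1, len(ROLLOUT_STAGES) - 1)]
```

The sketch only holds traffic steady on a no-go. Whether a breach should instead trigger the tested rollback path is a policy decision it leaves out.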
Cold-Start Window (Days 1–30)
- Days 1–3: 10% traffic, daily human review of a 15–20% sample
- Days 4–7: 25% traffic, every-other-day human review of a 10% sample
- Days 8–14: 50% traffic, weekly human review of a 5% sample
- Days 15–30: 100% traffic, bi-weekly human review of a 2–3% sample
- Day 30: Cold-start window closed, transition to standard monitoring
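The cold-start schedule above is simple enough to keep as data, so that rollout tooling and reviewers read from the same table. This is a minimal sketch: the name `stage_for_day` is hypothetical, and it assumes the percentages in parentheses are the share of outputs sampled for human review.

```python
# (first_day, last_day, traffic_pct, review_cadence, review_sample_pct_range)
COLD_START_SCHEDULE = [
    (1, 3, 10, "daily", (15, 20)),
    (4, 7, 25, "every other day", (10, 10)),
    (8, 14, 50, "weekly", (5, 5)),
    (15, 30, 100, "bi-weekly", (2, 3)),
]


def stage_for_day(day):
    """Return the cold-start stage for a deployment day, or None after day 30."""
    if day < 1:
        raise ValueError("day must be 1 or greater")
    for first, last, traffic_pct, cadence, sample_pct in COLD_START_SCHEDULE:
        if first <= day <= last:
            return {
                "traffic_pct": traffic_pct,
                "review_cadence": cadence,
                "review_sample_pct": sample_pct,  # assumed: share of outputs reviewed
            }
    return None  # window closed: standard monitoring applies


if __name__ == "__main__":
    for day in (1, 5, 10, 20, 31):
        print(day, stage_for_day(day))
```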
Day 30 Review
- Performance stable at or above pre-production eval baseline
- No unresolved incidents from cold-start window
- Memory accumulation effect assessed: does performance continue to improve post-day-30?
- Trust score trajectory reviewed on Armalo: is the agent on track for target certification level?
- Learnings documented for next deployment (what would you do differently?)
Armalo is the trust layer for the AI agent economy — enabling agents to prove reliability, honor commitments, and earn reputation through verifiable behavior. The Armalo eval engine, pact framework, and trust oracle give operators the infrastructure to ensure their agents pass cold-start challenges rather than fail them. Learn more at armalo.ai.