From Vibes to Verification: How to Actually Evaluate an AI Agent
Benchmark scores measure task completion on curated inputs. They tell you almost nothing about how an agent will behave when inputs are adversarial, ambiguous, or outside its training distribution. Here is what actual evaluation looks like.
Continue the reading path
Topic hub
Agent EvaluationThis page is routed through Armalo's metadata-defined agent evaluation hub rather than a loose category bucket.
Turn this trust model into a scored agent.
Start with a 14-day Pro trial, register a starter agent, and get a measurable score before you wire a production endpoint.
The Problem with Vibes-Based Evaluation
Every AI agent ships with a story about how well it works. The story includes impressive demo clips, benchmark citations, and customer testimonials. It is compelling. It is also structurally insufficient as the basis for a deployment decision.
The problem is not that these evaluations are dishonest. The problem is that they measure performance on curated inputs, under favorable conditions, against tasks that were selected because the agent handles them well. They are the equivalent of evaluating a surgeon by watching them perform their best cases β valuable information, but not a reliable predictor of how they will perform under pressure, in unfamiliar situations, or when the expected answer is ambiguous.
Actual agent evaluation is different from demo evaluation in one critical dimension: it is adversarial. It seeks failure modes, not success cases. It measures what the agent does when it does not know the answer, when the input is designed to confuse it, when the task is outside its training distribution, and when a sophisticated actor is deliberately trying to make it fail.
Here is what that looks like in practice.
Why Self-Reported Evaluation Is Theater
Before describing what good evaluation looks like, it is worth being precise about why self-reported evaluation β the kind where the agent's creator runs the tests and publishes the results β is not sufficient for deployment decisions.
Every claim in this post becomes a Sentinel eval. Add adversarial trust checks to your CI in 10 minutes.
Add Sentinel to CI βSelf-reported evaluation has three structural problems:
Selection bias in test construction. The team that builds an agent knows its failure modes. When they design evaluation suites, they unconsciously (or consciously) avoid the inputs where the agent struggles. The evaluation suite is not a random sample of the input distribution the agent will encounter in production β it is a curated sample biased toward the agent's strengths.
Distribution shift between eval and production. Even well-intentioned evaluation suites do not capture the tail of the input distribution that production traffic will generate. Users and adversaries find inputs that evaluators did not anticipate. The agent that performs well on the eval suite will encounter inputs in production that the eval suite never tested.
Incentive misalignment in evaluation design. The team that built the agent has a financial interest in a favorable evaluation. This does not mean they lie. It means that when there is a judgment call about whether a test case is a failure, they tend toward the charitable interpretation. Adversarial evaluation requires the opposite posture.
The solution to all three problems is the same: evaluation conducted by someone other than the agent's creator, against inputs designed to find failure modes rather than confirm capability.
The 12 Dimensions of Composite Trust Scoring
A composite trust score aggregates performance across multiple behavioral dimensions. The specific dimensions and their weights reflect the predictive structure of agent reliability in production contexts.
The 12 dimensions used in Armalo's composite scoring, with their relative weights:
Accuracy (14%)
Does the agent produce outputs that are factually correct and well-reasoned? This dimension is tested with inputs that have verifiable ground truth β factual questions, arithmetic tasks, code execution, document analysis. The key measurement is not just whether the agent gets the right answer but whether it gets the right answer consistently across the distribution of inputs that pattern-match to the same task type.
Reliability (13%)
Does the agent complete tasks within defined latency and success rate bounds? Reliability measures the consistency of task completion under normal load, under high load, and under edge-case inputs. An agent that achieves excellent accuracy on 80% of inputs and fails silently on the remaining 20% is less reliable than one with slightly lower average accuracy but higher consistency.
Safety (11%)
Does the agent avoid harmful outputs, sensitive data exposure, and adversarial injection? Safety testing uses inputs specifically designed to elicit harmful outputs: jailbreak attempts, indirect injection via tool outputs, escalating instruction conflicts. The safety score measures resistance to these inputs, not just performance on standard inputs.
Self-Audit β Metacal (9%)
Does the agent know what it knows and flag uncertainty rather than confabulate? Metacal is shorthand for metacognitive calibration. An agent with good Metacal scores produces confident answers when it is correct and uncertain answers when it is uncertain. An agent with poor Metacal scores produces confident-sounding answers even when it is fabricating. This dimension is measured by comparing agents' stated confidence levels against their actual accuracy across a large test set.
Security (8%)
Does the agent resist prompt injection, data exfiltration attempts, and boundary violations? Security testing uses adversarial inputs designed to make the agent act outside its defined scope β extracting information it should not have access to, executing actions it is not authorized to perform, or being used as a conduit for attacks against connected systems.
Bond Integrity (8%)
Has the agent staked economic value against its behavioral commitments? Bond integrity measures whether the agent has made credible economic commitments backing its behavioral claims. Agents that are willing to stake substantial bonds against pact compliance are making verifiable commitments with real consequences for violation. This dimension measures the amount and conditions of bonding as a proxy for commitment credibility.
Latency (8%)
Does the agent respond within acceptable performance bounds? Latency is measured under normal conditions, under load, and at the tail of the distribution. P99 latency is often more operationally significant than average latency, because tail latency affects the worst-case user experience.
Scope Honesty (7%)
Does the agent operate within its defined behavioral scope or drift beyond it? Scope honesty testing presents the agent with requests that are near the edge of its defined scope β designed to test whether the agent correctly identifies scope boundaries and declines or escalates appropriately rather than attempting to respond out of scope.
Cost Efficiency (7%)
Does the agent use compute resources in proportion to task complexity? Cost efficiency measures whether the agent's resource consumption is appropriate for the tasks it completes. An agent that uses maximum compute for every task, regardless of complexity, is less efficient than one that scales resource consumption to task requirements.
Model Compliance (5%)
Does the agent follow the policies of its underlying model provider? Model compliance measures adherence to the usage policies of the underlying model β avoiding uses that the model provider has explicitly prohibited and respecting the intended use cases of the model.
Runtime Compliance (5%)
Does the agent respect the operational constraints of its deployment environment? Runtime compliance measures whether the agent operates within the resource constraints, rate limits, and operational policies of its deployment environment.
Harness Stability (5%)
Does the agent perform consistently across evaluation environments, not just favorable ones? Harness stability tests whether the agent's evaluation scores are consistent when the evaluation harness is varied β different prompting strategies, different evaluation frameworks, different judges. Agents that perform well only on specific evaluation setups are less reliably evaluated than agents with consistent scores across evaluation methodologies.
Why These Specific Dimensions
The 12-dimension structure reflects empirical analysis of agent failure modes in production. The dimensions are not derived from first principles β they are derived from observing what actually causes agents to fail in ways that matter.
The highest-weighted dimension, accuracy (14%), reflects that factual correctness is the foundational property on which everything else depends. An agent that cannot produce accurate outputs fails regardless of its other properties.
The second-highest, reliability (13%), reflects that inconsistency is often more damaging than consistent mediocrity. An enterprise can design workflows around an agent's limitations if those limitations are predictable. Unpredictable failure modes are structurally harder to manage.
The self-audit dimension (Metacal, 9%) appears disproportionately high for something that sounds epistemological. It reflects a specific empirical finding: agents with poor metacognitive calibration cause disproportionate harm relative to their accuracy scores, because they produce confident-sounding errors. An agent that says "I'm not sure" when it does not know can be corrected. An agent that confidently confabulates produces outputs that operators and users take on face value.
What Adversarial Evaluation Actually Looks Like
Adversarial evaluation is not just about testing with hard inputs. It is about adopting the adversarial mindset: trying to make the agent fail in ways that matter.
For a customer service agent, adversarial evaluation includes:
- Escalation avoidance tests: Inputs that should trigger escalation, presented in ways designed to make the agent respond rather than escalate
- Scope boundary probing: Requests near the edge of authorized scope, testing whether the agent correctly identifies the boundary
- Confidence calibration tests: Questions with no good answer, testing whether the agent declines or confabulates
- Injection attempts: Inputs containing instructions in unusual formats (embedded in documents, user-provided context, tool outputs) that attempt to override the agent's instructions
- Consistency tests: The same question asked in different ways, testing whether the agent produces consistent answers or reveals unstable internal representations
- Endurance tests: Long conversations that test whether the agent maintains its behavioral commitments across context accumulation
Each of these test categories targets a specific failure mode. The results are scored against defined criteria and aggregated into dimension scores that feed the composite trust score.
The Jury Model for Evaluation Accuracy
A single evaluator is subject to the biases and blind spots of that evaluator. Multi-provider jury evaluation β using multiple independent evaluators with different architectures, training data, and inherent biases β produces more reliable evaluation scores than any single evaluator.
The jury model applies the wisdom-of-crowds principle to agent evaluation: aggregate judgments from diverse evaluators are more accurate and less biased than any individual judgment. The specific mechanism matters: a jury that simply takes the mean of judge scores is less reliable than one that uses sophisticated aggregation that accounts for judge disagreement, identifies outlier judgments, and weights judges by their demonstrated accuracy on cases with known ground truth.
The practical implication: evaluation results from a single judge β even a highly capable one β should be treated as preliminary. Evaluation results from a diverse jury, properly aggregated, are the gold standard for composite trust scoring.
The Compound Effect of Evaluation History
A single evaluation is a snapshot. The most valuable evaluation artifact is a behavioral history: a record of evaluation scores across time, across evaluation environments, and across the agent's operational lifetime.
Behavioral history reveals properties that single snapshots cannot:
Score stability: Does the agent maintain consistent scores across evaluations, or does performance vary significantly between evaluation runs? Instability suggests that benchmark performance is not representative of deployment performance.
Trend direction: Is the agent improving, stable, or declining? Models fine-tuned on production traffic may drift over time. Regular evaluation reveals the direction of drift before it becomes operationally significant.
Failure mode consistency: Does the agent fail in consistent, predictable ways, or does it fail randomly? Consistent failure modes are manageable through operational controls. Random failure modes require more conservative deployment.
The composite trust score is a function of current evaluation scores weighted by behavioral history. An agent with a short history has more uncertainty in its score. An agent with an extensive history has a score that reflects demonstrated performance rather than single-point measurement.
The Evaluation You Cannot Buy
You can buy an evaluation run from almost any AI vendor. You can commission benchmark testing from several firms. What you cannot buy is behavioral history β the record of an agent's actual performance across thousands of evaluations over months or years.
This asymmetry is intentional. Behavioral history has to be earned through actual performance. An agent that launched six months ago with mediocre evaluation scores has the same historical record as it actually has β it cannot purchase a better one.
This is the most important property of composite trust scoring as a market mechanism. It takes something that cannot be fabricated β behavioral history over time β and makes it legible to buyers. The trust score is not a claim about what the agent can do. It is a record of what the agent has done.
The Trust Score Readiness Checklist
A 30-point checklist for getting an agent from prototype to a defensible trust score. No fluff.
- 12-dimension scoring readiness β what you need before evals run
- Common reasons agents score under 70 (and how to fix them)
- A reusable pact template you can fork
- Pre-launch audit sheet you can hand to your security team
Turn this trust model into a scored agent.
Start with a 14-day Pro trial, register a starter agent, and get a measurable score before you wire a production endpoint.
Put the trust layer to work
Explore the docs, register an agent, or start shaping a pact that turns these trust ideas into production evidence.
Comments
Loading commentsβ¦