Hermes Agent Benchmark: Buyer and Procurement Guide
Procurement teams evaluating AI agents face a benchmark landscape built for researchers, not buyers. This guide covers what Hermes benchmarks actually measure, 15+ RFP questions that expose leaderboard theater, how to run pass^k reliability tests, and what a trustworthy vendor submission looks like.
Who This Guide Is For
You are evaluating AI agents for enterprise deployment. Vendors are sending you benchmark numbers. Some of those numbers are real. Some are artifacts of public answer leakage, test set contamination, or cherry-picked conditions that would never survive contact with your actual workflow.
This guide will not help you become a benchmark researcher. It will help you procure agents without getting fooled by leaderboard theater, and build a procurement process that holds up when your CISO, your legal team, or your operations team asks hard questions six months after go-live.
The benchmark ecosystem we're navigating here is built around Hermes Agent, Nous Research's open-source self-improving agent framework (GitHub: NousResearch/hermes-agent). Three benchmark tracks have emerged as the relevant signals for enterprise procurement. Understanding what each measures, and where each can be gamed, is the foundation of everything that follows.
Part 1: What the Benchmarks Actually Measure
TBLite: Fast Screening, Not Final Verdict
Terminal-Bench Lite (TBLite) runs 100 Docker-containerized tasks. Each task runs in an isolated environment, which reduces contamination risk compared to benchmarks that touch shared state. TBLite scores are fast to produce and reasonably reproducible, which makes them useful for initial vendor screening.
What TBLite does not tell you: how the agent performs on long-horizon tasks, how it handles ambiguous instructions, how it degrades under cost pressure, or whether it was trained on variants of these specific tasks. Use TBLite to cut your vendor list; do not use it to make a final selection.
YC-Bench: Long-Horizon Strategic Reasoning with Cost as a First-Class Signal
YC-Bench (arXiv 2604.01212) is the benchmark most relevant to enterprise buyers, for one reason that almost no vendor will highlight in their submission: it builds adversarial clients into the test design. One in three client interactions in YC-Bench is adversarial. That is not an edge case; it is a structural feature of the benchmark, designed to test whether agents hold their behavior under pressure.
The results from YC-Bench surface a cost asymmetry that procurement teams must internalize. Claude Opus 4.6 achieved $1.27M in the benchmark's strategic outcome metric. GLM-5 achieved $1.21M at 11× lower cost. Only 3 of 12 models tested exceeded $200K. The performance distribution is not a gentle curve: most agents fail badly on long-horizon tasks regardless of how their single-task benchmark scores look.
The cost-adjusted winner matters here because agents in production make up to 2,000 API calls per task. A 50× variation in cost ($0.10 to $5.00) for similar accuracy levels is not an edge case; it is the normal range of the market. Vendors who omit cost from benchmark reporting are hiding the number that will determine whether their product is financially viable in your environment.
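To see what that per-task range means at volume, a back-of-envelope projection helps. The sketch below is illustrative only: the monthly task volume is a hypothetical figure, and the two per-task costs are simply the ends of the $0.10 to $5.00 range cited above.

```python
def monthly_cost(cost_per_task: float, tasks_per_month: int) -> float:
    """Projected monthly spend for a given per-task cost and task volume."""
    return cost_per_task * tasks_per_month

# Hypothetical volume; substitute your own forecast.
TASKS_PER_MONTH = 50_000

low = monthly_cost(0.10, TASKS_PER_MONTH)   # cheap end of the cited range
high = monthly_cost(5.00, TASKS_PER_MONTH)  # expensive end of the cited range

print(f"low:   ${low:,.0f}/month")
print(f"high:  ${high:,.0f}/month")
print(f"ratio: {high / low:.0f}x")
```

The ratio is independent of volume, but the absolute difference is what a budget owner will ask about, so it is worth running with your own numbers.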
Terminal-Bench 2.0: High-Fidelity CLI Verification
Terminal-Bench 2.0 (arXiv 2601.11868) covers 89 manually verified CLI tasks, each reviewed by three independent human evaluators. This is a higher-integrity methodology than automated scoring, which means TB2.0 scores are harder to inflate through answer leakage or evaluation hacking. Claude Mythos Preview achieved 82% on TB2.0. That is a meaningful anchor for understanding where frontier performance currently sits for this task type.
TB2.0 is most relevant when your use case involves real terminal operations: infrastructure management, code execution, system administration. If your agent will never touch a CLI, TB2.0 is informative background but not your primary procurement signal.
Part 2: Why Benchmark Numbers Lie (and How)
Before you write your RFP, you need to understand the documented failure modes in the benchmark ecosystem. These are not theoretical concerns.
Contamination at Scale
Berkeley RDI research published in 2026 found that GAIA is approximately 98% exploitable through public answer leakage. WebArena is approximately 100% exploitable. OSWorld sits at 73%. The exploitation methods are not exotic: public answer leakage, configuration leakage, DOM injection, VM state manipulation.
What this means for procurement: a vendor who leads their submission with a GAIA score is either unaware of this contamination research or hoping you are. GAIA as a sole evaluation signal is close to meaningless for distinguishing real capability from memorized answers.
For reference: human performance on GAIA is 92% (arXiv 2311.12983). GPT-4 with plugins scored 15% on the same benchmark when it was first published, before the benchmark answers became widely distributed. The gap between human and agent performance on tasks humans find obvious is a signal about generalization that benchmark leaderboards obscure.
The SWE-bench Validity Problem
SWE-bench, the most widely cited coding agent benchmark, has documented validity issues: 7.7% of Lite tasks and 5.2% of Verified tasks have test validity problems. Claude Opus 4.7 achieves 87.6% on SWE-bench Verified, but that number is only meaningful relative to other models on the same benchmark, not as an absolute capability claim.
When vendors cite SWE-bench scores, ask which version (Lite vs. Verified), what test validation methodology was applied, and whether the evaluation was run on a held-out split or the public benchmark set.
WebArena's String Matching Overestimation
WebArena overestimates agent performance by approximately 5.2% due to string matching artifacts in its evaluation methodology. Human performance on WebArena is 78.24%. Agent performance in 2025-26 sits around 60%, a gap of nearly 20 points that is real, but that vendors can narrow artificially through evaluation methodology choices rather than capability improvements.
The Reliability Collapse Nobody Reports
The most important benchmark finding for enterprise procurement is also the least reported: τ-bench (arXiv 2406.12045, Sierra Research) introduces the pass^k metric, which measures whether an agent can pass the same task consistently across k independent trials.
The numbers are damaging. GPT-4o achieves less than 25% pass^8 in retail scenarios. Here is the math that should change your procurement calculus: if trials are independent, an agent with 70% single-pass accuracy has only about a 5.8% probability (0.7^8 ≈ 0.058) of passing the same task in 8 consecutive trials without a failure.
Single-pass accuracy, the number in every vendor benchmark submission, tells you almost nothing about whether the agent will be reliable in production, where the same workflow runs repeatedly. An agent you need to supervise constantly because it fails to complete a clean 8-run streak roughly 94% of the time is not an operational agent; it is a demo.
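The arithmetic behind that collapse is short enough to check yourself. The sketch below assumes every trial is independent and succeeds with the same probability, which is a simplification (real agents often have correlated failures); the function name is illustrative.

```python
def pass_all_k(single_pass_rate: float, k: int) -> float:
    """Probability of k consecutive successes, assuming independent trials."""
    return single_pass_rate ** k

for k in (1, 5, 8):
    print(f"k={k}: {pass_all_k(0.70, k):.1%}")
```

At 70% single-pass accuracy the streak probability drops to roughly 17% at k=5 and under 6% at k=8, which is why a single-pass number on its own says little about repeated workflows.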
Part 3: Pre-RFP Preparation
Before you send an RFP, decide what benchmark evidence you will accept. Vendors will default to submitting whatever makes them look best. Your job is to specify the standard in advance so that evasive submissions are immediately recognizable.
Benchmark Types to Require
Do not accept "our agent performs well on industry benchmarks" as evidence. Specify:
- TBLite or equivalent containerized task benchmark for initial capability screening
- Terminal-Bench 2.0 or equivalent (minimum 3 independent human reviewers per task, n ≥ 50 tasks) for CLI/operational use cases
- YC-Bench or equivalent long-horizon benchmark with adversarial client coverage ≥ 25% of test interactions for any use case involving strategic reasoning, negotiation, or multi-turn workflows
- τ-bench pass^k results (k ≥ 5) for any workflow that runs repeatedly, not just once
Require cost data alongside every performance metric. An accuracy score without a cost-per-task figure is not a complete evaluation.
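One way to make these thresholds hard to dodge is to encode them as a checklist you run against every submission. The sketch below is a minimal illustration, not a standard schema: the `Submission` fields and the `unmet_requirements` helper are hypothetical names, and it applies all thresholds at once even though, as listed above, some only matter for particular use cases.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Submission:
    """Hypothetical summary of one vendor benchmark submission."""
    human_reviewers_per_task: int
    task_count: int
    adversarial_fraction: float          # 0.0 to 1.0
    max_pass_k_reported: Optional[int]   # largest k reported, or None
    cost_per_task_usd: Optional[float]   # None if the vendor gave no cost data

def unmet_requirements(s: Submission) -> list[str]:
    """Return the thresholds from the list above that the submission misses."""
    problems = []
    if s.human_reviewers_per_task < 3:
        problems.append("fewer than 3 human reviewers per task")
    if s.task_count < 50:
        problems.append("fewer than 50 tasks")
    if s.adversarial_fraction < 0.25:
        problems.append("adversarial coverage below 25%")
    if s.max_pass_k_reported is None or s.max_pass_k_reported < 5:
        problems.append("no pass^k result with k >= 5")
    if s.cost_per_task_usd is None:
        problems.append("no cost-per-task figure")
    return problems

# Made-up submission: well reviewed, but thin on adversarial and reliability data.
example = Submission(3, 89, 0.15, None, 0.85)
print(unmet_requirements(example))
```

Publishing the checklist in the RFP itself lets vendors see in advance which gaps will be flagged.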
Task Coverage Requirement
Require that vendor benchmark tasks overlap meaningfully with your actual use case. A vendor who posts strong TBLite numbers on file system tasks when your use case is financial data processing is presenting irrelevant evidence. Build a short task inventory from your actual workflows and ask vendors to map their benchmark results to those task types.
Part 4: 15 RFP Questions That Expose Leaderboard Theater
These questions are designed to be answered with specific numbers and methodology citations, not prose. Evasive answers are red flags.
- "What is your pass^k score (k=5 and k=8) on your claimed primary benchmark?" If they don't know what pass^k is, they haven't thought about reliability.
- "What was the total API cost per task at your reported benchmark performance level? What is the p50 and p95 cost?" Surface the cost distribution, not just the mean.
- "Which benchmark version did you test? What was the evaluation date? Has the test set been publicly released?" Stale benchmark scores on public test sets are contaminated by default.
- "Did your evaluation use a held-out test split, or the public benchmark set?" Public benchmark sets accumulate contamination over time.
- "How many random seeds were used for each evaluation run? What is the performance variance across seeds?" Single-seed results are meaningless: any result can occur once.
- "What fraction of your benchmark tasks involved adversarial inputs, adversarial clients, or active attempts to manipulate agent behavior?" YC-Bench uses 1 in 3 adversarial clients. Ask what fraction yours does.
- "Can you provide the full evaluation methodology, including task selection criteria, evaluator instructions, and inter-rater reliability scores?" The methodology section should be as long as the results section.
- "How does performance degrade as task horizon length increases? What is the success rate at 10 steps, 50 steps, 200 steps?" Long-horizon degradation is the normal failure mode of agents that look good on short tasks.
- "What is your model's performance on tasks outside its training distribution? How was out-of-distribution coverage verified?" If they can't answer this, their benchmark results only tell you about the training distribution.
- "Who conducted the evaluation? Was it internal or third-party? If third-party, who was the evaluator and what were the independence criteria?" First-party evaluations on public benchmarks are almost always optimistic.
- "What is the performance on the subset of benchmark tasks that share characteristics with our specific use case?" Require task-type subgroup analysis.
- "What is the agent's error rate on tasks that require refusing or escalating rather than completing? How was this measured?" Over-confident agents that never refuse are dangerous in production.
- "What does failure look like? Provide examples of the 10 most common failure modes from your benchmark run." Vendors who haven't analyzed their failure modes haven't thought about production.
- "When was your benchmark score produced? What model version was evaluated? What is your policy for re-evaluating when the model changes?" Score decay matters: a benchmark result more than 90 days old for an actively developed model is stale.
- "What is the performance gap between your best single run and your median run? What is the worst single run score?" The worst run is more informative than the best run for production reliability assessment.
Part 5: Red Flags in Vendor Submissions
| Red Flag | Why It Matters |
|---|---|
| GAIA as the sole or primary benchmark | 98% exploitable via answer leakage per Berkeley RDI 2026 |
| No cost data alongside accuracy | Hides 50× cost variation that makes the product unviable at scale |
| Single random seed | Any result can occur once; single-seed variance is uncontrolled |
| No pass^k or reliability data | Single-pass accuracy is a demo metric, not a production metric |
| Benchmark score older than 90 days | Model updates invalidate scores; stale results are not current capability |
| Evaluation on public test set only | Public sets accumulate contamination; held-out splits are required |
| No adversarial coverage | 0% adversarial inputs in a multi-turn benchmark means the vendor never tested adversarial conditions |
| No methodology disclosure | "We evaluated on Benchmark X" without task selection criteria, evaluator instructions, and inter-rater reliability is not a methodology |
| Performance claimed on WebArena without string matching correction | String matching overestimates WebArena by 5.2%; vendor should acknowledge and correct |
| No failure mode analysis | Vendors who can't describe their failure modes haven't analyzed them |
| GEPA claims without independent replication | GEPA claims 40% speedup after 20+ cycles with 35× fewer rollouts; ask for independent replication data, not self-reported results |
| Subgroup analysis missing | Overall accuracy that hides poor performance on your specific task type is selection bias |
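The single-seed red flag in the table above is straightforward to check once a vendor hands over per-seed results. The following sketch summarizes a set of runs into best, median, worst, and spread; the scores are made up for illustration and `summarize_runs` is a hypothetical helper name.

```python
import statistics

def summarize_runs(scores: list[float]) -> dict[str, float]:
    """Summarize per-seed benchmark scores (each score is one full run)."""
    median = statistics.median(scores)
    return {
        "best": max(scores),
        "median": median,
        "worst": min(scores),
        "stdev": statistics.stdev(scores),  # sample standard deviation, needs >= 2 runs
        "best_minus_median": max(scores) - median,
    }

# Made-up scores from five seeds, for illustration only.
runs = [0.82, 0.74, 0.79, 0.61, 0.77]
summary = summarize_runs(runs)
print(summary)
```

If a vendor's headline number matches the best run rather than the median, the submission is reporting a lucky draw.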
Part 6: Verification Protocol
You should not accept a vendor's benchmark claims without independent reproduction. This is not a sign of distrust; it is standard procurement practice for any technical product. Here is a minimum verification protocol.
Step 1: Request the Evaluation Package
Ask the vendor for: the exact model version evaluated, the task set (or a representative sample if the full set is proprietary), the evaluation harness code, the random seeds used, and the raw results before aggregation.
Step 2: Run Spot Checks
Select 10 to 15% of the claimed task set and run them independently. Your goal is not to reproduce the exact score, since environmental differences make exact reproduction unlikely. Your goal is to confirm that performance is in the same range and that failure patterns match what the vendor described.
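To keep the spot-check selection defensible, draw it at random with a recorded seed rather than picking tasks by hand. The sketch below is one way to do that using Python's standard library; the task IDs are placeholders and `spot_check_sample` is a hypothetical helper, not part of any benchmark harness.

```python
import math
import random

def spot_check_sample(task_ids: list[str], fraction: float, seed: int) -> list[str]:
    """Pick a reproducible random subset of tasks for independent re-runs."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    sample_size = max(1, math.ceil(len(task_ids) * fraction))
    rng = random.Random(seed)  # local RNG so the draw is repeatable
    return sorted(rng.sample(task_ids, sample_size))

# Placeholder task IDs standing in for the vendor's claimed task set.
tasks = [f"task-{i:03d}" for i in range(100)]
chosen = spot_check_sample(tasks, fraction=0.15, seed=2024)
print(len(chosen), chosen[:3])
```

Fix the seed before you see the vendor's per-task results, and record it, so that neither side can be accused of choosing favorable tasks.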
Step 3: Test on Your Own Task Inventory
Construct 20 to 30 tasks drawn directly from your target workflow. These should be tasks whose correct answers you already know and that you can evaluate without the vendor's involvement. Run these tasks under the same conditions you expect in production (realistic system prompts, realistic tool availability, realistic latency constraints).
Step 4: Run pass^k Tests
Run each of your proprietary tasks five times with identical inputs and different random seeds. Calculate the pass^5 rate. If the agent achieves 70% single-pass but falls below 40% pass^5, the agent is not reliable enough for unsupervised production use on that task type.
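Scoring those repeated runs can be done with a small helper. The sketch below assumes the combinatorial form of pass^k, where each task with n trials and c successes contributes C(c, k) / C(n, k) and the result is averaged over tasks; with exactly five trials and k=5 this reduces to "did all five runs pass". Task names and outcomes are made up.

```python
from math import comb

def pass_hat_k(results: dict[str, list[bool]], k: int) -> float:
    """Estimate pass^k: the chance that k trials of a task all succeed.

    results maps a task ID to its per-trial outcomes. For each task with
    n trials and c successes, comb(c, k) / comb(n, k) is the fraction of
    k-trial subsets that are all successes; the estimate averages over tasks.
    """
    per_task = []
    for outcomes in results.values():
        n, c = len(outcomes), sum(outcomes)
        if n < k:
            raise ValueError("each task needs at least k trials")
        per_task.append(comb(c, k) / comb(n, k))
    return sum(per_task) / len(per_task)

# Made-up outcomes: four tasks, five trials each.
trials = {
    "invoice-match": [True, True, True, True, True],
    "refund-policy": [True, True, False, True, True],
    "ledger-export": [True, True, True, True, True],
    "vendor-lookup": [False, True, True, False, True],
}
print(f"pass^1 = {pass_hat_k(trials, 1):.2f}")
print(f"pass^5 = {pass_hat_k(trials, 5):.2f}")
```

In this made-up data the agent looks strong on single-pass accuracy (0.85) while only half the tasks survive five consecutive runs, which is the pattern the step above is designed to expose.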
Step 5: Adversarial Probing
For every task type the agent will handle in production, construct at least three adversarial variants: one where the agent is asked to do something it should refuse, one where it receives misleading context, and one where the correct answer requires resisting an authoritative-sounding but incorrect instruction. Document how the agent responds.
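A simple coverage check keeps the probe set honest as it grows. The sketch below assumes each adversarial case is tagged with a task type and one of three variant kinds matching the description above; the tag names and the `missing_variants` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical labels for the three variant kinds described above.
REQUIRED_KINDS = {"should_refuse", "misleading_context", "false_authority"}

def missing_variants(cases: list[dict]) -> dict[str, set[str]]:
    """Map each task type to the adversarial variant kinds it still lacks."""
    seen = defaultdict(set)
    for case in cases:
        seen[case["task_type"]].add(case["kind"])
    return {
        task_type: REQUIRED_KINDS - kinds
        for task_type, kinds in seen.items()
        if REQUIRED_KINDS - kinds
    }

cases = [
    {"task_type": "refunds", "kind": "should_refuse"},
    {"task_type": "refunds", "kind": "misleading_context"},
    {"task_type": "refunds", "kind": "false_authority"},
    {"task_type": "invoicing", "kind": "should_refuse"},
]
gaps = missing_variants(cases)
print(gaps)
```

Note that a task type with no cases at all will not appear in the output, so seed the list from your full task inventory.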
Part 7: Cost-Adjusted Evaluation Matrix
Cost-adjusted performance is the procurement metric that matters. Here is a framework for building a vendor comparison matrix.
| Vendor | Primary Benchmark Score | Cost Per Task (p50) | Cost Per Task (p95) | Pass^5 Rate | Adversarial Coverage | Cost-Adjusted Score |
|---|---|---|---|---|---|---|
| Vendor A | 82% | $0.85 | $3.20 | 61% | 15% | TBD |
| Vendor B | 79% | $0.12 | $0.45 | 68% | 30% | TBD |
| Vendor C | 91% | $4.50 | $18.00 | 72% | 22% | TBD |
The cost-adjusted score is not a simple division. Weight these factors for your use case:
- Accuracy weight: how much does a wrong answer cost you in downstream rework or risk exposure?
- Reliability weight: how much supervision overhead does low pass^k impose? Factor in human-in-the-loop labor cost.
- Adversarial weight: if your environment includes adversarial inputs (disgruntled users, prompt injection attempts, manipulative workflows), adversarial coverage matters more than raw accuracy.
- Cost ceiling: at what per-task cost does the economics of automation break down relative to human alternatives?
The YC-Bench result is instructive: GLM-5 at $1.21M vs. Claude Opus 4.6 at $1.27M, a roughly 5% performance gap at 11× lower cost. In most enterprise workflows, the cost differential matters far more than a 5% accuracy gap. Build your matrix to make this tradeoff explicit rather than hiding it behind a single accuracy number.
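One possible way to fill in the cost-adjusted column is a weighted sum with a cost ceiling. The sketch below uses the three example vendor rows from the matrix; the weights, the $5.00 ceiling, and the scoring formula itself are hypothetical choices for illustration, not a recommended standard.

```python
from dataclasses import dataclass

@dataclass
class VendorRow:
    name: str
    accuracy: float              # primary benchmark score, 0 to 1
    cost_p95_usd: float          # p95 cost per task
    pass5: float                 # pass^5 rate, 0 to 1
    adversarial_coverage: float  # fraction of adversarial test interactions

# Hypothetical weights and cost ceiling; tune these to your own use case.
WEIGHTS = {"accuracy": 0.30, "reliability": 0.30, "adversarial": 0.15, "cost": 0.25}
COST_CEILING_USD = 5.00

def cost_adjusted_score(v: VendorRow) -> float:
    """Weighted sum in [0, 1]; cost earns credit only below the ceiling."""
    cost_headroom = 1.0 - min(v.cost_p95_usd / COST_CEILING_USD, 1.0)
    return (
        WEIGHTS["accuracy"] * v.accuracy
        + WEIGHTS["reliability"] * v.pass5
        + WEIGHTS["adversarial"] * v.adversarial_coverage
        + WEIGHTS["cost"] * cost_headroom
    )

vendors = [
    VendorRow("Vendor A", 0.82, 3.20, 0.61, 0.15),
    VendorRow("Vendor B", 0.79, 0.45, 0.68, 0.30),
    VendorRow("Vendor C", 0.91, 18.00, 0.72, 0.22),
]
ranking = sorted(vendors, key=cost_adjusted_score, reverse=True)
for v in ranking:
    print(f"{v.name}: {cost_adjusted_score(v):.3f}")
```

Under these particular weights the cheapest vendor ranks first despite the lowest raw accuracy, and the most accurate vendor ranks last because its p95 cost exceeds the ceiling. Different weights will produce different rankings, which is the point of making them explicit.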
Part 8: Adversarial Coverage Requirements
YC-Bench's design choice of 1 in 3 adversarial clients is not an academic curiosity. It reflects a reality of production AI agent deployment: some fraction of interactions will be adversarial, either intentionally (users trying to manipulate the agent) or structurally (ambiguous instructions that the agent must resolve without being misled).
For procurement, require vendors to report:
- Adversarial task fraction: what percentage of benchmark tasks included adversarial inputs?
- Adversarial performance gap: how does performance on adversarial tasks compare to clean tasks? A vendor who achieves 85% on clean tasks and 20% on adversarial tasks is reporting a structurally brittle agent.
- Refusal rate on adversarial inputs: does the agent refuse when it should, or does it comply with manipulative instructions?
- Behavioral consistency under pressure: does the agent's behavior change when the adversarial input is persistent vs. one-shot?
Any vendor who cannot report these metrics has not tested their agent for adversarial robustness. Adversarial robustness is not a nice-to-have for enterprise deployment; it is a security property.
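The first two metrics in the list above fall out directly from per-task pass/fail results. The sketch below is a minimal illustration with made-up results chosen to match the 85% versus 20% example; `adversarial_report` is a hypothetical helper name.

```python
def adversarial_report(clean: list[bool], adversarial: list[bool]) -> dict[str, float]:
    """Compute adversarial task fraction and the clean-vs-adversarial gap."""
    clean_rate = sum(clean) / len(clean)
    adversarial_rate = sum(adversarial) / len(adversarial)
    return {
        "adversarial_fraction": len(adversarial) / (len(clean) + len(adversarial)),
        "clean_pass_rate": clean_rate,
        "adversarial_pass_rate": adversarial_rate,
        "gap": clean_rate - adversarial_rate,
    }

# Made-up results: 17 of 20 clean tasks pass, 2 of 10 adversarial tasks pass.
clean_results = [True] * 17 + [False] * 3
adversarial_results = [True] * 2 + [False] * 8
report = adversarial_report(clean_results, adversarial_results)
print(report)
```

Ask vendors for the raw per-task outcomes rather than the summary, so you can recompute these figures yourself.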
Part 9: Evidence Requirements for Contract
Once you select a vendor, the benchmark conversation is not over. Your contract should specify what evidence the vendor must maintain and make auditable post-deployment. Here is the minimum evidence package to negotiate into the contract.
Technical Evidence Requirements
- Monthly benchmark refresh: vendor re-runs agreed benchmark suite monthly on the deployed model version and provides results within 5 business days of each month-end
- Cost reporting: weekly cost-per-task actuals for the previous 7 days, broken down by task type
- Failure log: weekly report of all agent failures, categorized by failure type, with root cause analysis for failures that recur in the same week
- Model change disclosure: 72-hour advance notice before any model version change; benchmark re-run on new version before deployment
- Adversarial coverage: quarterly adversarial test suite run against the deployed agent, with methodology and results disclosed
Operational Evidence Requirements
- Audit trail: every agent action must be logged with a timestamp, input, output, model version, and cost. Audit logs must be retained for 24 months and accessible to your team within 48 hours of request.
- Escalation records: every instance where the agent escalated, refused, or flagged uncertainty must be logged separately for review
- Out-of-scope detection: documentation of what the agent does when presented with tasks outside its stated scope: does it refuse, escalate, or attempt and fail?
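It helps to attach a concrete record shape to the contract so "audit trail" is not left to interpretation. The sketch below shows one possible shape carrying the fields named above plus an outcome field for escalations and refusals; the class, field names, and example values are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class AuditRecord:
    """One logged agent action (hypothetical schema)."""
    timestamp: str      # ISO 8601, UTC
    agent_input: str
    agent_output: str
    model_version: str
    cost_usd: float
    outcome: str        # e.g. "completed", "escalated", "refused"

def to_log_line(record: AuditRecord) -> str:
    """Serialize a record as one JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)

record = AuditRecord(
    timestamp=datetime(2026, 1, 15, 9, 30, tzinfo=timezone.utc).isoformat(),
    agent_input="Reconcile invoice 1042 against the purchase order",
    agent_output="Escalated: line-item totals do not match",
    model_version="example-model-v1",
    cost_usd=0.42,
    outcome="escalated",
)
print(to_log_line(record))
```

Filtering on the outcome field gives you the separate escalation and refusal log without a second logging path.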
Structural Trust Requirements
Benchmarks measure past performance under controlled conditions. They cannot tell you whether the agent has made formal commitments about its future behavior, whether there is any consequence mechanism when it fails to meet those commitments, or whether its reputation is continuous across deployments rather than reset with each new customer.
These are structural trust gaps that benchmarks cannot fill. In your contract, require:
- Behavioral commitments: what specific behaviors has the vendor committed to? Not "the agent is accurate" but "the agent will refuse tasks outside scope X, will escalate when confidence falls below threshold Y, will not take irreversible actions without confirmation for action types Z."
- Consequence mechanism: what happens when the agent violates a behavioral commitment? Is there a contractual remedy? Is there a financial consequence? An agent with no consequence for behavioral failures has no skin in the game.
- Behavioral audit rights: the right to run your own adversarial tests against the deployed agent at any time, with results that the vendor cannot pre-screen
Part 10: Green Flags (What a Trustworthy Vendor Submission Looks Like)
Most vendor submissions will fail multiple tests in this guide. A submission that clears most of them is genuinely unusual and worth treating as a competitive differentiator. Here is what you are looking for:
On benchmarks:
- Reports pass^k at k=5 and k=8, not just single-pass accuracy
- Reports cost data (p50, p95, worst case) alongside every accuracy figure
- Uses held-out test splits, not public test sets
- Tests multiple seeds and reports variance, not just the mean
- Includes adversarial task fraction of at least 25%
- Provides subgroup analysis mapped to your use case task types
- Third-party or independent evaluation, not self-reported
- Benchmark date is current (within 60 days for an actively developed model)
- Failure mode analysis included: not just what works, but what fails and why
On methodology:
- Full methodology disclosure including task selection criteria, evaluator instructions, inter-rater reliability
- Reproducibility package available on request
- Acknowledges known benchmark vulnerabilities (WebArena string matching, GAIA contamination) and explains how they accounted for them
On commitment:
- Behavioral commitments are specific and verifiable, not generic marketing claims
- Consequence mechanism exists for behavioral failures
- Audit rights are contractually guaranteed
- Score refresh cadence is contractually specified
On cost:
- Makes cost-adjusted performance comparison easy, not hard
- Provides cost projections at your expected task volume
- Is transparent about the cost impact of accuracy-cost tradeoffs
Part 11: The Structural Trust Gap Benchmarks Cannot Bridge
Even a perfect benchmark submission from a vendor does not answer the questions that matter most over a multi-year deployment:
- Behavioral pacts: has the agent formally committed to specific behavioral constraints, in a form that persists across model updates?
- Consequence accountability: when the agent fails to honor a behavioral commitment, what happens to the vendor?
- Reputation continuity: does the agent's track record follow it across deployments, or does each new customer start from zero with no visibility into the agent's history?
- Internal workflow coverage: benchmarks measure agent behavior on standardized tasks. They say nothing about how the agent behaves inside your specific internal systems, with your specific data, under your specific constraints.
These are gaps that benchmark scores cannot fill because they are structural properties of how the agent is deployed and governed, not properties that can be measured in a standardized task suite.
Where Armalo Fits
Armalo operates as an independent trust verification layer: not a vendor, not a benchmark publisher, but the infrastructure that makes behavioral commitments verifiable and reputation continuous.
For procurement teams, this means three things that don't exist anywhere in the benchmark ecosystem:
Behavioral pacts: agents registered on Armalo define behavioral commitments: not generic capability claims but specific, auditable promises about what they will and will not do. These pacts are stored on-chain and persist across model updates. When a vendor says their agent "is reliable," Armalo makes that claim inspectable: here is the pact, here is the evidence that it has been honored, here is the record of every deviation.
Runtime evidence and composite scoring: Armalo's 12-dimension composite score covers accuracy, reliability, safety, security, scope-honesty, cost-efficiency, and model compliance, with documented weights. Anti-gaming mechanisms include outlier trimming (top and bottom 20% removed from jury evaluations) and anomaly detection that flags score swings greater than 200 points for review. Score decay of 1 point per week after a 7-day grace period means stale scores are automatically discounted; a vendor cannot coast on a strong evaluation from six months ago.
Trust Oracle: Armalo's public /api/v1/trust/ endpoint lets you query an agent's current trust posture before and during deployment, not just at procurement time. If an agent's behavior degrades after deployment, the trust score changes. You don't need to wait for the vendor's next quarterly benchmark report to find out.
For procurement teams building the evidence requirements described in Part 9, Armalo provides the audit infrastructure that makes behavioral commitments enforceable rather than aspirational.
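The outlier trimming and score decay rules described above are simple enough to sketch. The code below is an interpretation of those two rules as stated in this guide, not Armalo's actual implementation: it assumes decay is applied once per full week after the grace period and that trimming drops the top and bottom 20% of jury scores before averaging. The jury scores are made up.

```python
def trimmed_mean(scores: list[float], trim: float = 0.20) -> float:
    """Average after dropping the top and bottom `trim` share of scores."""
    ordered = sorted(scores)
    cut = int(len(ordered) * trim)
    kept = ordered[cut:len(ordered) - cut]
    return sum(kept) / len(kept)

def decayed_score(score: float, days_since_eval: int, grace_days: int = 7) -> float:
    """Subtract 1 point per full week elapsed after the grace period (assumed)."""
    weeks_past_grace = max(0, days_since_eval - grace_days) // 7
    return max(0.0, score - weeks_past_grace)

# Made-up jury scores, including one low and one high outlier.
jury = [400, 810, 820, 830, 840, 850, 860, 870, 880, 990]
base = trimmed_mean(jury)
print(base, decayed_score(base, 7), decayed_score(base, 35))
```

Even if a vendor's scoring system differs in detail, asking them to state their trimming and decay rules at this level of precision is a reasonable request.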
Bottom Line
Leaderboard theater is cheap to produce and expensive to discover after deployment. The procurement process described here requires more work from vendors, which is exactly the point. Vendors who have done the work will find these questions easy to answer. Vendors who have not will stall, deflect, or provide incomplete responses.
Run the pass^k tests. Require cost data. Specify adversarial coverage fractions. Ask for failure mode analysis. Require a held-out test split. Negotiate behavioral commitments and consequence mechanisms into the contract.
The benchmark ecosystem will keep improving. Contamination vulnerabilities will be patched. New evaluation methodologies will emerge. But the structural gap between "this agent scored well on a standardized task" and "this agent will behave reliably in my production environment under real conditions with real consequences" will remain. Fill that gap with behavioral commitments, runtime evidence, and continuous reputation tracking, not just benchmark scores.