Hermes Agent Benchmark: Buyer and Procurement Guide

2026-04-14 · 18 min · Armalo Team

Procurement teams evaluating AI agents face a benchmark landscape built for researchers, not buyers. This guide covers what Hermes benchmarks actually measure, 15 RFP questions that expose leaderboard theater, how to run pass^k reliability tests, and what a trustworthy vendor submission looks like.

Who This Guide Is For

You are evaluating AI agents for enterprise deployment. Vendors are sending you benchmark numbers. Some of those numbers are real. Some are artifacts of public answer leakage, test set contamination, or cherry-picked conditions that would never survive contact with your actual workflow.

This guide will not help you become a benchmark researcher. It will help you procure agents without getting fooled by leaderboard theater — and build a procurement process that holds up when your CISO, your legal team, or your operations team asks hard questions six months after go-live.

The benchmark ecosystem we're navigating here is built around Hermes Agent, Nous Research's open-source self-improving agent framework (GitHub: NousResearch/hermes-agent). Three benchmark tracks have emerged as the relevant signals for enterprise procurement. Understanding what each measures — and where each can be gamed — is the foundation of everything that follows.

Part 1: What the Benchmarks Actually Measure

TBLite — Fast Screening, Not Final Verdict

Terminal-Bench Lite runs 100 Docker-containerized tasks. Each task runs in an isolated environment, which reduces contamination risk compared to benchmarks that touch shared state. TBLite scores are fast to produce and reasonably reproducible, which makes them useful for initial vendor screening.

What TBLite does not tell you: how the agent performs on long-horizon tasks, how it handles ambiguous instructions, how it degrades under cost pressure, or whether it was trained on variants of these specific tasks. Use TBLite to cut your vendor list — do not use it to make a final selection.

YC-Bench — Long-Horizon Strategic Reasoning with Cost as a First-Class Signal

YC-Bench (arXiv 2604.01212) is the benchmark most relevant to enterprise buyers, for one reason that almost no vendor will highlight in their submission: it builds adversarial clients into the test design. One in three client interactions in YC-Bench is adversarial. That is not an edge case — that is a structural feature of the benchmark designed to test whether agents hold their behavior under pressure.

The results from YC-Bench surface a cost asymmetry that procurement teams must internalize. Claude Opus 4.6 achieved $1.27M in the benchmark's strategic outcome metric. GLM-5 achieved $1.21M — at 11× lower cost. Only 3 of 12 models tested exceeded $200K. The performance distribution is not a gentle curve — most agents fail badly on long-horizon tasks regardless of how their single-task benchmark scores look.

The cost-adjusted winner matters here because agents in production make up to 2,000 API calls per task. A 50× variation in cost ($0.10 to $5.00) for similar accuracy levels is not an edge case — it is the normal range of the market. Vendors who omit cost from benchmark reporting are hiding the number that will determine whether their product is financially viable in your environment.

Terminal-Bench 2.0 — High-Fidelity CLI Verification

Terminal-Bench 2.0 (arXiv 2601.11868) covers 89 manually verified CLI tasks, each reviewed by three independent human evaluators. This is a higher-integrity methodology than automated scoring, which means TB2.0 scores are harder to inflate through answer leakage or evaluation hacking. Claude Mythos Preview achieved 82% on TB2.0. That is a meaningful anchor for understanding where frontier performance currently sits for this task type.

TB2.0 is most relevant when your use case involves real terminal operations: infrastructure management, code execution, system administration. If your agent will never touch a CLI, TB2.0 is informative background but not your primary procurement signal.

Part 2: Why Benchmark Numbers Lie (and How)

Before you write your RFP, you need to understand the documented failure modes in the benchmark ecosystem. These are not theoretical concerns.

Contamination at Scale

Berkeley RDI research published in 2026 found that GAIA is approximately 98% exploitable through public answer leakage. WebArena is approximately 100% exploitable. OSWorld sits at 73%. The exploitation methods are not exotic: public answer leakage, configuration leakage, DOM injection, VM state manipulation.

What this means for procurement: a vendor who leads their submission with a GAIA score is either unaware of this contamination research or hoping you are. GAIA as a sole evaluation signal is close to meaningless for distinguishing real capability from memorized answers.

For reference: human performance on GAIA is 92% (arXiv 2311.12983). GPT-4 with plugins scored 15% on the same benchmark when it was first published — before the benchmark answers became widely distributed. The gap between human and agent performance on tasks humans find obvious is a signal about generalization that benchmark leaderboards obscure.

The SWE-bench Validity Problem

SWE-bench, the most widely cited coding agent benchmark, has documented validity issues: 7.7% of Lite tasks and 5.2% of Verified tasks have test validity problems. Claude Opus 4.7 achieves 87.6% on SWE-bench Verified — but that number is only meaningful relative to other models on the same benchmark, not as an absolute capability claim.

When vendors cite SWE-bench scores, ask which version (Lite vs. Verified), what test validation methodology was applied, and whether the evaluation was run on a held-out split or the public benchmark set.

WebArena's String Matching Overestimation

WebArena overestimates agent performance by approximately 5.2% due to string matching artifacts in its evaluation methodology. Human performance on WebArena is 78.24%. Agent performance in 2025-26 sits around 60%, a gap of roughly 18 points that is real, but that vendors can narrow artificially through evaluation methodology choices rather than capability improvements.

The Reliability Collapse Nobody Reports

The most important benchmark finding for enterprise procurement is also the least reported: τ-bench (arXiv 2406.12045, Sierra Research) introduces the pass^k metric, which measures whether an agent can pass the same task consistently across k independent trials.

The numbers are damaging. GPT-4o achieves less than 25% pass^8 in retail scenarios. Here is the math that should change your procurement calculus: if trials are independent, an agent with 70% single-pass accuracy has only a 5.8% probability (0.7^8 ≈ 0.058) of passing the same task in 8 consecutive trials without a failure.

Single-pass accuracy, the number in every vendor benchmark submission, tells you almost nothing about whether the agent will be reliable in production, where the same workflow runs repeatedly. An agent you need to supervise constantly because an 8-run sequence contains at least one failure 94.2% of the time is not an operational agent; it is a demo.
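The arithmetic behind the 70% example can be checked in a few lines. This is a minimal sketch that assumes every trial is independent with the same success probability; the function name is illustrative and not part of τ-bench.

```python
def all_pass_probability(single_pass_rate: float, k: int) -> float:
    """Probability that k independent trials of the same task all succeed."""
    if not 0.0 <= single_pass_rate <= 1.0:
        raise ValueError("single_pass_rate must be between 0 and 1")
    if k < 1:
        raise ValueError("k must be at least 1")
    return single_pass_rate ** k


# Assumption: a fixed 70% per-trial success rate, as in the example above.
for k in (1, 5, 8):
    print(f"k={k}: {all_pass_probability(0.70, k):.1%}")
```

For k=8 this gives 0.7^8 ≈ 0.0576, and for k=5 it gives about 0.168, which is why a respectable single-pass number can still mean most repeated workflows hit a failure.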

Part 3: Pre-RFP Preparation

Before you send an RFP, decide what benchmark evidence you will accept. Vendors will default to submitting whatever makes them look best. Your job is to specify the standard in advance so that evasive submissions are immediately recognizable.

Benchmark Types to Require

Do not accept "our agent performs well on industry benchmarks" as evidence. Specify:

TBLite or equivalent containerized task benchmark for initial capability screening
Terminal-Bench 2.0 or equivalent (minimum 3 independent human reviewers per task, n ≥ 50 tasks) for CLI/operational use cases
YC-Bench or equivalent long-horizon benchmark with adversarial client coverage ≥ 25% of test interactions for any use case involving strategic reasoning, negotiation, or multi-turn workflows
τ-bench pass^k results (k ≥ 5) for any workflow that runs repeatedly, not just once

Require cost data alongside every performance metric. An accuracy score without a cost-per-task figure is not a complete evaluation.
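The acceptance criteria above are mechanical enough to encode as a checklist. The sketch below is one way to do that; the `BenchmarkEvidence` schema, field names, and the sample submission are hypothetical, and only the thresholds (3 reviewers, 50 tasks, 25% adversarial coverage, k ≥ 5, cost required) come from this section.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BenchmarkEvidence:
    """One vendor-reported benchmark result (hypothetical schema)."""
    name: str
    task_count: int
    reviewers_per_task: int
    adversarial_fraction: float         # share of interactions, 0.0 to 1.0
    max_pass_k: Optional[int]           # largest k with a reported pass^k
    cost_per_task_usd: Optional[float]  # None if the vendor omitted cost


def evidence_gaps(e: BenchmarkEvidence, *, cli: bool, multi_turn: bool,
                  repeated: bool) -> List[str]:
    """Return the requirements from the list above that the evidence misses."""
    gaps = []
    if e.cost_per_task_usd is None:
        gaps.append("missing cost-per-task figure")
    if cli and e.reviewers_per_task < 3:
        gaps.append("fewer than 3 human reviewers per task")
    if cli and e.task_count < 50:
        gaps.append("fewer than 50 tasks")
    if multi_turn and e.adversarial_fraction < 0.25:
        gaps.append("adversarial coverage below 25%")
    if repeated and (e.max_pass_k is None or e.max_pass_k < 5):
        gaps.append("no pass^k result with k >= 5")
    return gaps


# Made-up submission used only to exercise the checks.
submission = BenchmarkEvidence("vendor-x CLI suite", task_count=40,
                               reviewers_per_task=3, adversarial_fraction=0.10,
                               max_pass_k=None, cost_per_task_usd=None)
for gap in evidence_gaps(submission, cli=True, multi_turn=True, repeated=True):
    print("GAP:", gap)
```

The three boolean flags describe your use case, so a requirement only counts as a gap when it applies to the workflow you are actually buying for.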

Task Coverage Requirement

Require that vendor benchmark tasks overlap meaningfully with your actual use case. A vendor who posts strong TBLite numbers on file system tasks when your use case is financial data processing is presenting irrelevant evidence. Build a short task inventory from your actual workflows and ask vendors to map their benchmark results to those task types.

Part 4: 15 RFP Questions That Expose Leaderboard Theater

These questions are designed to be answered with specific numbers and methodology citations, not prose. Evasive answers are red flags.

"What is your pass^k score (k=5 and k=8) on your claimed primary benchmark?" If they don't know what pass^k is, they haven't thought about reliability.
"What was the total API cost per task at your reported benchmark performance level? What is the p50 and p95 cost?" Surface the cost distribution, not just the mean.
"Which benchmark version did you test? What was the evaluation date? Has the test set been publicly released?" Stale benchmark scores on public test sets are contaminated by default.
"Did your evaluation use a held-out test split, or the public benchmark set?" Public benchmark sets accumulate contamination over time.
"How many random seeds were used for each evaluation run? What is the performance variance across seeds?" Single-seed results are meaningless — any result can occur once.
"What fraction of your benchmark tasks involved adversarial inputs, adversarial clients, or active attempts to manipulate agent behavior?" YC-Bench uses 1 in 3 adversarial clients. Ask what fraction yours does.
"Can you provide the full evaluation methodology, including task selection criteria, evaluator instructions, and inter-rater reliability scores?" The methodology section should be as long as the results section.
"How does performance degrade as task horizon length increases? What is the success rate at 10 steps, 50 steps, 200 steps?" Long-horizon degradation is the normal failure mode of agents that look good on short tasks.
"What is your model's performance on tasks outside its training distribution? How was out-of-distribution coverage verified?" If they can't answer this, their benchmark results only tell you about the training distribution.
"Who conducted the evaluation? Was it internal or third-party? If third-party, who was the evaluator and what were the independence criteria?" First-party evaluations on public benchmarks are almost always optimistic.
"What is the performance on the subset of benchmark tasks that share characteristics with our specific use case?" Require task-type subgroup analysis.
"What is the agent's error rate on tasks that require refusing or escalating rather than completing? How was this measured?" Over-confident agents that never refuse are dangerous in production.
"What does failure look like? Provide examples of the 10 most common failure modes from your benchmark run." Vendors who haven't analyzed their failure modes haven't thought about production.
"When was your benchmark score produced? What model version was evaluated? What is your policy for re-evaluating when the model changes?" Score decay matters — a benchmark result more than 90 days old for an actively developed model is stale.
"What is the performance gap between your best single run and your median run? What is the worst single run score?" The worst run is more informative than the best run for production reliability assessment.
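Several of the questions above ask for distributions rather than single numbers. A minimal sketch of the summaries worth computing from a vendor's raw results, using only the Python standard library; the function names and the sample figures are made up for illustration.

```python
import statistics


def cost_summary(costs_usd):
    """Mean, p50 and p95 of per-task cost; needs at least two observations."""
    # n=20 yields 19 cut points at 5% steps; index 18 is the 95th percentile.
    cuts = statistics.quantiles(costs_usd, n=20, method="inclusive")
    return {
        "mean": statistics.fmean(costs_usd),
        "p50": statistics.median(costs_usd),
        "p95": cuts[18],
    }


def seed_summary(scores):
    """Spread of one benchmark score across independent seeds."""
    return {
        "best": max(scores),
        "median": statistics.median(scores),
        "worst": min(scores),
        "stdev": statistics.stdev(scores),
    }


# Hypothetical data: 19 cheap tasks and one expensive outlier.
costs = [0.10] * 19 + [5.00]
print(cost_summary(costs))

# Hypothetical accuracy from five seeds of the same evaluation.
print(seed_summary([0.82, 0.79, 0.64, 0.80, 0.81]))
```

In the cost sample the mean is more than three times the median, which is the kind of skew a mean-only report conceals.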

Part 5: Red Flags in Vendor Submissions

Red Flag	Why It Matters
GAIA as the sole or primary benchmark	98% exploitable via answer leakage per Berkeley RDI 2026
No cost data alongside accuracy	Hides 50× cost variation that makes the product unviable at scale
Single random seed	Any result can occur once; single-seed variance is uncontrolled
No pass^k or reliability data	Single-pass accuracy is a demo metric, not a production metric
Benchmark score older than 90 days	Model updates invalidate scores; stale results are not current capability
Evaluation on public test set only	Public sets accumulate contamination; held-out splits are required
No adversarial coverage	0% adversarial inputs in a multi-turn benchmark means the vendor never tested adversarial conditions
No methodology disclosure	"We evaluated on Benchmark X" without task selection criteria, evaluator instructions, and inter-rater reliability is not a methodology
Performance claimed on WebArena without string matching correction	String matching overestimates WebArena by 5.2%; vendor should acknowledge and correct
No failure mode analysis	Vendors who can't describe their failure modes haven't analyzed them
GEPA claims without independent replication	GEPA claims 40% speedup after 20+ cycles with 35× fewer rollouts — ask for independent replication data, not self-reported results
Subgroup analysis missing	Overall accuracy that hides poor performance on your specific task type is selection bias

Part 6: Verification Protocol

You should not accept a vendor's benchmark claims without independent reproduction. This is not a sign of distrust — it is standard procurement practice for any technical product. Here is a minimum verification protocol.

Step 1: Request the Evaluation Package

Ask the vendor for: the exact model version evaluated, the task set (or a representative sample if the full set is proprietary), the evaluation harness code, the random seeds used, and the raw results before aggregation.

Step 2: Run Spot Checks

Select 10–15% of the claimed task set and run them independently. Your goal is not to reproduce the exact score — environmental differences make exact reproduction unlikely. Your goal is to confirm that performance is in the same range and that failure patterns match what the vendor described.
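Drawing the spot-check sample yourself, with a recorded seed, keeps the vendor from steering which tasks get re-run and lets you reproduce the selection later. A minimal sketch; the task identifiers and the function name are hypothetical.

```python
import math
import random


def spot_check_sample(task_ids, fraction=0.10, seed=0):
    """Pick a reproducible random subset of tasks to re-run independently."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    task_ids = list(task_ids)
    if not task_ids:
        raise ValueError("task_ids is empty")
    size = max(1, math.ceil(len(task_ids) * fraction))
    rng = random.Random(seed)  # private generator, so the draw is repeatable
    return sorted(rng.sample(task_ids, size))


# Hypothetical task set of 100 items, sampled at the 10% lower bound.
tasks = [f"task-{i:03d}" for i in range(100)]
print(spot_check_sample(tasks, fraction=0.10, seed=2026))
```

Record the seed alongside the results so the vendor can verify that the sample was not cherry-picked in the other direction.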

Step 3: Test on Your Own Task Inventory

Construct 20–30 tasks drawn directly from your target workflow. These should be tasks you know the correct answer to and can evaluate without the vendor's involvement. Run these tasks under the same conditions you expect in production (realistic system prompts, realistic tool availability, realistic latency constraints).

Step 4: Run pass^k Tests

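Run each of your proprietary tasks five times with identical inputs and different random seeds. Calculate the pass^5 rate: the fraction of tasks that pass all five runs. If the agent achieves 70% single-pass accuracy but falls below 40% pass^5, it is not reliable enough for unsupervised production use on that task type. For calibration, if failures were independent across runs, 70% single-pass accuracy would imply a pass^5 of only about 17%, so an agent clears the 40% bar only when its failures concentrate in a minority of tasks rather than striking at random.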
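Once the repeated runs are recorded, pass^k can be computed directly from the pass/fail outcomes. The sketch below uses the per-task estimator C(c, k) / C(n, k) described in the τ-bench paper, where n is the number of trials and c the number of successes; with exactly k trials per task it reduces to "did every run pass". The task names and outcomes are invented.

```python
from math import comb
from typing import Dict, List


def pass_hat_k(results: Dict[str, List[bool]], k: int) -> float:
    """Average over tasks of the chance that k sampled trials all passed."""
    if not results:
        raise ValueError("no results supplied")
    per_task = []
    for task_id, outcomes in results.items():
        n, c = len(outcomes), sum(outcomes)
        if n < k:
            raise ValueError(f"{task_id} has {n} trials, need at least {k}")
        # comb(c, k) is 0 whenever fewer than k trials succeeded.
        per_task.append(comb(c, k) / comb(n, k))
    return sum(per_task) / len(per_task)


# Hypothetical outcomes: five seeded runs for each of four tasks.
runs = {
    "invoice-match": [True, True, True, True, True],
    "refund-policy": [True, True, False, True, True],
    "vendor-lookup": [True, True, True, True, True],
    "ledger-export": [False, False, False, False, False],
}
print(f"pass^1 = {pass_hat_k(runs, 1):.0%}")
print(f"pass^5 = {pass_hat_k(runs, 5):.0%}")
```

In this sample the agent is 70% accurate on a single pass but only half of the tasks survive five runs, and the per-task breakdown shows which workflows need supervision.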

Step 5: Adversarial Probing

For every task type the agent will handle in production, construct at least three adversarial variants: one where the agent is asked to do something it should refuse, one where it receives misleading context, and one where the correct answer requires resisting an authoritative-sounding but incorrect instruction. Document how the agent responds.

Part 7: Cost-Adjusted Evaluation Matrix

Cost-adjusted performance is the procurement metric that matters. Here is a framework for building a vendor comparison matrix.

Vendor	Primary Benchmark Score	Cost Per Task (p50)	Cost Per Task (p95)	Pass^5 Rate	Adversarial Coverage	Cost-Adjusted Score
Vendor A	82%	$0.85	$3.20	61%	15%	—
Vendor B	79%	$0.12	$0.45	68%	30%	—
Vendor C	91%	$4.50	$18.00	72%	22%	—

The cost-adjusted score is not a simple division. Weight these factors for your use case:

Accuracy weight: how much does a wrong answer cost you in downstream rework or risk exposure?
Reliability weight: how much supervision overhead does low pass^k impose? Factor in human-in-the-loop labor cost.
Adversarial weight: if your environment includes adversarial inputs (disgruntled users, prompt injection attempts, manipulative workflows), adversarial coverage matters more than raw accuracy.
Cost ceiling: at what per-task cost does the economics of automation break down relative to human alternatives?

The YC-Bench result is instructive: GLM-5 at $1.21M vs. Claude Opus 4.6 at $1.27M is a roughly 5% performance gap at 11× lower cost. In most enterprise workflows, a cost differential of that size matters far more than a 5% performance gap. Build your matrix to make this tradeoff explicit rather than hiding it behind a single accuracy number.
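One way to fill in the last column of the matrix is a weighted sum with a hard cost ceiling. Everything about the formula below is an assumption for illustration: the weights, the $5.00 ceiling, the choice to disqualify on p95 cost, and the linear cost term. Only the vendor rows come from the table above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Vendor:
    name: str
    accuracy: float              # primary benchmark score, 0.0 to 1.0
    cost_p50: float              # USD per task
    cost_p95: float              # USD per task
    pass5: float                 # pass^5 rate, 0.0 to 1.0
    adversarial_coverage: float  # 0.0 to 1.0


def cost_adjusted_score(v: Vendor, *, w_accuracy: float, w_reliability: float,
                        w_adversarial: float, w_cost: float,
                        cost_ceiling: float) -> Optional[float]:
    """Hypothetical scoring rule; returns None above the cost ceiling."""
    if v.cost_p95 > cost_ceiling:
        return None  # tail cost breaks the economics, so do not rank it
    cost_headroom = 1.0 - v.cost_p50 / cost_ceiling
    return (w_accuracy * v.accuracy + w_reliability * v.pass5
            + w_adversarial * v.adversarial_coverage + w_cost * cost_headroom)


vendors = [
    Vendor("Vendor A", 0.82, 0.85, 3.20, 0.61, 0.15),
    Vendor("Vendor B", 0.79, 0.12, 0.45, 0.68, 0.30),
    Vendor("Vendor C", 0.91, 4.50, 18.00, 0.72, 0.22),
]
# Assumed weights and ceiling; replace with values from your own workflow.
weights = dict(w_accuracy=0.3, w_reliability=0.3, w_adversarial=0.2,
               w_cost=0.2, cost_ceiling=5.00)
for v in vendors:
    print(v.name, cost_adjusted_score(v, **weights))
```

Under these assumed weights the vendor with the highest headline accuracy drops out on tail cost, and the cheapest vendor ranks first. Different weights will produce a different ordering, which is the point of writing them down.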

Part 8: Adversarial Coverage Requirements

YC-Bench's design choice — 1 in 3 adversarial clients — is not an academic curiosity. It reflects a reality of production AI agent deployment: some fraction of interactions will be adversarial, either intentionally (users trying to manipulate the agent) or structurally (ambiguous instructions that the agent must resolve without being misled).

For procurement, require vendors to report:

Adversarial task fraction: what percentage of benchmark tasks included adversarial inputs?
Adversarial performance gap: how does performance on adversarial tasks compare to clean tasks? A vendor who achieves 85% on clean tasks and 20% on adversarial tasks is reporting a structurally brittle agent.
Refusal rate on adversarial inputs: does the agent refuse when it should, or does it comply with manipulative instructions?
Behavioral consistency under pressure: does the agent's behavior change when the adversarial input is persistent vs. one-shot?

Any vendor who cannot report these metrics has not tested their agent for adversarial robustness. Adversarial robustness is not a nice-to-have for enterprise deployment — it is a security property.
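The first two metrics in the list above fall out of any result set that labels each task as clean or adversarial. A minimal sketch, assuming the vendor provides per-task outcomes in that form; the data below is invented to mirror the 85% versus 20% example.

```python
from typing import Dict, List, Tuple


def adversarial_report(results: List[Tuple[bool, bool]]) -> Dict[str, float]:
    """Summarize (is_adversarial, passed) pairs into coverage and gap."""
    adversarial = [passed for is_adv, passed in results if is_adv]
    clean = [passed for is_adv, passed in results if not is_adv]
    if not adversarial or not clean:
        raise ValueError("need at least one clean and one adversarial task")
    clean_rate = sum(clean) / len(clean)
    adversarial_rate = sum(adversarial) / len(adversarial)
    return {
        "adversarial_fraction": len(adversarial) / len(results),
        "clean_pass_rate": clean_rate,
        "adversarial_pass_rate": adversarial_rate,
        "gap": clean_rate - adversarial_rate,
    }


# Hypothetical run: 20 clean tasks (17 pass) and 10 adversarial tasks (2 pass).
sample = ([(False, True)] * 17 + [(False, False)] * 3
          + [(True, True)] * 2 + [(True, False)] * 8)
for metric, value in adversarial_report(sample).items():
    print(f"{metric}: {value:.0%}")
```

A large gap with healthy coverage is more informative than a small gap measured on a handful of adversarial tasks, so read the two numbers together.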

Part 9: Evidence Requirements for Contract

Once you select a vendor, the benchmark conversation is not over. Your contract should specify what evidence the vendor must maintain and make auditable post-deployment. Here is the minimum evidence package to negotiate into the contract.

Technical Evidence Requirements

Monthly benchmark refresh: vendor re-runs agreed benchmark suite monthly on the deployed model version and provides results within 5 business days of each month-end
Cost reporting: weekly cost-per-task actuals for the previous 7 days, broken down by task type
Failure log: weekly report of all agent failures, categorized by failure type, with root cause analysis for failures that recur in the same week
Model change disclosure: 72-hour advance notice before any model version change; benchmark re-run on new version before deployment
Adversarial coverage: quarterly adversarial test suite run against the deployed agent, with methodology and results disclosed

Operational Evidence Requirements

Audit trail: every agent action must be logged with a timestamp, input, output, model version, and cost. Audit logs must be retained for 24 months and accessible to your team within 48 hours of request.
Escalation records: every instance where the agent escalated, refused, or flagged uncertainty must be logged separately for review
Out-of-scope detection: documentation of what the agent does when presented with tasks outside its stated scope — does it refuse, escalate, or attempt and fail?
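It helps to attach a concrete record format to the contract so that "logged" has an unambiguous meaning. The sketch below shows one possible shape for an audit record carrying the fields named above; the class, the field names, the `event` values, and the sample values are hypothetical rather than any vendor's actual schema.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditRecord:
    """One logged agent action (hypothetical schema)."""
    timestamp: str        # ISO 8601, UTC
    agent_input: str
    agent_output: str
    model_version: str
    cost_usd: float
    event: str = "action"  # or "escalation", "refusal", "uncertainty"


def to_log_line(record: AuditRecord) -> str:
    """Serialize to a single JSON line with a stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


record = AuditRecord(
    timestamp=datetime.now(timezone.utc).isoformat(),
    agent_input="Reconcile invoice 1042 against the purchase order",
    agent_output="Escalated: totals differ by more than the allowed tolerance",
    model_version="example-model-2026-04",  # placeholder version string
    cost_usd=0.42,
    event="escalation",
)
print(to_log_line(record))
```

Keeping the event type in the same record lets escalations and refusals be filtered into the separate review stream without a second logging path.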

Structural Trust Requirements

Benchmarks measure past performance under controlled conditions. They cannot tell you whether the agent has made formal commitments about its future behavior, whether there is any consequence mechanism when it fails to meet those commitments, or whether its reputation is continuous across deployments rather than reset with each new customer.

These are structural trust gaps that benchmarks cannot fill. In your contract, require:

Behavioral commitments: what specific behaviors has the vendor committed to? Not "the agent is accurate" but "the agent will refuse tasks outside scope X, will escalate when confidence falls below threshold Y, will not take irreversible actions without confirmation for action types Z."
Consequence mechanism: what happens when the agent violates a behavioral commitment? Is there a contractual remedy? Is there a financial consequence? An agent with no consequence for behavioral failures has no skin in the game.
Behavioral audit rights: the right to run your own adversarial tests against the deployed agent at any time, with results that the vendor cannot pre-screen

Part 10: Green Flags — What a Trustworthy Vendor Submission Looks Like

Most vendor submissions will fail multiple tests in this guide. A submission that clears most of them is genuinely unusual and worth treating as a competitive differentiator. Here is what you are looking for:

On benchmarks:

Reports pass^k at k=5 and k=8, not just single-pass accuracy
Reports cost data (p50, p95, worst case) alongside every accuracy figure
Uses held-out test splits, not public test sets
Tests multiple seeds and reports variance, not just the mean
Includes adversarial task fraction of at least 25%
Provides subgroup analysis mapped to your use case task types
Third-party or independent evaluation, not self-reported
Benchmark date is current (within 60 days for an actively developed model)
Failure mode analysis included — not just what works, but what fails and why

On methodology:

Full methodology disclosure including task selection criteria, evaluator instructions, inter-rater reliability
Reproducibility package available on request
Acknowledges known benchmark vulnerabilities (WebArena string matching, GAIA contamination) and explains how they accounted for them

On commitment:

Behavioral commitments are specific and verifiable, not generic marketing claims
Consequence mechanism exists for behavioral failures
Audit rights are contractually guaranteed
Score refresh cadence is contractually specified

On cost:

Makes cost-adjusted performance comparison easy, not hard
Provides cost projections at your expected task volume
Is transparent about the cost impact of accuracy-cost tradeoffs

Part 11: The Structural Trust Gap Benchmarks Cannot Bridge

Even a perfect benchmark submission from a vendor does not answer the questions that matter most over a multi-year deployment:

Behavioral pacts: has the agent formally committed to specific behavioral constraints, in a form that persists across model updates?
Consequence accountability: when the agent fails to honor a behavioral commitment, what happens to the vendor?
Reputation continuity: does the agent's track record follow it across deployments, or does each new customer start from zero with no visibility into the agent's history?
Internal workflow coverage: benchmarks measure agent behavior on standardized tasks. They say nothing about how the agent behaves inside your specific internal systems, with your specific data, under your specific constraints.

These are gaps that benchmark scores cannot fill because they are structural properties of how the agent is deployed and governed, not properties that can be measured in a standardized task suite.

Where Armalo Fits

Armalo operates as an independent trust verification layer — not a vendor, not a benchmark publisher, but the infrastructure that makes behavioral commitments verifiable and reputation continuous.

For procurement teams, this means three things that don't exist anywhere in the benchmark ecosystem:

Behavioral pacts: agents registered on Armalo define behavioral commitments — not generic capability claims but specific, auditable promises about what they will and will not do. These pacts are stored on-chain and persist across model updates. When a vendor says their agent "is reliable," Armalo makes that claim inspectable: here is the pact, here is the evidence that it has been honored, here is the record of every deviation.

Runtime evidence and composite scoring: Armalo's 12-dimension composite score covers dimensions including accuracy, reliability, safety, security, scope-honesty, cost-efficiency, and model compliance, with documented weights. Anti-gaming mechanisms include outlier trimming (top and bottom 20% removed from jury evaluations) and anomaly detection that flags score swings greater than 200 points for review. Score decay of 1 point per week after a 7-day grace period means stale scores are automatically discounted, so a vendor cannot coast on a strong evaluation from six months ago.
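To make the trimming and decay rules concrete, here is one plausible reading of them in code. This is not Armalo's implementation: the rounding of the trim count, the use of whole elapsed weeks for decay, the floor at zero, and the sample scores are all assumptions made for illustration.

```python
def trimmed_mean(scores, trim_fraction=0.20):
    """Mean after dropping the top and bottom trim_fraction of scores."""
    ordered = sorted(scores)
    cut = int(len(ordered) * trim_fraction)  # assumption: round down
    kept = ordered[cut:len(ordered) - cut]
    if not kept:
        raise ValueError("not enough scores to trim")
    return sum(kept) / len(kept)


def decayed_score(score, days_since_eval, grace_days=7, points_per_week=1.0):
    """Subtract points for each full week elapsed after the grace period."""
    weeks_past_grace = max(0, days_since_eval - grace_days) // 7
    return max(0.0, score - points_per_week * weeks_past_grace)


# Hypothetical jury scores with one low and one high outlier.
print(trimmed_mean([100, 700, 720, 740, 1000]))

# Hypothetical score of 800 evaluated 180 days ago.
print(decayed_score(800, days_since_eval=180))
```

Under this reading the two outlier jury scores have no effect on the result, and a six-month-old evaluation has lost 24 points.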

Trust Oracle: Armalo's public /api/v1/trust/ endpoint lets you query an agent's current trust posture before and during deployment — not just at procurement time. If an agent's behavior degrades after deployment, the trust score changes. You don't need to wait for the vendor's next quarterly benchmark report to find out.

For procurement teams building the evidence requirements described in Part 9, Armalo provides the audit infrastructure that makes behavioral commitments enforceable rather than aspirational.

Bottom Line

Leaderboard theater is cheap to produce and expensive to discover after deployment. The procurement process described here requires more work from vendors — which is exactly the point. Vendors who have done the work will find these questions easy to answer. Vendors who have not will stall, deflect, or provide incomplete responses.

Run the pass^k tests. Require cost data. Specify adversarial coverage fractions. Ask for failure mode analysis. Require a held-out test split. Negotiate behavioral commitments and consequence mechanisms into the contract.

The benchmark ecosystem will keep improving. Contamination vulnerabilities will be patched. New evaluation methodologies will emerge. But the structural gap — between "this agent scored well on a standardized task" and "this agent will behave reliably in my production environment under real conditions with real consequences" — will remain. Fill that gap with behavioral commitments, runtime evidence, and continuous reputation tracking, not just benchmark scores.
