Hermes Agent Benchmark: Metrics, Scorecards, and Review Cadence
The specific Prometheus and W&B metrics that matter for Hermes Agent benchmarking, how to build scorecards across development and production stages, and how to set review cadences that detect behavioral drift before it becomes an incident.
Hermes Agent Benchmarking: Metrics, Scorecards, and Review Cadences That Catch Drift Before It Causes Incidents
Most teams deploying Hermes Agent treat benchmarking as a one-time gate at deployment. Run the eval, record the score, ship. Three months later, cost per task has doubled, a subtle behavioral change has compounded into a support incident, and nobody has a systematic record of when things started going wrong.
This post is a practical guide to doing it differently: the specific metrics Hermes exposes, what those metrics actually measure in production, how to build scorecards that serve engineering, procurement, security, and executive audiences, and how to set a review cadence that catches drift before it becomes an incident.
The Benchmark Landscape: What You're Actually Measuring
Before building scorecards, you need to understand what the major benchmarks test and what they systematically omit.
The Three Active Tracks
TBLite (100 tasks) is the fast iteration proxy. A hundred deterministic tasks covering narrow skill domains: file manipulation, API calls, structured output, context recall. Useful for catching regressions during development. Useless as a standalone deployment gate. TBLite scores correlate poorly with production performance on multi-step, ambiguous, or open-ended tasks.
YC-Bench (arXiv 2604.01212) is the financial autonomy benchmark: give an agent a starting capital allocation and measure whether it ends with more. The headline dollar figures from YC-Bench are worth examining closely: $1.27M for Claude Opus 4.6 and $1.21M for GLM-5 are the final funds the two strongest agents ended with. Only 3 of the 12 tested models exceeded starting capital, making this one of the harshest real-world filters available. YC-Bench requires a minimum of 3 seeds per model, so single-seed results on this benchmark should be treated as noise.
Terminal-Bench 2.0 (arXiv 2601.11868) tests 89 terminal-environment tasks with 3 human reviewers per task. Claude Mythos Preview scores 82%. The three-reviewer design matters: it surfaces inter-rater disagreement and catches tasks where the correct answer is genuinely ambiguous. When you adapt Terminal-Bench tasks to your own environment, keep the three-reviewer structure; single-reviewer evals reliably overestimate agent capability.
The τ-bench Reliability Problem
τ-bench (arXiv 2406.12045, Sierra Research) introduced the pass^k metric, which measures the probability that an agent completes a task successfully on all k consecutive attempts. This is the most important single metric for understanding production reliability, and most teams skip it entirely.
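The math is brutal. If each attempt succeeds independently, an agent with a 70% single-pass success rate has a pass^8 reliability of:
pass^8 = 0.70^8 ≈ 0.058 (5.8%)
For gpt-4o, τ-bench reports single-pass success below 50% and a measured pass^8 below 25% in the retail domain. Under the independence approximation used above, a 50% single-pass rate gives 0.50^8 ≈ 0.4%. Either way, that is not a production-viable agent for any task that requires consistent execution.
The formula for calculating pass^k from a known single-pass rate p, assuming each attempt succeeds independently with the same probability:
pass^k = p^k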
For a target production reliability of 95% on eight consecutive runs, the minimum required single-pass rate is:
p_min = 0.95^(1/8) = 0.9936 (99.4%)
Almost no agents clear this bar today. The practical implication is that any agent handling workflows with multi-step execution dependencies needs either human-in-the-loop checkpoints or formal retry/escalation logic wired into the pact.
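The arithmetic above is easy to check in a few lines. This is a minimal sketch that assumes independent attempts with a constant success rate; the function names `pass_k` and `min_single_pass_rate` are illustrative and not part of any Hermes or τ-bench API.

```python
def pass_k(p: float, k: int) -> float:
    """Probability of k successes in a row, assuming independent attempts with rate p."""
    if not 0.0 <= p <= 1.0 or k < 1:
        raise ValueError("p must be in [0, 1] and k must be >= 1")
    return p ** k


def min_single_pass_rate(target: float, k: int) -> float:
    """Smallest single-pass rate p for which p**k reaches the target reliability."""
    if not 0.0 < target <= 1.0 or k < 1:
        raise ValueError("target must be in (0, 1] and k must be >= 1")
    return target ** (1.0 / k)


if __name__ == "__main__":
    print(f"pass^8 at p=0.70: {pass_k(0.70, 8):.4f}")                    # 0.0576
    print(f"p_min for 95% over 8 runs: {min_single_pass_rate(0.95, 8):.4f}")  # 0.9936
```

Real agents rarely fail independently across attempts, so treat p^k as a planning approximation rather than a measurement.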
Benchmark Vulnerabilities You Need to Know
Berkeley RDI research found that GAIA is exploitable in 98% of tasks, WebArena in approximately 100%, and OSWorld in 73%. This means benchmark scores on these evaluations are partially a measure of how aggressively the agent exploits evaluation artifacts rather than genuine capability.
Specific distortions to track:
- SWE-bench: 7.7% of Lite and 5.2% of Verified tasks have test validity issues. Claude Opus 4.7 scores 87.6%, but that score is measured against a benchmark with known test validity problems.
- GAIA: Human performance is 92% versus 15% for GPT-4 with plugins, which suggests the gap between human and agent capability is real, but the absolute numbers require context.
- WebArena: Overestimates agent performance by 5.2% due to string matching artifacts. Human baseline is 78.24%. Any WebArena score should be deflated by at least 5 percentage points for honest comparison.
- AgentBench (arXiv 2308.03688, Tsinghua, ICLR 2024): Identifies the primary capability bottlenecks as long-term reasoning, decision-making, and instruction following, not tool use or API integration. This is the correct decomposition for diagnosing why an agent underperforms on complex tasks.
Hermes Agent Metrics: What the Prometheus and W&B Surfaces Tell You
Hermes Agent exposes six primary metrics via Prometheus and Weights & Biases. Here is what each metric actually measures and what changes should trigger action.
Metric Definitions
| Metric | Definition | Collection Surface | Signal Value |
|---|---|---|---|
| skill_efficiency_score | Tasks completed per hour, normalized by task complexity tier | Prometheus gauge | Baseline regression detection |
| memory_retrieval_accuracy | Percentage of relevant memories fetched per task, measured by human-labeled relevance sample | W&B + custom eval | Context quality; predicts multi-step coherence |
| self_modification_success_rate | Ratio of accepted to total patches generated by the optimizer module | Prometheus counter | GEPA cycle health; tracks self-improvement trajectory |
| success_rate_by_task_type | Pass/fail by task category (structured output, tool call, long-context, adversarial, etc.) | W&B sweep | Capability gap identification |
| improvement_cycles_needed | Number of GEPA cycles required to converge on target performance | W&B run summary | Iteration efficiency; detects plateau signals |
| token_cost_per_execution | Total tokens consumed per task execution including reflection and optimizer passes | Prometheus histogram | Cost control; catch cost drift before it compounds |
The Token Overhead Reality
Hermes Agent's reflection and optimizer modules add 15–25% token overhead to every execution. This is not a bug; it is the mechanism by which GEPA cycles improve performance. The overhead is worth it: Nous Research internal data shows that after 20+ GEPA cycles, task completion speed improves by 40% and accuracy gains are substantial.
The GEPA architecture (ICLR 2026 Oral) uses up to 35× fewer rollouts than GRPO while delivering +6% average improvement and +20% on specific task types, with MATH benchmark performance at 93% versus GRPO's 67%. The token overhead from reflection is the price of those gains.
However, the cost asymmetry across task types is extreme. Hermes agents can make 2,000+ API calls per complex task, with a 50× cost variation between the cheapest and most expensive tasks ($0.10 to $5.00 per task). Most published benchmarks omit cost data entirely. Your scorecard must include it.
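Because token_cost_per_execution is collected as a Prometheus histogram, the p50 and p95 figures on a scorecard have to be estimated from bucket counts. The sketch below reimplements the usual linear-interpolation estimate in plain Python. The bucket boundaries and counts are made-up illustration data, and expressing them in dollars is an assumption.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound and ending with the +Inf bucket.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            # In the +Inf bucket (or an empty first bucket) fall back to the lower bound.
            if math.isinf(upper_bound) or count == lower_count:
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cost buckets in dollars: (upper bound, cumulative executions).
COST_BUCKETS = [(0.10, 20), (0.50, 60), (2.00, 95), (5.00, 100), (math.inf, 100)]

if __name__ == "__main__":
    print("p50:", round(histogram_quantile(0.50, COST_BUCKETS), 2))
    print("p95:", round(histogram_quantile(0.95, COST_BUCKETS), 2))
```

Bucket boundaries limit the resolution of the estimate, so place them near the dollar thresholds the scorecard gates on.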
Building the Scorecard: Stage-by-Stage Framework
Stage 1: Development Scorecard
During development, you are tracking whether the agent can reach a performance floor, not whether it is production-ready. The development scorecard focuses on capability across task types and GEPA trajectory.
Development Scorecard Template:
| Dimension | Current Score | Target | Delta | Last Reviewed | Trend |
|---|---|---|---|---|---|
| TBLite success rate | - | ≥85% | - | - | - |
| pass^1 by task type: structured output | - | ≥90% | - | - | - |
| pass^1 by task type: tool call | - | ≥85% | - | - | - |
| pass^1 by task type: long-context | - | ≥75% | - | - | - |
| pass^1 by task type: adversarial | - | ≥60% | - | - | - |
| Memory retrieval accuracy | - | ≥80% | - | - | - |
| Token cost per execution (p50) | - | <$0.50 | - | - | - |
| Token cost per execution (p95) | - | <$2.00 | - | - | - |
| GEPA improvement cycles to target | - | <15 | - | - | - |
| Self-modification success rate | - | ≥70% | - | - | - |
The GEPA trajectory is the leading indicator in development. An agent that requires more than 20 cycles to reach target performance on TBLite tasks is probably misconfigured at the architecture level, not just undertrained. Investigate memory retrieval accuracy first: low retrieval accuracy forces the optimizer to relearn context it has already processed.
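One way to keep the development scorecard honest is to encode the targets as data and evaluate them mechanically. The following is a sketch only: the metric keys are hypothetical names chosen for this example, rates are expressed as fractions, and the thresholds are copied from the template above.

```python
import operator
from dataclasses import dataclass

_OPS = {">=": operator.ge, "<": operator.lt}


@dataclass(frozen=True)
class Target:
    name: str
    op: str          # ">=" or "<"
    threshold: float


# Thresholds mirror the development scorecard template; key names are assumptions.
DEV_TARGETS = [
    Target("tblite_success_rate", ">=", 0.85),
    Target("pass1_structured_output", ">=", 0.90),
    Target("pass1_tool_call", ">=", 0.85),
    Target("pass1_long_context", ">=", 0.75),
    Target("pass1_adversarial", ">=", 0.60),
    Target("memory_retrieval_accuracy", ">=", 0.80),
    Target("token_cost_p50_usd", "<", 0.50),
    Target("token_cost_p95_usd", "<", 2.00),
    Target("gepa_cycles_to_target", "<", 15),
    Target("self_modification_success_rate", ">=", 0.70),
]


def evaluate(scores: dict) -> dict:
    """Return 'pass', 'fail', or 'missing' for every scorecard dimension."""
    results = {}
    for target in DEV_TARGETS:
        if target.name not in scores:
            results[target.name] = "missing"
        elif _OPS[target.op](scores[target.name], target.threshold):
            results[target.name] = "pass"
        else:
            results[target.name] = "fail"
    return results
```

Reporting "missing" separately from "fail" keeps an unmeasured dimension from silently reading as green.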
Stage 2: Pre-Deployment Scorecard
Pre-deployment validation adds three things the development scorecard omits: pass^k reliability (not just pass^1), cost-adjusted scoring, and benchmark freshness tracking.
Computing Cost-Normalized Success Rate
Raw success rate hides cost asymmetry. The cost-normalized success rate formula:
CNSR = (success_rate × task_value) / token_cost_per_execution
Where task_value is the dollar value of a successful task completion, normalized to a common unit (e.g., $1.00 for baseline tasks). By this measure, a 90% success rate at $4.50 per task execution is worse than an 85% success rate at $0.80 per task execution: at a task value of $5.00, the first scores a CNSR of 1.0 and the second about 5.3.
For production gating, require CNSR ≥ 1.5 before deployment (the expected value of a task, success rate × task value, is at least 1.5× execution cost at the p50 cost point).
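The CNSR formula and the 1.5 gate translate directly into code. This is a minimal sketch; the function names are illustrative, and the inputs are the example figures from the text with an assumed task value of $5.00.

```python
def cnsr(success_rate: float, task_value: float, cost_per_execution: float) -> float:
    """Cost-normalized success rate: expected task value per dollar of execution cost."""
    if cost_per_execution <= 0:
        raise ValueError("cost_per_execution must be positive")
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be in [0, 1]")
    return success_rate * task_value / cost_per_execution


def passes_gate(value: float, gate: float = 1.5) -> bool:
    """Deployment gate from the text: CNSR must be at least 1.5."""
    return value >= gate


if __name__ == "__main__":
    expensive = cnsr(0.90, 5.00, 4.50)  # high success rate, high cost
    cheap = cnsr(0.85, 5.00, 0.80)      # slightly lower success rate, low cost
    print(round(expensive, 2), passes_gate(expensive))
    print(round(cheap, 2), passes_gate(cheap))
```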
Computing pass^k for Reliability Assessment
For any workflow that requires k consecutive successful executions without human intervention:
pass^k = (pass^1)^k
Example: a customer data migration agent with pass^1 = 0.88 running an 8-step pipeline:
pass^8 = 0.88^8 ≈ 0.360 (36.0%)
That agent will fail roughly 64% of full pipeline runs. Before deployment, the team must either add human checkpoints at high-failure steps, improve pass^1 to ≥0.994 (which yields pass^8 ≥ 0.95), or redesign the pipeline into shorter atomic tasks.
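The (pass^1)^k shortcut assumes every task has the same success probability. When you have repeated trials per task, pass^k can instead be estimated from the data. The sketch below uses the combinatorial estimator that mirrors pass@k (the average over tasks of C(c, k) / C(n, k), where c is the number of successes in n trials); treat its exact form as an assumption to verify against the τ-bench paper. The trial counts are invented for illustration.

```python
from math import comb


def pass_hat_k(successes_per_task, n_trials: int, k: int) -> float:
    """Estimate pass^k from c successes out of n_trials for each task."""
    if not 1 <= k <= n_trials:
        raise ValueError("k must satisfy 1 <= k <= n_trials")
    if not successes_per_task:
        raise ValueError("need at least one task")
    if any(not 0 <= c <= n_trials for c in successes_per_task):
        raise ValueError("each success count must be in [0, n_trials]")
    denom = comb(n_trials, k)
    # comb(c, k) is 0 when c < k, so tasks with too few successes contribute nothing.
    return sum(comb(c, k) for c in successes_per_task) / (denom * len(successes_per_task))


if __name__ == "__main__":
    successes = [8, 8, 7, 4]  # hypothetical: four tasks, eight trials each
    p1 = pass_hat_k(successes, n_trials=8, k=1)
    p8 = pass_hat_k(successes, n_trials=8, k=8)
    print(f"pass^1={p1:.3f}  measured pass^8={p8:.3f}  shortcut={p1 ** 8:.3f}")
```

In this toy data the measured pass^8 is higher than the shortcut predicts, because failures concentrate in a few hard tasks rather than spreading evenly.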
Pre-Deployment Scorecard Template:
| Dimension | Score | Gate Threshold | pass^k (k=8) | Cost (p50/p95) | Benchmark Date | Status |
|---|---|---|---|---|---|---|
| Overall pass^1 | - | ≥90% | - | - | - | - |
| pass^8 (pipeline reliability) | - | ≥60% | - | - | - | - |
| Cost-normalized success rate | - | ≥1.5 | - | - | - | - |
| Memory retrieval accuracy | - | ≥85% | - | - | - | - |
| Adversarial task pass^1 | - | ≥70% | - | - | - | - |
| Token cost p95 | - | <$3.00 | - | - | - | - |
| Benchmark freshness | - | <30 days | - | - | - | - |
| YC-Bench (if applicable) | - | >starting capital | - | - | - | - |
Benchmark freshness deserves a dedicated column. A benchmark score from six months ago is not a deployment gate; it is a historical record. The agent's base model may have been updated, the task distribution in your environment may have shifted, or the benchmark itself may have changed. Require re-evaluation within 30 days of deployment.
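A freshness check is simple enough to automate in the deployment pipeline. The sketch below flags any benchmark whose last run is 30 or more days old; the benchmark names and dates are placeholders, and treating exactly 30 days as stale is an assumption drawn from the "<30 days" threshold.

```python
from datetime import date


def stale_benchmarks(last_run: dict, today: date, max_age_days: int = 30) -> list:
    """Return the names of benchmarks whose last run is max_age_days old or older."""
    stale = []
    for name, run_date in last_run.items():
        age = (today - run_date).days
        if age < 0:
            raise ValueError(f"{name} has a run date in the future")
        if age >= max_age_days:
            stale.append(name)
    return sorted(stale)


if __name__ == "__main__":
    runs = {
        "tblite": date(2026, 3, 1),           # hypothetical dates
        "terminal_bench_2": date(2026, 1, 10),
    }
    print(stale_benchmarks(runs, today=date(2026, 3, 15)))
```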
Stage 3: Production Scorecard
The production scorecard adds behavioral drift detection, cost drift monitoring, and incident correlation. This is the scorecard the on-call team consults during an incident.
Production Scorecard Template:
| Dimension | Current | Baseline (7d avg) | Delta | Threshold (warning) | Threshold (incident) | Last Reviewed | Trend |
|---|---|---|---|---|---|---|---|
| skill_efficiency_score | - | - | - | -10% | -25% | - | - |
| pass^1 (rolling 24h) | - | - | - | -5pp | -15pp | - | - |
| memory_retrieval_accuracy | - | - | - | -8pp | -20pp | - | - |
| token_cost_per_execution (p50) | - | - | - | +50% | +100% | - | - |
| token_cost_per_execution (p95) | - | - | - | +75% | +150% | - | - |
| self_modification_success_rate | - | - | - | -15pp | -30pp | - | - |
| improvement_cycles_needed | - | - | - | +5 cycles | +10 cycles | - | - |
| CNSR | - | - | - | <1.2 | <1.0 | - | - |
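The production thresholds mix four kinds of comparison: relative change, percentage-point change, absolute change, and an absolute floor. A small classifier makes that explicit. This is a sketch under stated assumptions: metric keys are illustrative, rate metrics are passed as percentages (0–100), and a value exactly on a threshold counts as crossing it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    kind: str        # "rel_drop", "rel_rise", "pp_drop", "abs_rise", or "floor"
    warning: float
    incident: float


# Thresholds copied from the production scorecard template.
RULES = {
    "skill_efficiency_score": Rule("rel_drop", 0.10, 0.25),
    "pass1_rolling_24h": Rule("pp_drop", 5, 15),
    "memory_retrieval_accuracy": Rule("pp_drop", 8, 20),
    "token_cost_p50": Rule("rel_rise", 0.50, 1.00),
    "token_cost_p95": Rule("rel_rise", 0.75, 1.50),
    "self_modification_success_rate": Rule("pp_drop", 15, 30),
    "improvement_cycles_needed": Rule("abs_rise", 5, 10),
    "cnsr": Rule("floor", 1.2, 1.0),
}


def classify(metric: str, current: float, baseline: float) -> str:
    """Return 'ok', 'warning', or 'incident' for one metric against its baseline."""
    rule = RULES[metric]
    if rule.kind == "floor":  # absolute level; the baseline is not used
        if current < rule.incident:
            return "incident"
        return "warning" if current < rule.warning else "ok"
    if rule.kind == "rel_drop":
        change = (baseline - current) / baseline
    elif rule.kind == "rel_rise":
        change = (current - baseline) / baseline
    elif rule.kind == "pp_drop":
        change = baseline - current
    else:  # abs_rise
        change = current - baseline
    if change >= rule.incident:
        return "incident"
    return "warning" if change >= rule.warning else "ok"
```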
Review Cadence: Event-Triggered vs. Scheduled
Two cadence types serve different purposes. Conflating them is a common cause of both over-review (wasted engineering time on stable agents) and under-review (missed drift on degrading agents).
Event-Triggered Reviews
Event-triggered reviews fire immediately when a specific threshold is crossed. They are non-negotiable and require response within a defined SLA.
Trigger → SLA → Response:
| Trigger | Severity | Response SLA | Minimum Response |
|---|---|---|---|
| Score drop >10% on any primary dimension | Warning | 4 hours | Root cause investigation, no rollback required |
| Score drop >25% on any primary dimension | Incident | 1 hour | Incident declared, rollback eligibility assessed |
| Token cost spike >2× baseline (p50) | Warning | 4 hours | Cost trace audit, task distribution review |
| Token cost spike >3× baseline (p50) | Incident | 1 hour | Budget circuit breaker, task queue pause |
| Composite trust score anomaly >200 points | Incident | 1 hour | Full behavioral audit, third-party trust oracle check |
| pass^1 drop below deployment gate threshold | Incident | 30 minutes | Rollback triggered automatically |
| memory_retrieval_accuracy below 60% | Warning | 4 hours | Memory index audit, potential reindexing |
| self_modification_success_rate below 40% | Warning | 8 hours | GEPA cycle freeze, optimizer config review |
The 200-point composite score anomaly threshold is not arbitrary. It corresponds to approximately three standard deviations of normal weekly score variation and is the threshold at which Armalo's anomaly detection flags for third-party review. Any agent whose score swings more than 200 points in a single evaluation cycle is exhibiting behavior that cannot be explained by normal performance variance.
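To make the SLAs enforceable, the trigger table can live in code next to the alerting rules. The sketch below maps each trigger to its severity and response window and computes a deadline from the time the alert fired. The trigger keys are hypothetical identifiers, not names emitted by Hermes or Armalo.

```python
from datetime import datetime, timedelta, timezone

# Severity and response SLA per trigger, copied from the table above.
TRIGGERS = {
    "score_drop_gt_10pct": ("warning", timedelta(hours=4)),
    "score_drop_gt_25pct": ("incident", timedelta(hours=1)),
    "token_cost_gt_2x_p50": ("warning", timedelta(hours=4)),
    "token_cost_gt_3x_p50": ("incident", timedelta(hours=1)),
    "composite_anomaly_gt_200": ("incident", timedelta(hours=1)),
    "pass1_below_gate": ("incident", timedelta(minutes=30)),
    "memory_retrieval_below_60pct": ("warning", timedelta(hours=4)),
    "self_modification_below_40pct": ("warning", timedelta(hours=8)),
}


def response_deadline(trigger: str, fired_at: datetime):
    """Return (severity, deadline) for a trigger that fired at `fired_at`."""
    severity, sla = TRIGGERS[trigger]
    return severity, fired_at + sla


if __name__ == "__main__":
    fired = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(response_deadline("pass1_below_gate", fired))
```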
Scheduled Reviews
Scheduled reviews exist to catch slow drift that no single event trigger would catch. Slow drift is often more dangerous than acute failures because it does not trigger incident response and can compound for weeks before becoming visible.
Weekly Review Checklist:
- Compare 7-day rolling average on all primary metrics against 30-day baseline
- Review GEPA cycle count trend: is the agent requiring more cycles to maintain performance? (indicates base model or context drift)
- Audit top 10 most expensive tasks by token cost: are these tasks changing in character?
- Verify benchmark freshness: any primary benchmark score older than 30 days needs re-evaluation scheduling
- Review self-modification success rate trend: a declining ratio over 3+ weeks indicates optimizer saturation
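The first checklist item, comparing the 7-day rolling average against the 30-day baseline, can be computed from a list of daily metric values. This is a minimal sketch; the function name and the sample series are illustrative.

```python
from statistics import fmean


def relative_drift(daily_values, short: int = 7, long: int = 30) -> float:
    """Relative change of the last `short` days' mean versus the last `long` days' mean."""
    if short > long:
        raise ValueError("short window must not exceed long window")
    if len(daily_values) < long:
        raise ValueError(f"need at least {long} daily values")
    recent = fmean(daily_values[-short:])
    baseline = fmean(daily_values[-long:])
    if baseline == 0:
        raise ValueError("baseline mean is zero")
    return (recent - baseline) / baseline


if __name__ == "__main__":
    # Hypothetical series: 23 stable days followed by a week at 90% of the old level.
    series = [1.0] * 23 + [0.9] * 7
    print(f"{relative_drift(series):+.1%}")
```

Note that the 30-day window includes the most recent week, which dampens the measured drift; exclude those days from the baseline if you want a sharper signal.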
Monthly Review Checklist:
- Full pass^k recalculation across all workflow types
- Cost-normalized success rate comparison against pre-deployment baseline
- YC-Bench re-run (if applicable) with minimum 3 seeds
- Adversarial task distribution update: add any failure patterns from the past 30 days as new adversarial test cases
- Benchmark freshness audit: retire any benchmark score older than 90 days from the scorecard
- Review cadence effectiveness: did any incident in the past month suggest a gap in the current trigger thresholds?
Quarterly Review (Board/Procurement/CISO-level):
- Full scorecard across all 12 Armalo trust dimensions against peer agents in the category
- Certification status review: is the agent's composite score above the marketplace access threshold?
- Bond/stake position review: does the current bond amount appropriately reflect the agent's current risk profile?
- Incident retrospective: all P1/P2 incidents, root causes, and whether threshold changes are needed
The GEPA Improvement Trajectory Scorecard
If you are running GEPA cycles on Hermes Agent, the improvement trajectory itself needs to be tracked as a first-class metric. A healthy GEPA trajectory looks like this:
| Cycle Range | Expected skill_efficiency_score Gain | Expected Token Overhead | Expected memory_retrieval_accuracy Gain |
|---|---|---|---|
| 1–5 | +8–12% | +20–25% (reflection learning phase) | +5–10% |
| 6–10 | +12–18% cumulative | +18–22% (optimizer stabilizing) | +10–18% cumulative |
| 11–20 | +25–35% cumulative | +15–18% (efficiency gains begin offsetting overhead) | +15–25% cumulative |
| 20+ | +40% cumulative (Nous Research internal) | +15–20% (stable) | +20–30% cumulative |
A GEPA trajectory scorecard should record, per cycle:
Cycle: <N>
Date: <ISO timestamp>
skill_efficiency_score delta: <+/- %>
memory_retrieval_accuracy delta: <+/- pp>
self_modification_success_rate: <%>
improvement_cycles_needed_to_target: <N>
token_overhead vs baseline: <+/- %>
Notes: <any optimizer config changes, task distribution shifts, model updates>
An agent whose GEPA improvements plateau before cycle 20 is worth investigating. Common causes: memory retrieval accuracy too low (the optimizer cannot learn from context it cannot retrieve), task distribution too narrow (the optimizer is overfitting to a small task set), or base model capability ceiling (no amount of GEPA cycles will overcome a fundamental capability gap).
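A per-cycle record plus a simple plateau detector is enough to flag a stalled trajectory early. The sketch below is one possible definition, not a GEPA feature: it calls a plateau when three consecutive cycles each gain less than one percent in skill_efficiency_score. Both the window and the minimum gain are assumptions to tune.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CycleRecord:
    cycle: int
    date: str                           # ISO timestamp
    skill_efficiency_delta_pct: float   # gain for this cycle, in percent
    memory_retrieval_delta_pp: float
    self_modification_success_rate: float
    token_overhead_pct: float
    notes: str = ""


def plateau_cycle(records, window: int = 3, min_gain_pct: float = 1.0) -> Optional[int]:
    """Return the cycle at which `window` consecutive low-gain cycles complete, else None."""
    run = 0
    for record in sorted(records, key=lambda r: r.cycle):
        run = run + 1 if record.skill_efficiency_delta_pct < min_gain_pct else 0
        if run >= window:
            return record.cycle
    return None
```

A returned cycle number below 20 is the cue to check the three common causes listed above, starting with memory retrieval accuracy.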
Threshold Decision Tree: Re-Evaluate vs. Rollback vs. Incident
The hardest judgment call in agent operations is whether a degrading metric warrants re-evaluation (run new benchmarks and decide), rollback (revert to previous version), or full incident declaration (page on-call, engage security, notify stakeholders).
Metric degradation detected
│
▼
Is degradation acute (single cycle) or gradual (3+ cycles)?
├── Acute: Delta >25%?
│   ├── Yes → Incident (rollback eligible)
│   └── No → Re-eval (48h window)
└── Gradual: Delta >10% over 3 weeks?
    ├── Yes → Re-eval (7-day window)
    └── No → Monitor (weekly cadence)

Re-eval shows regression vs. prior version?
├── No → Adjust thresholds, update scorecard baseline, document
└── Yes → Rollback candidate
    └── Is rollback reversible and task queue drainable?
        ├── Yes → Execute rollback
        └── No → Escalate to incident
A rollback should always be the first option considered when pass^1 drops below the deployment gate threshold. The agent was deployed against a gate for a reason. If it no longer meets that gate, the gate should be enforced. The rollback should be reversible, which is a reason to always maintain the prior deployment artifact and the scorecard snapshot that accompanied it.
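The decision tree is small enough to encode as two pure functions, which makes the policy testable and removes judgment calls at 3 a.m. This is a sketch of the tree as described in this section; the returned labels are illustrative strings, and the degradation delta is passed in percent.

```python
def triage(acute: bool, delta_pct: float) -> str:
    """First stage: classify a detected degradation."""
    if acute:
        return "incident" if delta_pct > 25 else "re-eval-48h"
    return "re-eval-7d" if delta_pct > 10 else "monitor-weekly"


def after_re_eval(regression: bool, reversible: bool, queue_drainable: bool) -> str:
    """Second stage: decide what to do once the re-evaluation results are in."""
    if not regression:
        return "adjust-thresholds-and-document"
    if reversible and queue_drainable:
        return "execute-rollback"
    return "escalate-to-incident"


if __name__ == "__main__":
    print(triage(acute=True, delta_pct=30))
    print(after_re_eval(regression=True, reversible=True, queue_drainable=False))
```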
Audience-Specific Scorecard Views
The same underlying metrics need to be presented differently depending on who is consuming the scorecard.
Engineering Team View
Full metric detail. Includes raw skill_efficiency_score, memory_retrieval_accuracy, self_modification_success_rate, token cost histograms, GEPA cycle trajectories, and benchmark freshness dates. This view is operational: it drives day-to-day decisions about re-evaluation scheduling and threshold adjustments.
Key additions for engineering:
- P50/P95/P99 cost breakdown by task type
- GEPA cycle delta per-dimension (not just aggregate)
- Adversarial task failure mode categorization
- Benchmark exploit vulnerability flags (per Berkeley RDI findings)
Procurement View
Cost and reliability emphasis. Procurement needs to know: what does this agent cost per task at scale, is the cost stable, and what is the failure rate under production conditions?
Key metrics for procurement:
- Cost-normalized success rate (CNSR)
- Token cost p50/p95 trend (30-day)
- pass^8 (pipeline reliability)
- Total cost of ownership vs. baseline (human or prior automation)
- Incident frequency and resolution time
CISO View
Security, audit, and behavioral integrity. The CISO needs evidence that the agent is operating within defined scope, that behavioral drift is being monitored, and that there is a clear incident escalation path.
Key metrics for CISO:
- Scope-honesty score (does the agent stay within its defined operational scope?)
- Behavioral drift flags (anomaly detection hits in the past 90 days)
- Memory attestation validity (are memory entries verifiable?)
- Adversarial task pass^1 (how does the agent perform against red-team scenarios?)
- Incident log with root cause classification
- Third-party trust oracle query results (external verification of behavioral claims)
Board/Executive View
Strategic and financial summary. One page, current quarter vs. prior quarter, against benchmarks that matter to the business.
Key metrics for board:
- Composite trust score vs. peer agents (percentile ranking)
- CNSR trend (is the agent getting more or less cost-efficient?)
- Incident frequency trend
- Certification status and renewal schedule
- Total value delivered (task volume × task value) vs. cost
The Anti-Patterns That Kill Good Scorecards
Staleness creep. A benchmark score from six months ago is not evidence. It is a snapshot of a specific model version, on a specific task distribution, with a specific evaluation configuration that may no longer reflect your production environment. Every primary scorecard dimension should have a last_reviewed date and a maximum freshness threshold. If you cannot answer the question "when was this score last validated against current production conditions," the score is not a control; it is a decoration.
Single-seed syndrome. YC-Bench requires a minimum of 3 seeds. Terminal-Bench uses 3 human reviewers per task. These are not arbitrary: single evaluations on stochastic agents have variance so high that the score carries limited information. Any benchmark score on the production scorecard should represent the mean of at least 3 independent evaluation runs.
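Enforcing the three-run minimum can be as simple as refusing to aggregate fewer runs. The sketch below returns the mean and the sample standard deviation so the spread is reported alongside the score; the function name is illustrative.

```python
from statistics import mean, stdev


def aggregate_runs(scores, min_runs: int = 3):
    """Return (mean, sample standard deviation) over independent evaluation runs."""
    if len(scores) < min_runs:
        raise ValueError(f"need at least {min_runs} independent runs, got {len(scores)}")
    return mean(scores), stdev(scores)


if __name__ == "__main__":
    avg, spread = aggregate_runs([0.80, 0.90, 0.85])  # hypothetical run scores
    print(f"{avg:.3f} ± {spread:.3f}")
```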
Omitting cost. Most published benchmarks omit cost data entirely. This is a catastrophic omission for production operations. The 50× cost variation across task types ($0.10 to $5.00) means that a slight shift in task distribution (more complex tasks, more adversarial inputs, more multi-hop reasoning chains) can double your monthly inference bill with no change in the raw success rate metric. Cost metrics belong in every tier of the scorecard, from development through board reporting.
Conflating pass^1 with reliability. A 70% single-pass rate looks acceptable until you apply pass^k math. For any workflow where multiple consecutive steps must succeed, pass^k is the number that matters. Teams that evaluate only on pass^1 will routinely deploy agents that fail most multi-step pipelines in production.
Missing exploit flags. Berkeley RDI's finding that GAIA, WebArena, and OSWorld are exploitable at 73–100% rates is not a theoretical concern. An agent that learns to exploit benchmark artifacts will generalize those exploitation patterns to production environments where similar artifacts exist. Benchmark scores from vulnerable evaluations should be flagged on the scorecard and accompanied by adversarial validation that specifically tests for exploitation patterns.
Where Armalo Fits: The Ongoing Evidence Layer
Building a scorecard is the easy part. Keeping it honest over time is the hard part.
Every metric in this guide can be gamed if the team responsible for the agent is also responsible for producing the benchmark evidence. The agent's benchmark runs, GEPA trajectories, cost data, and behavioral logs need to be verifiable by parties who have no stake in the outcome: buyers, regulators, insurance underwriters, or the enterprise security team evaluating a vendor's claims.
Armalo's 12-dimension composite scoring system addresses this directly:
| Dimension | Weight | What It Captures |
|---|---|---|
| Accuracy | 14% | Task success across eval types |
| Reliability | 13% | pass^k consistency over time |
| Safety | 11% | Adversarial task behavior, guardrail adherence |
| Self-audit (Metacal™) | 9% | Agent's honesty about its own capability limits |
| Security | 8% | Scope boundary enforcement, injection resistance |
| Bond | 8% | Financial stake signaling commitment |
| Latency | 8% | Speed under production load |
| Scope-honesty | 7% | Staying within defined operational scope |
| Cost-efficiency | 7% | CNSR trajectory |
| Model-compliance | 5% | Adherence to model usage policies |
| Runtime-compliance | 5% | Infrastructure policy adherence |
| Harness-stability | 5% | Eval harness reproducibility |
Three architectural properties make this evidence layer durable rather than decorative:
Score decay prevents stale evidence from sustaining a high score indefinitely. Scores decrease by 1 point per week after a 7-day grace period. An agent that stops running evaluations will see its score degrade visibly on the same schedule that benchmarks go stale. You cannot maintain a high Armalo score without ongoing validation activity.
Anti-gaming controls remove the incentive to cherry-pick favorable evaluation runs. Jury outlier trimming (top/bottom 20% of judge scores excluded) prevents single anomalously favorable evaluations from inflating the score. Anomaly detection flags any composite score movement greater than 200 points for human review. You cannot spike the score by running a single favorable benchmark.
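The decay and trimming mechanics described above are straightforward to model when planning an evaluation schedule. The sketch below is an approximation built only from the description in this section, not Armalo's implementation: it assumes decay is applied per completed week after the grace period, and that trimming drops the floor of 20% of judge scores from each end.

```python
def decayed_score(score: float, days_since_eval: int,
                  grace_days: int = 7, points_per_week: float = 1.0) -> float:
    """Apply 1 point of decay per completed week after the grace period (assumed rule)."""
    if days_since_eval < 0:
        raise ValueError("days_since_eval must be non-negative")
    overdue_weeks = max(0, days_since_eval - grace_days) // 7
    return max(0.0, score - points_per_week * overdue_weeks)


def trimmed_mean(judge_scores, trim: float = 0.20) -> float:
    """Mean of judge scores after dropping the top and bottom `trim` fraction."""
    if not judge_scores:
        raise ValueError("need at least one judge score")
    if not 0.0 <= trim < 0.5:
        raise ValueError("trim must be in [0, 0.5)")
    n_cut = int(len(judge_scores) * trim)
    kept = sorted(judge_scores)[n_cut:len(judge_scores) - n_cut]
    return sum(kept) / len(kept)


if __name__ == "__main__":
    print(decayed_score(850, days_since_eval=21))
    print(trimmed_mean([1, 7, 7, 8, 8, 8, 8, 9, 9, 10]))
```

In the second example the outlying 1 and 10 are discarded along with one more score from each end, so a single anomalous judge cannot move the result.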
The Trust Oracle at /api/v1/trust/ makes the score queryable by external parties in real time. A buyer, a regulator, or an enterprise security team can verify an agent's current trust posture without relying on the vendor's self-reported metrics. The oracle returns the composite score, the dimension breakdown, the last evaluation date, and the bond/stake position: everything needed to make an independent deployment decision.
The scorecard framework in this guide is the internal operational discipline. Armalo's trust infrastructure is what makes that discipline externally verifiable. Both are necessary. A team that runs rigorous internal benchmarking but cannot produce verifiable external evidence will lose deals to agents that can. A team that relies on external trust scores without rigorous internal monitoring will be surprised by the incidents that cause those scores to drop.
The honest answer about agent benchmarking in 2026 is that the tooling is good enough to catch most forms of drift, if you use it consistently, instrument it correctly, and tie the metrics to real decision triggers. The teams that are not catching drift are not failing because the tools don't exist. They are failing because their scorecards are treated as one-time artifacts instead of living controls.
The review cadence is the discipline that turns a scorecard into a control system. Set it, enforce it, and automate the event-triggered reviews. The metrics will tell you when something is wrong. The cadence is what ensures someone is listening.