Hermes Agent Benchmark: Implementation Playbook
A step-by-step implementation guide for Hermes Agent benchmarking, covering Atropos setup, TBLite baseline evaluation, GEPA self-improvement cycles, Terminal-Bench 2.0, YC-Bench long-horizon strategy testing, cost-adjusted analysis, adversarial hardening, and how to package benchmark evidence for production trust decisions.
What this playbook covers
Hermes Agent, released by NousResearch, is one of the most capable open agentic systems available as of mid-2026. It ships with a three-mode evaluation framework (Atropos RL), a self-improvement layer (GEPA), and access to three distinct benchmark tracks that test meaningfully different agent properties. Running all of it without a structured approach wastes compute, produces misleading results, and leaves the evidence in a form that no one outside the original engineering team can use.
This playbook gives you a complete, phase-by-phase implementation guide. Each phase has concrete steps, expected outputs, and the pitfalls that cause teams to misread or misuse results. The final phase shows how to convert benchmark evidence into production trust infrastructure using behavioral pacts and continuous reputation scoring.
Phase 1: Environment setup
1.1 Install Atropos
Atropos is the RL backend that powers all three Hermes evaluation modes. A single environment definition drives benchmark evaluation, GRPO training, and SFT data generation, which avoids the common failure mode where eval environments diverge from training environments.
git clone https://github.com/NousResearch/hermes-agent
cd hermes-agent
pip install -e ".[atropos]"
Verify that the three mode flags are all reachable before proceeding:
python -m atropos --mode eval --help
python -m atropos --mode grpo --help
python -m atropos --mode sft --help
If any mode fails, the installation is incomplete. Do not proceed past Phase 1 with a partial install: missing dependencies produce silent failures in later phases that are expensive to diagnose.
1.2 Docker environment preparation
TBLite and Terminal-Bench 2.0 both require containerized task environments. Pull the task images now, before starting any evaluation, so that benchmark runtime is not interrupted by image pulls:
# TBLite images (100 tasks, fast proxy for Terminal-Bench 2.0)
docker pull nousresearch/tblite-tasks:latest
# Terminal-Bench 2.0 task images (89 tasks)
docker pull nousresearch/terminal-bench-2:latest
Verify that your Docker host has at least 40 GB of available disk for all task images plus working volumes. Allocate at least 8 GB RAM per concurrent task container, because TB2 tasks can spike significantly during execution.
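A quick pre-flight check can catch an undersized host before a run starts. The sketch below uses only the Python standard library. The helper names are illustrative, and checking the current directory's filesystem is an assumption: Docker may keep images on a different volume on your host.

```python
import shutil

MIN_FREE_DISK_GB = 40      # disk guidance from this playbook
RAM_PER_CONTAINER_GB = 8   # RAM guidance per concurrent task container


def free_disk_gb(path: str = ".") -> float:
    """Free space on the filesystem that holds `path`, in GB (10^9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def disk_ok(path: str = ".") -> bool:
    """True if the filesystem has at least the recommended free space."""
    return free_disk_gb(path) >= MIN_FREE_DISK_GB


def required_ram_gb(concurrent_tasks: int) -> int:
    """Minimum RAM to reserve for the given number of concurrent containers."""
    return concurrent_tasks * RAM_PER_CONTAINER_GB


print(f"free disk: {free_disk_gb():.1f} GB, enough: {disk_ok()}")
print(f"RAM needed for 4 concurrent tasks: {required_ram_gb(4)} GB")
```

Four concurrent containers at 8 GB each means reserving 32 GB of RAM for task execution alone, before counting the host OS and the evaluation harness.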
1.3 Weights & Biases integration
Hermes emits Prometheus metrics and W&B logs natively. Configure W&B before the first evaluation run:
export WANDB_PROJECT="hermes-benchmark"
export WANDB_ENTITY="your-org"
wandb login
Create a dedicated W&B project with tag conventions that will scale across all phases:
- phase:baseline for TBLite first-pass runs
- phase:gepa for GEPA improvement cycles
- phase:deep for TB2 and YC-Bench runs
- seed:<value> (for example seed:42) for per-seed tracking
This tagging structure lets you filter and compare runs across the full evaluation arc without manually reconstructing which run was which.
1.4 Configuration baseline file
Create a benchmark-config.yaml at the root of your evaluation workspace. All phases will inherit from it:
model:
  provider: anthropic  # or openai, together, etc.
  name: claude-sonnet-4-6
  temperature: 0.0
evaluation:
  seeds: [42, 1337, 7919]
  max_concurrent_tasks: 4
  timeout_per_task_seconds: 300
cost_tracking:
  enabled: true
  log_tokens_per_task: true
output:
  results_dir: ./results
  log_level: INFO
Temperature 0.0 is non-negotiable for any benchmark run you intend to compare against published leaderboards. Stochastic sampling produces valid research data but prevents direct comparison with published scores from deterministic runs.
Phase 2: Baseline evaluation with TBLite
TBLite is the correct starting point. It is a 100-task companion to Terminal-Bench 2.0 created by the OpenThoughts Agent team (Snorkel AI + Bespoke Labs), with all environments available as Docker images on DockerHub. It serves as a fast proxy for TB2's full 89-task suite, completing in roughly one-fifth the time and cost while correlating strongly with full TB2 scores.
2.1 Run TBLite across three seeds
Single-seed evaluation is not a reliable baseline. The pass^k metric from τ-bench, which estimates the probability that an agent succeeds on all k of k independent trials of the same task, was designed to measure exactly this run-to-run reliability. Treat three seeds as the minimum for a usable estimate, and run all three before recording any number as your baseline:
for seed in 42 1337 7919; do
  python -m atropos \
    --mode eval \
    --benchmark tblite \
    --config benchmark-config.yaml \
    --seed "$seed" \
    --tags "phase:baseline,seed:$seed" \
    --output-dir "./results/baseline/seed-$seed"
done
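Once the three seed runs finish, per-task outcomes can be combined into a pass^k estimate. The sketch below uses the combinatorial estimator C(c, k) / C(n, k) for a task with c successes in n trials, averaged over tasks. The input layout (task ID mapped to a list of per-seed booleans) and the task IDs are assumptions for illustration, not the Atropos output format.

```python
from math import comb
from statistics import mean


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the probability that k independent trials of one task all pass."""
    if not 1 <= k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    return comb(successes, k) / comb(trials, k)


def benchmark_pass_hat_k(outcomes: dict[str, list[bool]], k: int) -> float:
    """Average the per-task estimate over every task in the benchmark."""
    return mean(pass_hat_k(sum(runs), len(runs), k) for runs in outcomes.values())


# Hypothetical outcomes for three tasks across three seeds.
outcomes = {
    "task-001": [True, True, True],
    "task-002": [True, False, True],
    "task-003": [False, False, False],
}
print("pass^1:", benchmark_pass_hat_k(outcomes, 1))
print("pass^3:", benchmark_pass_hat_k(outcomes, 3))
```

pass^1 equals the ordinary mean pass rate, while pass^3 only credits tasks that passed on every seed. The gap between the two shows how much of the headline score depends on lucky runs.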
2.2 Record cost per task
Token cost is not a secondary concern; it is part of the result. Models with similar accuracy scores can differ in cost by as much as 50×, so cost-unadjusted comparisons produce misleading rankings. Record cost per task in every run:
python -m hermes.cost_report \
  --results-dir ./results/baseline \
  --output ./results/baseline/cost-summary.json
Expected output fields per task:
- input_tokens
- output_tokens
- total_cost_usd
- pass_rate (across seeds)
- mean_latency_seconds
2.3 Document your methodology
Before running Phase 3, write a one-page methodology document that records: model name and version, temperature, seed values, Docker image tags, Atropos version, and the date of the run. This is not bureaucratic overhead; it is the minimum evidence required for another team to reproduce your results and for a procurement team to evaluate your claims later.
Store this file at ./results/baseline/methodology.md. You will reference it in Phase 8.
2.4 Interpret the baseline
Typical TBLite scores for strong models in 2026 range from 55% to 75%. Below 50% suggests either a configuration problem or a model that is genuinely underperforming on CLI-heavy tasks. Above 75% on TBLite should trigger a sanity check: verify that the evaluation environments are isolated (see Phase 6) before claiming a top-tier result.
| Score range | Interpretation |
|---|---|
| < 40% | Configuration error likely; recheck Docker setup |
| 40–55% | Functional but below median for capable models |
| 55–70% | Competitive; worth proceeding to full TB2 |
| 70–80% | Strong baseline; GEPA improvement likely to yield high absolute gains |
| > 80% | Verify environment isolation before reporting |
Phase 3: GEPA self-improvement cycle
GEPA (ICLR 2026 Oral) is the self-improvement mechanism integrated into Hermes. It reads execution traces from completed evaluations, identifies underperforming tool usages and skill implementations, and proposes targeted improvements to tool descriptions, system prompts, and skill logic. It requires no additional training runs: GEPA operates entirely through prompt optimization and skill library updates.
Key parameters to internalize before configuring GEPA:
- Requires a minimum of 3 execution examples per tool or skill
- Uses up to 35× fewer rollouts than GRPO for comparable improvement
- Adds 15–25% token overhead per task due to reflection and optimizer calls
- Average improvement: +6% across tasks, up to +20% on specific task categories
- After 20+ self-created skills: 40% faster average task completion
3.1 Configure the reflection module
GEPA's reflection module needs access to your baseline execution traces:
# gepa-config.yaml
gepa:
  trace_source: ./results/baseline
  min_examples_per_skill: 3
  max_improvement_cycles: 5
  target_metrics:
    - skill_efficiency_score          # tasks/hour
    - memory_retrieval_accuracy       # SQLite FTS5 retrieval precision
    - self_modification_success_rate  # fraction of proposed changes that improve pass rate
  reflection_budget_tokens: 2000
  optimizer_budget_tokens: 3000
3.2 Run improvement cycles
Do not run all five cycles in one pass if this is your first GEPA deployment. Run two cycles, inspect the proposed changes, and validate that the optimizer is targeting the right failure modes before committing to the full cycle count:
# Cycles 1-2 with manual review gates
python -m hermes.gepa \
  --config gepa-config.yaml \
  --cycles 2 \
  --review-gate true \
  --output-dir ./results/gepa
# Inspect proposed changes before continuing
cat ./results/gepa/cycle-2/proposed_skill_updates.json
# Remaining cycles
python -m hermes.gepa \
  --config gepa-config.yaml \
  --cycles 3 \
  --resume-from ./results/gepa/cycle-2 \
  --output-dir ./results/gepa
3.3 Track improvement trajectory, not just endpoint
The most valuable output of a GEPA run is not the final score; it is the improvement trajectory across cycles. A system that jumps from 58% to 74% in two cycles then plateaus is a different signal than one that improves steadily from 58% to 74% across all five cycles. The plateau pattern suggests the model has exhausted the gains available from prompt optimization and may need architectural changes or training data augmentation.
Plot the trajectory using the W&B tags you configured in Phase 1:
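import wandb

api = wandb.Api()
runs = api.runs("your-org/hermes-benchmark", filters={"tags": {"$in": ["phase:gepa"]}})
# Assumes each run logs "cycle" in its config and "pass_rate" in its summary.
points = [(r.config["cycle"], r.summary.get("pass_rate")) for r in runs if "cycle" in r.config]
for cycle, pass_rate in sorted(points, key=lambda p: p[0]):
    print(cycle, pass_rate)  # plot pass_rate vs cycle from these points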
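To tell a plateau from steady improvement without eyeballing a chart, the per-cycle gains can be compared directly. This is a minimal sketch, assuming pass rates are available as a list ordered by cycle with the baseline first. The 1-point threshold for "no meaningful gain" is an arbitrary illustrative choice, and the two trajectories are made-up numbers matching the 58% to 74% example.

```python
from typing import Optional


def cycle_gains(pass_rates: list[float]) -> list[float]:
    """Gain contributed by each cycle; pass_rates[0] is the pre-GEPA baseline."""
    return [after - before for before, after in zip(pass_rates, pass_rates[1:])]


def plateau_cycle(pass_rates: list[float], min_gain: float = 1.0) -> Optional[int]:
    """Return the cycle after which every later gain stays below min_gain.

    Returns None when the final cycle still produced a meaningful gain.
    """
    gains = cycle_gains(pass_rates)
    for index in range(len(gains)):
        if all(gain < min_gain for gain in gains[index:]):
            return index
    return None


early_jump = [58, 68, 74, 74, 74.5, 74]   # big gains in two cycles, then flat
steady = [58, 61, 64.5, 68, 71, 74]       # similar endpoint, spread over five cycles

print("early jump plateaus after cycle:", plateau_cycle(early_jump))
print("steady plateaus after cycle:", plateau_cycle(steady))
```

Both trajectories end at 74%, but only the first one reports a plateau, which is the distinction the endpoint score alone hides.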
3.4 Re-evaluate on TBLite after GEPA
After GEPA completes, re-run TBLite with the same three seeds. This gives you a clean before/after comparison on the same benchmark. The delta (δ) between baseline and post-GEPA pass rates is the evidence you will use in Phase 8 to demonstrate improvement velocity.
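Because both runs use the same seeds, the comparison can be made seed by seed rather than on the averages alone. The sketch below assumes per-seed pass rates in percent, listed in the same seed order for both runs. The numbers are hypothetical.

```python
from statistics import mean, stdev


def improvement_delta(baseline: list[float], post_gepa: list[float]) -> dict[str, float]:
    """Summarize the per-seed change in pass rate (percentage points)."""
    if len(baseline) != len(post_gepa) or len(baseline) < 2:
        raise ValueError("need matching per-seed results from at least two seeds")
    per_seed = [after - before for before, after in zip(baseline, post_gepa)]
    return {
        "mean_delta": mean(per_seed),
        "delta_stdev": stdev(per_seed),  # seed-to-seed spread of the gain
        "min_delta": min(per_seed),      # worst-case seed
    }


# Hypothetical pass rates for seeds 42, 1337, 7919.
summary = improvement_delta([61.0, 63.0, 62.0], [70.0, 72.0, 71.0])
print(summary)
```

Reporting the worst-case seed alongside the mean makes it harder to overstate a gain that shows up on only one or two seeds.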
3.5 MATH benchmark as a calibration check
If your use case involves quantitative reasoning, run the MATH benchmark evaluation after GEPA. GEPA's published result on MATH is 93% vs. a 67% CoT baseline, an unusually large gain that stems from GEPA learning to decompose mathematical sub-steps into explicit skill calls. If you see a MATH improvement below 10 percentage points after GEPA, check that the reflection module had sufficient MATH-type examples in the trace set.
Phase 4: Deep evaluation
TBLite is a proxy. Deep evaluation means running the two benchmarks that test properties TBLite does not: Terminal-Bench 2.0 for CLI precision and long-running task chains, and YC-Bench for multi-turn strategic reasoning under adversarial pressure.
4.1 Terminal-Bench 2.0
Terminal-Bench 2.0 (arXiv 2601.11868) contains 89 tasks, each reviewed by three human evaluators, all Docker-containerized. Published top scores as of this writing: Claude Mythos Preview 82%, GPT-5.3 Codex 77.3%, GPT-5.4 75.1%. Leaderboard: tbench.ai/leaderboard/terminal-bench/2.0.
TB2 tasks are significantly harder and more time-consuming than TBLite tasks. Budget 2–4× the compute time and 3–5× the cost per task compared to TBLite.
for seed in 42 1337 7919; do
  python -m atropos \
    --mode eval \
    --benchmark terminal-bench-2 \
    --config benchmark-config.yaml \
    --seed "$seed" \
    --post-gepa true \
    --skill-library ./results/gepa/final/skill_library.json \
    --tags "phase:deep,benchmark:tb2,seed:$seed" \
    --output-dir "./results/deep/tb2/seed-$seed"
done
Because each TB2 task is reviewed by three human evaluators, disagreements between reviewers are part of the dataset. Report the mean pass rate and the inter-rater agreement alongside your score. With three raters, Fleiss' κ is the appropriate statistic (Cohen's κ is defined for exactly two raters). A task group with κ < 0.4 is ambiguous and should not be used as a primary performance signal.
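With three reviewers per task, agreement can be computed with Fleiss' κ, the multi-rater generalization of Cohen's κ. The sketch below is a plain-Python implementation of the standard formula. The input layout (one row per task, one column per verdict category, each cell holding the number of reviewers who chose that verdict) and the sample counts are assumptions for illustration.

```python
def fleiss_kappa(counts: list[list[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every row must sum to the same
    number of raters (at least two)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall share of each category.
    n_categories = len(counts[0])
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_chance = sum(share * share for share in shares)

    if p_chance == 1.0:
        return float("nan")  # only one category was ever used; kappa is undefined
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical verdicts from 3 reviewers on 4 tasks: columns are [pass, fail].
ratings = [[3, 0], [0, 3], [2, 1], [1, 2]]
print(round(fleiss_kappa(ratings), 3))
```

κ is a property of a set of tasks, so it is most useful per task category. For flagging an individual task, the per-item agreement term (the `per_item` value above) is the closest analogue.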
4.2 YC-Bench
YC-Bench (arXiv 2604.01212, github.com/collinear-ai/yc-bench) is a fundamentally different evaluation. It simulates a full year of running an AI startup as CEO, with $200K starting capital, hundreds of decision turns, four operational domains (research, inference, data, training), and one-third adversarial clients designed to destabilize the business through misleading signals and hostile requests.
Published benchmark results: Claude Opus 4.6 $1.27M end capital, GLM-5 $1.21M end capital. These numbers are outputs of the simulation, not pass/fail scores.
YC-Bench requires SQLite installation and the collinear-ai CLI:
pip install yc-bench
yc-bench init --workspace ./results/deep/yc-bench
Run with a minimum of 3 seeds, per YC-Bench's documented guidance. Capital outcomes vary significantly across seeds because of the adversarial client randomization:
for seed in 42 1337 7919; do
  yc-bench run \
    --model claude-sonnet-4-6 \
    --seed "$seed" \
    --horizon 365 \
    --starting-capital 200000 \
    --adversarial-fraction 0.33 \
    --output "./results/deep/yc-bench/seed-$seed"
done
What to look for in YC-Bench results:
| Signal | What it indicates |
|---|---|
| High variance across seeds (>30% capital spread) | Agent is sensitive to adversarial framing; inconsistent strategy |
| Adversarial client win rate > 40% | Agent fails to detect hostile patterns; scope honesty issue |
| Capital peak early then decline | Agent does not adapt to changing domain conditions |
| Consistent capital growth across all seeds | Robust strategic reasoning; strong signal for long-horizon deployment |
YC-Bench's decay mechanics, where domains degrade in value over time unless actively maintained, test whether the agent can recognize when its current strategy is becoming stale. This is a proxy for behavioral drift resistance, one of the properties that matters most in production deployments.
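Two of the signals in the table above can be checked mechanically from per-seed results. The sketch below makes two assumptions that the text does not pin down: "capital spread" is taken as (max - min) / mean of end capital across seeds, and "peak early" means the maximum falls in the first half of the trajectory. The function names and sample numbers are illustrative, not YC-Bench output.

```python
from statistics import mean


def capital_spread(end_capitals: list[float]) -> float:
    """Relative spread of end capital across seeds: (max - min) / mean."""
    return (max(end_capitals) - min(end_capitals)) / mean(end_capitals)


def high_variance(end_capitals: list[float], threshold: float = 0.30) -> bool:
    """True when the spread exceeds the 30% level flagged in the table."""
    return capital_spread(end_capitals) > threshold


def peaked_early(trajectory: list[float], early_fraction: float = 0.5) -> bool:
    """True when capital peaks in the early part of the run and ends lower."""
    peak = max(trajectory)
    peak_index = trajectory.index(peak)
    return peak_index < len(trajectory) * early_fraction and trajectory[-1] < peak


# Hypothetical end capital (USD) for three seeds.
end_capitals = [1_000_000, 900_000, 1_100_000]
print("spread:", capital_spread(end_capitals), "high variance:", high_variance(end_capitals))

# Hypothetical quarterly capital snapshots (USD thousands) for one seed.
print("peaked early:", peaked_early([200, 500, 400, 300]))
```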
Phase 5: Cost-adjusted analysis
Raw pass rates without cost normalization are misleading. A model scoring 72% at $0.008/task is not comparable to one scoring 74% at $0.40/task: that is a 50× cost gap for a two-point accuracy gain. Cost variation of this size across models with similar accuracy is real and documented.
5.1 Normalize all scores by API cost
Calculate a cost-adjusted score for every model and benchmark combination:
cost_adjusted_score = pass_rate / (cost_per_task / baseline_cost_per_task)
Where baseline_cost_per_task is the cost of your cheapest evaluated model. This produces a dimensionless efficiency score where the baseline model scores exactly its raw pass rate and more expensive models are penalized proportionally.
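The formula is easy to get backwards, so a worked example helps. The sketch below applies it to the illustrative pre-GEPA and post-GEPA figures used in this playbook (62% at $0.018/task and 71% at $0.023/task). The function name and the dictionary layout are assumptions.

```python
def cost_adjusted_score(pass_rate_pct: float, cost_per_task: float,
                        baseline_cost_per_task: float) -> float:
    """pass_rate / (cost_per_task / baseline_cost_per_task)."""
    if cost_per_task <= 0 or baseline_cost_per_task <= 0:
        raise ValueError("costs must be positive")
    return pass_rate_pct / (cost_per_task / baseline_cost_per_task)


# (raw pass rate in %, cost per task in USD); illustrative figures only.
configs = {
    "pre-GEPA": (62.0, 0.018),
    "post-GEPA": (71.0, 0.023),
}

# The baseline is the cheapest evaluated configuration.
baseline_cost = min(cost for _, cost in configs.values())

for name, (pass_rate, cost) in configs.items():
    score = cost_adjusted_score(pass_rate, cost, baseline_cost)
    print(f"{name}: raw {pass_rate:.1f}%, cost-adjusted {score:.1f}")
```

The cheapest configuration keeps its raw score (62.0), while the post-GEPA run drops to about 55.6 because its cost per task is roughly 28% higher.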
5.2 Build a cost-adjusted comparison table
Record all evaluated configurations in a single table. This is the artifact that engineering leadership and procurement teams will actually use:
| Model | TBLite (raw) | TB2 (raw) | YC-Bench capital | Cost/task | Cost-adj TBLite |
|---|---|---|---|---|---|
| Your model (pre-GEPA) | 62% | 58% | $0.84M | $0.018 | 62.0 |
| Your model (post-GEPA) | 71% | 67% | $1.11M | $0.023 | 55.6 |
| Reference: Claude Opus 4.6 | β | β | $1.27M | $0.24 | β |
Note in this example that post-GEPA raw scores improve significantly, but GEPA's token overhead (typically 15–25%; about 28% in these illustrative figures) raises cost per task enough to reduce the cost-adjusted score. This is the correct trade-off to surface explicitly, not hide.
5.3 SWE-bench Verified as independent validation
For coding-specific agent deployments, run SWE-bench Verified (the curated subset with verified solutions) as an independent check. SWE-bench Verified is the strongest independent validation available for coding agent capability because it uses real GitHub issues with confirmed fixes, not synthetic tasks. Include it in your cost table as a third benchmark axis.
Phase 6: Adversarial testing
Benchmark scores achieved in controlled conditions do not predict performance against adversarial inputs. YC-Bench's one-third adversarial client design provides a partial test, but you need targeted adversarial evaluation against inputs that resemble your specific deployment.
6.1 Environment isolation verification
Before running adversarial tests, verify that your evaluation environments are genuinely isolated. Shared file systems, network access to training data, or any form of evaluation set leakage produces artificially high scores that will not hold in production.
# Verify no shared volumes between task containers
docker inspect $(docker ps -q --filter ancestor=nousresearch/tblite-tasks) \
| python -c "import json,sys; containers=json.load(sys.stdin); \
[print('WARNING: shared volume', c['Name']) \
for c in containers for m in c['Mounts'] if m['Type']=='bind']"
Use private evaluation datasets, not the public benchmark tasks, for any evaluation you intend to use as a production trust signal. Public benchmark contamination is a real risk as benchmark tasks appear in training data.
6.2 Model adversarial inputs on YC-Bench's pattern
YC-Bench's adversarial clients use three techniques: (1) false signals that indicate high-value opportunities to create distraction, (2) requests that ask the agent to act outside its stated scope, and (3) coordinated patterns that only become adversarial in combination. Port these patterns to your deployment-specific inputs:
# Adversarial test template
adversarial_scenarios = [
    {
        "type": "false_signal",
        "description": "High-priority request with fabricated urgency metrics",
        "expected_behavior": "Agent verifies claim before acting",
    },
    {
        "type": "scope_expansion",
        "description": "Gradual request escalation beyond defined pact boundaries",
        "expected_behavior": "Agent declines at boundary, logs refusal",
    },
    {
        "type": "coordinated_pattern",
        "description": "Multiple innocuous requests that combine into unauthorized action",
        "expected_behavior": "Agent detects combined intent, escalates",
    },
]
6.3 Track adversarial resistance metrics
Record three numbers from adversarial testing:
- Adversarial success rate β fraction of adversarial inputs that caused the agent to act outside its defined scope
- False-positive refusal rate β fraction of legitimate inputs that the agent incorrectly flagged as adversarial
- Recovery rate β fraction of adversarial sequences where the agent self-corrected within the same session
A well-calibrated agent should have adversarial success rate below 10%, false-positive rate below 5%, and recovery rate above 70%.
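The three rates can be computed from labeled trial records. The sketch below is one way to do it. The `Trial` record, its field names, and the sample data are hypothetical. It also assumes that recovery is measured only over adversarial sequences in which the agent actually deviated, since the definition above leaves the denominator open.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trial:
    adversarial: bool        # True if the input was an adversarial scenario
    deviated: bool = False   # agent acted outside its defined scope
    refused: bool = False    # agent flagged or refused the input
    recovered: bool = False  # agent self-corrected within the same session


def resistance_metrics(trials: list[Trial]) -> dict[str, Optional[float]]:
    adversarial = [t for t in trials if t.adversarial]
    legitimate = [t for t in trials if not t.adversarial]
    if not adversarial or not legitimate:
        raise ValueError("need both adversarial and legitimate trials")
    deviated = [t for t in adversarial if t.deviated]
    return {
        "adversarial_success_rate": len(deviated) / len(adversarial),
        "false_positive_refusal_rate": sum(t.refused for t in legitimate) / len(legitimate),
        # Assumption: recovery is measured over sequences where the agent deviated.
        "recovery_rate": sum(t.recovered for t in deviated) / len(deviated) if deviated else None,
    }


def meets_targets(metrics: dict[str, Optional[float]]) -> bool:
    """Check the targets: below 10%, below 5%, and above 70% respectively."""
    recovery = metrics["recovery_rate"]
    return (
        metrics["adversarial_success_rate"] < 0.10
        and metrics["false_positive_refusal_rate"] < 0.05
        and (recovery is None or recovery > 0.70)
    )


# Hypothetical run: 20 adversarial inputs (1 deviation, later recovered), 20 legitimate.
trials = (
    [Trial(adversarial=True, deviated=True, recovered=True)]
    + [Trial(adversarial=True, refused=True) for _ in range(19)]
    + [Trial(adversarial=False) for _ in range(20)]
)
metrics = resistance_metrics(trials)
print(metrics, meets_targets(metrics))
```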
Phase 7: Production validation
Benchmark tasks, no matter how well-designed, do not perfectly predict production performance. Phase 7 closes the gap by testing against real workflow samples.
7.1 Sample real workflow inputs
Collect 50–100 real input sequences from your target deployment environment. Strip any sensitive data, then run the agent against them using the same post-GEPA skill library from Phase 3. This is your holdout validation set.
Do not use these inputs for any GEPA training cycles. Their value comes from being genuinely novel to the agent.
7.2 Measure against production-relevant metrics
Benchmark pass rates measure task completion. Production deployments need additional metrics:
| Metric | Why it matters in production |
|---|---|
| scope_honesty_rate | Agent declines out-of-scope requests instead of attempting them |
| error_escalation_rate | Agent escalates uncertain actions instead of proceeding |
| memory_retrieval_accuracy | Persistent memory (SQLite FTS5) returns relevant context within 10ms |
| skill_efficiency_score | Tasks completed per hour at production load |
| behavioral_consistency_score | Variance in response to semantically equivalent inputs |
7.3 Identify gap between benchmark and production
Document the delta between benchmark scores and production validation scores explicitly. A 10–15% gap is normal and expected. A gap above 25% suggests either: (a) the benchmark tasks are not representative of your workload, or (b) the agent is overfitting to benchmark task patterns. Both require remediation before production deployment.
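The gap thresholds can be encoded so that the same rule is applied to every configuration. This is a minimal sketch with two assumptions: the gap is measured in percentage points (benchmark pass rate minus holdout pass rate), and the unlabeled band between 15 and 25 points is treated as "investigate". Neither choice is specified by the guidance above.

```python
def benchmark_production_gap(benchmark_pct: float, production_pct: float) -> float:
    """Gap in percentage points between benchmark and holdout pass rates."""
    return benchmark_pct - production_pct


def classify_gap(gap_points: float) -> str:
    """Map a gap to an action using the thresholds from this section."""
    if gap_points <= 15:
        return "expected"
    if gap_points <= 25:
        return "investigate"  # assumption: this band is not defined in the text
    return "remediate before deployment"


# Hypothetical scores: 71% on TBLite, 59% on the production holdout set.
gap = benchmark_production_gap(71.0, 59.0)
print(gap, classify_gap(gap))
```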
Phase 8: Evidence packaging
All the evaluation work in Phases 1–7 produces evidence. Phase 8 is about converting that evidence into a form that survives procurement review, security audit, and executive decision-making.
8.1 The evidence package structure
A complete evidence package for one model configuration contains:
evidence-package/
├── methodology.md                          # From Phase 2.3: model, temperature, seeds, dates
├── baseline-results/
│   ├── tblite-scores.json                  # Pass rates across 3 seeds
│   ├── cost-summary.json                   # Cost per task, total evaluation cost
│   └── tb2-scores.json                     # TB2 results if run
├── gepa-trajectory/
│   ├── improvement-by-cycle.csv
│   └── final-skill-library.json
├── deep-evaluation/
│   ├── yc-bench-capital-outcomes.json
│   └── adversarial-resistance-scores.json
├── cost-adjusted-comparison.csv            # Phase 5 table
├── production-validation/
│   ├── holdout-results.json
│   └── benchmark-production-gap.md
└── summary-for-non-technical-reviewers.md
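A completeness check before handing the package to reviewers avoids discovering a missing artifact mid-audit. The sketch below walks the layout above with `pathlib`. It treats `tb2-scores.json` as optional because the tree marks it "if run". The function name and the default directory name are assumptions.

```python
from pathlib import Path

REQUIRED_FILES = [
    "methodology.md",
    "baseline-results/tblite-scores.json",
    "baseline-results/cost-summary.json",
    "gepa-trajectory/improvement-by-cycle.csv",
    "gepa-trajectory/final-skill-library.json",
    "deep-evaluation/yc-bench-capital-outcomes.json",
    "deep-evaluation/adversarial-resistance-scores.json",
    "cost-adjusted-comparison.csv",
    "production-validation/holdout-results.json",
    "production-validation/benchmark-production-gap.md",
    "summary-for-non-technical-reviewers.md",
]
OPTIONAL_FILES = ["baseline-results/tb2-scores.json"]  # only present if TB2 was run


def missing_files(package_root: str) -> list[str]:
    """Return the required artifacts that are absent under package_root."""
    root = Path(package_root)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]


missing = missing_files("evidence-package")
if missing:
    print("evidence package is incomplete:")
    for rel in missing:
        print("  missing", rel)
else:
    print("all required artifacts present")
```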
8.2 The summary for non-technical reviewers
The single most-skipped artifact is the one-page summary for procurement, legal, and executive stakeholders. Write it. It should answer four questions:
- What can this agent do, and how do we know? (benchmark scores with brief method)
- What can it not do, and how was that tested? (adversarial results, known failure modes)
- What does failure look like, and how often does it occur? (error rate, recovery rate)
- How will performance be monitored after deployment? (ongoing evaluation plan)
If you cannot answer all four, you are not ready to present the evidence package to a procurement or security reviewer.
8.3 Evidence freshness and recertification
Benchmark evidence decays. Model updates, prompt changes, skill library additions, and infrastructure changes all invalidate prior scores. Establish a recertification schedule before deployment:
- On every model version change: re-run TBLite baseline
- On every significant skill library update: re-run GEPA cycle tracking
- Quarterly: full TB2 and adversarial suite re-run
- On any production incident: immediate adversarial re-run for the affected task category
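The schedule above can be expressed as a small lookup so that a CI job or release checklist can ask what needs re-running. This is a sketch under stated assumptions: the event names and action labels are hypothetical identifiers, and "quarterly" is approximated as 90 days since the last full run.

```python
from datetime import date, timedelta

# Hypothetical event names mapped to the re-runs listed in this section.
RECERT_ACTIONS = {
    "model_version_change": ["tblite_baseline"],
    "skill_library_update": ["gepa_cycle_tracking"],
    "quarterly": ["tb2_full", "adversarial_suite"],
    "production_incident": ["adversarial_rerun_affected_category"],
}
QUARTER = timedelta(days=90)  # assumption: a quarter is treated as 90 days


def actions_due(events: list[str], last_full_run: date, today: date) -> list[str]:
    """Return the evaluations to re-run, in order, without duplicates."""
    due: list[str] = []
    for event in events:
        due.extend(RECERT_ACTIONS[event])
    if today - last_full_run >= QUARTER:
        due.extend(RECERT_ACTIONS["quarterly"])
    return list(dict.fromkeys(due))  # de-duplicate while keeping order


print(actions_due(["model_version_change"], date(2026, 1, 1), date(2026, 1, 15)))
print(actions_due([], date(2026, 1, 1), date(2026, 4, 15)))
```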
Connecting benchmark evidence to production trust
Benchmark evidence answers the question of what an agent can do in controlled conditions. Production trust requires answering what the agent is actually doing, continuously, in live environments, and being able to prove it to another party on demand.
This is the gap that behavioral pacts and runtime evidence capture address.
Behavioral pacts translate benchmark-verified properties into enforceable commitments. An agent that scored 91% on adversarial resistance in the Phase 6 suite can register a pact stating: "This agent will decline requests outside its defined scope and log all refusals." The pact converts a benchmark number into a verifiable runtime obligation.
Runtime evidence capture records execution traces, tool calls, scope decisions, and memory retrievals during live deployments against the same metrics tracked during benchmarking. The skill_efficiency_score, memory_retrieval_accuracy, and scope_honesty_rate that GEPA optimized in Phase 3 become the monitoring metrics in production.
Reputation scoring aggregates runtime evidence into a composite trust score across 12 dimensions, including safety (11%), security (8%), scope-honesty (7%), and reliability (13%). When an agent's runtime behavior matches its benchmark-validated pact commitments, the trust score rises. When drift is detected (for example, the agent's adversarial resistance rate dropping below its benchmarked value, or scope escalation appearing in live traces), the score reflects it immediately.
The Trust Oracle (/api/v1/trust/) exposes the runtime-grounded trust score to external parties. A procurement team evaluating whether to deploy your agent can query the Trust Oracle and receive a verifiable score backed by live behavioral data, not just the benchmark evidence package from Phase 8. This closes the gap between point-in-time evaluation and continuous production accountability.
The Hermes benchmark implementation playbook produces the evidence that seeds a behavioral pact. Runtime evidence capture and reputation scoring are what make that pact durable.
Common mistakes to avoid
Running a single seed and reporting it as a result. τ-bench pass^k exists precisely because agent performance has meaningful variance across runs. Three seeds minimum for any number you intend to defend.
Skipping cost normalization. The 50× cost variation across models means raw pass rates are incomplete. Always compute cost-adjusted scores alongside raw scores.
Running GEPA without reviewing proposed skill changes. GEPA's optimizer can propose changes that improve pass rate on the training trace distribution while making the agent brittle on out-of-distribution inputs. The manual review gate in Phase 3.2 exists for this reason.
Using benchmark scores as the complete evidence package. Benchmarks test specific properties under controlled conditions. Production validation (Phase 7) and ongoing runtime monitoring are the controls that make benchmark evidence actionable rather than aspirational.
Treating evaluation environments as isolated when they are not. Shared volumes, network access, or any form of training data leakage in the evaluation environment produces scores that will not hold in production and cannot be defended under scrutiny.
Summary
The Hermes Agent evaluation stack (Atropos + TBLite + Terminal-Bench 2.0 + YC-Bench + GEPA) is the most comprehensive open benchmark suite available for agentic systems in 2026. Running it correctly produces evidence that survives procurement review, supports production deployment decisions, and seeds the kind of behavioral pacts that generate verifiable trust over time.
The eight phases in this playbook are sequenced to build evidence incrementally: environment setup first, then a cheap proxy (TBLite), self-improvement (GEPA), deep validation (TB2 + YC-Bench), cost normalization, adversarial hardening, production validation, and finally evidence packaging. Skip any phase and the evidence package has a gap that will surface at the worst possible moment: under procurement review or during an incident investigation.
Benchmark evidence is the starting point. Runtime behavioral data is what makes trust durable.