Within seven weeks of release it accumulated over 95,600 GitHub stars, making it one of the fastest-growing open-source agent frameworks in the space. The project lives at github.com/NousResearch/hermes-agent with documentation at hermes-agent.nousresearch.com.
Core Architecture: Three Subsystems
Hermes Agent is built on three interlocking subsystems:
1. Three-Level Memory
- Session memory: current conversation context
- Persistent memory: long-term facts and preferences stored in SQLite with FTS5 full-text search; retrieval latency of approximately 10ms over 10,000+ documents
- Skill memory: solution patterns learned from past task completions, retrieved at inference time to accelerate future similar tasks
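The persistent-memory tier can be illustrated with a minimal sketch using Python's built-in `sqlite3` module and an FTS5 virtual table. The table name, columns, and the `remember`/`recall` helpers are assumptions made for illustration, not Hermes Agent's actual schema or API, and the sketch assumes the local SQLite build has FTS5 enabled.

```python
import sqlite3

# Hypothetical schema: a single FTS5 virtual table holding free-text memories.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE memory USING fts5(kind, content)")


def remember(kind: str, content: str) -> None:
    """Store one long-term fact or preference."""
    conn.execute("INSERT INTO memory (kind, content) VALUES (?, ?)", (kind, content))


def recall(query: str, limit: int = 5) -> list[str]:
    """Return matching memories, ordered by FTS5's built-in relevance rank."""
    rows = conn.execute(
        "SELECT content FROM memory WHERE memory MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return [row[0] for row in rows]


remember("preference", "User prefers answers formatted as short bullet lists")
remember("fact", "Production database runs PostgreSQL 16 on port 5433")
remember("fact", "Deploys happen every Tuesday after the staging check")

print(recall("database"))
```

Full-text search keeps retrieval inside the database engine, so no separate vector store is needed for keyword-style recall.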
2. Skills System
Three tiers of skills ship with the framework: bundled skills (included by default), optional skills (official but specialized), and community skills available through the Hermes Skills Hub. Skills are the unit of learned competency: each skill encapsulates a reusable solution pattern the agent has refined through execution.
3. GEPA Self-Improvement Loop
The most technically significant component. GEPA (Genetic-Pareto Prompt Evolution) analyzes execution traces to understand why tasks fail, then generates targeted improvements to tool descriptions, system prompts, skill implementations, and code. This loop runs automatically and is described in detail in the ICLR 2026 Oral paper "Reflective Prompt Evolution Can Outperform Reinforcement Learning" by Lakshya Agrawal et al.
Hermes Agent ships with 40+ built-in tools: web search, terminal commands, file operations, browser automation, vision analysis, image generation, text-to-speech, code execution, subagent delegation, memory operations, task planning, cron scheduling, multi-model reasoning, and unified access to 200+ models via OpenRouter. This breadth is intentional: the benchmark system needs diverse tool coverage to evaluate real-world agentic capability rather than narrow scripted behaviors.
The Benchmarking Architecture: Atropos Integration
Hermes Agent's evaluation system is built on top of Atropos, Nous Research's Language Model RL Environments framework for collecting and evaluating LLM trajectories. This is the same infrastructure used for training, which means the benchmarking and training pipelines are unified rather than separate systems.
The Atropos integration enables three workflows in one environment definition:
- Benchmarks: evaluate models on standardized agentic tasks with reproducible scoring
- RL Training: train language models on multi-turn agentic tasks using GRPO (Group Relative Policy Optimization)
- Data Generation: generate SFT (Supervised Fine-Tuning) training data from agent rollouts
This unification is architecturally significant: an agent that performs well on a Hermes benchmark can immediately have that performance trajectory used to further fine-tune the underlying model. The flywheel between evaluation and improvement is not external; it is intrinsic to the framework design.
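For readers unfamiliar with GRPO, its core idea is to score each rollout relative to the other rollouts sampled for the same task. The sketch below shows only that normalization step under simple assumptions (population standard deviation, a small epsilon for stability); it is not Atropos code, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each rollout's reward against its own group (same task)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout in the group scored the same.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of one task: two succeeded (1.0), two failed (0.0).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Successful rollouts get a positive advantage and failed ones a negative advantage, without needing a separately trained value model.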
The framework exposes Prometheus metrics and logs to Weights & Biases (wandb) during evaluation runs. Core KPIs tracked:
| Metric | Description |
|---|---|
| skill_efficiency_score | Tasks completed per hour per agent instance |
| memory_retrieval_accuracy | Percentage of relevant memories successfully retrieved |
| self_modification_success_rate | Ratio of accepted vs. rejected auto-generated patches |
| Success rate by task type | Broken down by tool category and domain |
| Improvement cycles needed | How many GEPA iterations to reach threshold performance |
| Token cost per execution | Critical for comparing cost-adjusted performance |
Important overhead note: The reflection and optimizer modules add approximately 15–25% extra token consumption compared to a standard non-self-improving agent. This cost is real and should be factored into any total-cost-of-ownership analysis.
Nous Research's internal benchmark data shows that agents with 20+ self-created skills complete similar future research tasks 40% faster (less token spend and wall-clock time to reach equivalent output) than fresh instances starting without accumulated skill memory. This figure is from internal evaluations backed by the GEPA paper methodology.
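To see how the overhead and the speedup interact, here is a rough back-of-the-envelope sketch. It assumes the 40% figure applies directly to token spend and that the reflection overhead multiplies whatever tokens remain; both are simplifying assumptions, and the function name is hypothetical.

```python
def relative_token_cost(overhead: float, savings: float) -> float:
    """Token cost of a self-improving run relative to a plain agent (1.0 = equal)."""
    return (1.0 + overhead) * (1.0 - savings)


for overhead in (0.15, 0.25):
    cost = relative_token_cost(overhead, 0.40)
    print(f"overhead {overhead:.0%}: relative cost {cost:.2f}")

# Savings needed just to cancel out the overhead: 1 - 1 / (1 + overhead).
break_even = {o: 1.0 - 1.0 / (1.0 + o) for o in (0.15, 0.25)}
print(break_even)
```

Under these assumptions the self-improving agent spends roughly 0.69 to 0.75 of the baseline tokens once the skill library is warm, and any savings above about 13% to 20% covers the overhead.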
TBLite: The 100-Task Fast Evaluation Track
TBLite is implemented as a thin subclass of the Terminal-Bench 2.0 (TB2) evaluation environment: it reuses the same task format and scoring but runs its own difficulty-calibrated set of 100 tasks. It was created by the OpenThoughts Agent team (Snorkel AI + Bespoke Labs) specifically as a faster proxy for the full TB2 suite.
Why this matters for practitioners:
- TB2's full 89-task suite takes significant time and compute to run end-to-end
- TBLite enables faster iteration cycles during model development
- The 100 tasks were calibrated against the difficulty distribution of the full benchmark, so TBLite scores serve as a proxy for TB2 results
- All 100 task environments are available as Docker images on DockerHub for reproducible local evaluation
Task categories within TBLite mirror Terminal-Bench 2.0's scope: CLI and repository workflows, configuring legacy software systems, reimplementing methods from research papers, and general software engineering tasks requiring multi-step reasoning and tool use.
YC-Bench: The Long-Horizon Strategic Benchmark
YC-Bench (arXiv: 2604.01212, published April 2026 by Collinear AI, GitHub: collinear-ai/yc-bench) is the most methodologically distinctive benchmark integrated into Hermes Agent.
Instead of short-horizon tasks, YC-Bench puts an agent in the role of CEO of an AI startup and runs it for a simulated one-year horizon spanning hundreds of decision turns, starting with $200,000 in capital. The agent operates entirely via CLI against a SQLite-backed discrete-event simulation: no web access, no external APIs, just strategic decision-making under resource constraints.
Four Business Domains
The simulation has four specialized domains that must be balanced:
- Research: fundamental capability development
- Inference: deploying and serving models
- Data: training data acquisition and curation
- Training: model training runs
Decay mechanics penalize over-specialization: all four domains need sustained investment. An agent that goes all-in on inference while neglecting research will see its competitive position erode over time. This mirrors real startup strategy challenges in a way that single-task benchmarks cannot capture.
Adversarial Client Design
A critical feature: roughly one-third of clients in the simulation are adversarial and designed to fail. They appear indistinguishable from legitimate clients at first interaction and must be identified through behavioral signals over time. An agent that accepts every client engagement will consume resources on unrecoverable relationships.
This adversarial design is intentional and important: most agent benchmarks assume cooperative task environments. YC-Bench explicitly tests whether an agent can detect bad-faith actors, a capability that matters enormously in real-world deployment.
Benchmark Results: 12 Models, 3 Seeds Each
Collinear AI ran YC-Bench across 12 models with 3 seeds each (36 runs in total, 3 per model, for statistical reliability). Results are deterministic given a fixed seed.
| Model | Average Final Funds |
|---|---|
| Claude Opus 4.6 | $1,270,000 |
| GLM-5 | $1,210,000 |
| Starting capital | $200,000 |
Only 3 of 12 models consistently exceeded the $200,000 starting capital. The GLM-5 result is particularly notable: it achieves comparable performance to Claude Opus 4.6 at 11× lower inference cost, making it the dominant choice on a cost-adjusted basis for long-horizon workloads.
The $1.27M vs $200K gap illustrates what the benchmark is measuring: it's not whether an agent can complete a task, but whether it can compound value over hundreds of sequential decisions while managing adversarial inputs and resource constraints simultaneously.
Terminal-Bench 2.0: The Gold Standard CLI Benchmark
Terminal-Bench 2.0 (arXiv: 2601.11868, released November 2025) is the full benchmark that TBLite derives from. It's the most credible current benchmark for CLI-capable agents and directly relevant to evaluating Hermes Agent's core use cases.
Methodology
- 89 tasks across three categories: legacy system configuration, research paper reimplementation, and general software engineering
- Every task manually verified by 3 independent human reviewers, with no automatic test suite generation
- All solutions containerized in Docker for fully reproducible evaluation
- Leaderboard tracked at tbench.ai/leaderboard/terminal-bench/2.0 and vals.ai/benchmarks/terminal-bench-2
Current Leaderboard (April 2026)
| Model | Score |
|---|---|
| Claude Mythos Preview | 82.0% |
| GPT-5.3 Codex | 77.3% |
| GPT-5.4 | 75.1% |
27 models have been evaluated as of April 2026. No model has so far completed all 89 tasks; the benchmark was specifically designed so that "finishing" requires broad CLI and reasoning capability rather than narrow pattern matching.
Terminal-Bench 1.0 launched in May 2025 and quickly became a standard. TB2.0 raised the bar significantly by replacing auto-generated test suites with human-verified task specifications, which substantially reduces the "tricks" that can inflate scores without genuine task completion.
GEPA: The Self-Improvement Engine
GEPA (Genetic-Pareto Prompt Evolution) is the self-improvement component that makes Hermes Agent's benchmark approach fundamentally different from static model evaluation. It's described in detail in the ICLR 2026 Oral paper "Reflective Prompt Evolution Can Outperform Reinforcement Learning" by Lakshya Agrawal et al., with the implementation at github.com/NousResearch/hermes-agent-self-evolution.
How GEPA Works
- Trace analysis: GEPA reads execution traces to understand why specific tasks failed, not just that they failed
- Targeted proposals: Based on failure analysis, it proposes targeted improvements to tool descriptions, system prompts, skill implementations, and code
- Genetic selection: Improvements are evaluated across a Pareto frontier (balancing quality, efficiency, and generalization) using genetic selection
- Integration with DSPy: GEPA is integrated into the DSPy framework and works with as few as 3 training examples
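The Pareto-selection step described above can be sketched as a plain non-dominated filter. This is a minimal illustration of the general technique, not GEPA's actual implementation; the `Candidate` type, its three score fields, and the example values are hypothetical.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    quality: float         # higher is better
    efficiency: float      # higher is better
    generalization: float  # higher is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is at least as good as b on every score and better on one."""
    pairs = list(zip(a[1:], b[1:]))
    return all(x >= y for x, y in pairs) and any(x > y for x, y in pairs)


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


candidates = [
    Candidate("prompt_a", 0.80, 0.60, 0.70),
    Candidate("prompt_b", 0.75, 0.90, 0.65),
    Candidate("prompt_c", 0.70, 0.55, 0.60),  # worse than prompt_a on all three
    Candidate("prompt_d", 0.60, 0.95, 0.80),
]
front = pareto_front(candidates)
print([c.name for c in front])
```

Keeping the whole front, rather than a single best candidate, preserves variants that trade one objective for another, which gives the next round of mutation more diverse starting points.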
| Method | Improvement | Rollouts Required |
|---|---|---|
| GRPO (RL baseline) | Baseline | ~N rollouts |
| GEPA | +6% avg, up to +20% specific tasks | ~N/35 rollouts |
The 35× reduction in rollout requirements is the critical advantage. RL-based training requires massive amounts of environment interaction to converge. GEPA achieves superior results by reasoning about why things fail rather than blindly exploring through trial and error.
On the MATH benchmark: GEPA-optimized programs reached 93% accuracy compared to 67% with basic ChainOfThought, a 26-point gain. This level of improvement from prompt evolution alone, without weight updates, is significant and was part of what earned the paper its ICLR 2026 Oral designation.
GEPA was first shipped in Hermes Agent v0.8.0 on April 8, 2026.
The Full Agent Benchmark Landscape: How Hermes Compares
Understanding Hermes Agent's benchmark approach requires understanding the full landscape of AI agent evaluation. Here is a comprehensive breakdown of every major benchmark active in 2025–2026.
SWE-bench / SWE-bench Verified
Paper: arXiv 2310.06770 | Leaderboard: swebench.com
SWE-bench evaluates agents on 2,294 real GitHub issues from popular Python repositories. The agent must submit a code patch that makes failing tests pass.
- SWE-bench Verified (500 human-verified issues) is the most trusted signal; the full 2,294-task set has known noise from auto-generated test suites
- April 2026 leaders: Claude Opus 4.7: 87.6%, GPT-5.3-Codex: 85.0%, Claude Opus 4.5: 80.9%
- SWE-bench Pro (harder subset): Claude Opus 4.7 leads at 64.3%
- Known limitation: 7.7% of SWE-bench-Lite and 5.2% of Verified have test validity issues, meaning incorrect patches can still pass the test suite
SWE-bench is the gold standard for coding-specific agent evaluation and is directly relevant for teams deploying agents on software engineering workflows.
GAIA
Paper: arXiv 2311.12983, Meta AI, November 2023 | Leaderboard: hal.cs.princeton.edu/gaia
GAIA uses 466 real-world questions requiring multi-step reasoning, multi-modality, web browsing, and tool use. Answers are short and unambiguous (cheap to verify automatically).
- Three difficulty levels
- Human performance: 92%
- GPT-4 with plugins at release: 15%; GPT-4 did not exceed 30% even on the easiest questions
- Performance has improved substantially since 2023 as agent frameworks matured
Critical limitation: The benchmark now has documented vulnerabilities. Approximately 98% of tasks are exploitable via public answers on HuggingFace and normalization collisions. This means leaderboard scores above a certain threshold should be interpreted with significant caution about whether they reflect genuine reasoning ability.
AgentBench
Paper: arXiv 2308.03688, Tsinghua University (THUDM), ICLR 2024 | GitHub: THUDM/AgentBench
AgentBench evaluates LLMs as agents across 8 distinct environments: OS interaction, database operations, knowledge graph navigation, digital card games, web browsing, and three additional environments created specifically for the benchmark.
- 29 API-based and open-source LLMs tested at release
- Primary finding: long-term reasoning, multi-step decision-making, and precise instruction following, not raw knowledge, are the dominant bottlenecks separating capable from incapable agents
- Best models (GPT-4-based) significantly outperform all open-source alternatives at release
AgentBench's multi-environment design makes it useful for identifying which types of reasoning an agent struggles with specifically, rather than producing a single aggregate score.
WebArena
Paper: arXiv 2307.13854, ICLR 2024 | GitHub: web-arena-x/webarena
WebArena evaluates agents on 812 long-horizon web tasks across 4 self-hosted domains: e-commerce (OneStopShop), social forums (Reddit-like), collaborative software development (Gitlab), and content management (CMS).
- Tasks are derived from 241 templates to reduce memorization
- Human success rate: 78.24%
- Original best GPT-4-based agent: 14.41%, a massive gap from human performance at release
- By 2025–2026: AI agents reached approximately 60% success rate
- Known limitation: Overestimates performance by approximately 5.2% due to string matching evaluation issues
- Extensions: VisualWebArena (visual tasks), VideoWebArena (video understanding), WebChoreArena (complex multi-app workflows)
The 60% success rate in 2025–2026 vs. 14% in 2023 illustrates the pace of capability improvement in web agents specifically.
OSWorld
Paper: arXiv 2404.07972, NeurIPS 2024 | GitHub: xlang-ai/OSWorld
OSWorld benchmarks multimodal agents on 369 tasks across real desktop applications on Ubuntu, Windows, and macOS. Agents operate via screenshot + keyboard/mouse actions, with no API access to application state.
- Human baseline: 72.36%
- Best model at NeurIPS 2024 publication: 12.24%, the largest human-AI gap of any major benchmark at release
- Claude 3.7 (February 2025): ~28% (100 steps budget)
- Agent S2 with Claude 3.7 (2025): 34.5% (50 steps)
- OSAgent (October 2025): 76.26%, the first system to exceed the human baseline
- Known exploitation vulnerability: 73% score achievable via VM state manipulation and public gold files
OSWorld is the hardest benchmark for general desktop automation and the best signal for agents that need to operate in unrestricted computer environments.
Ο-bench (tau-bench)
Paper: arXiv 2406.12045, Sierra Research | GitHub: sierra-research/tau-bench
Ο-bench simulates dynamic customer service conversations requiring the agent to follow domain policies, use APIs, and satisfy user requests across multiple turns. Unlike most benchmarks, it explicitly models the user as an adversarial agent who may change their mind, provide incomplete information, or test the agent's policy compliance.
- Two domains: retail and airline customer service
- Novel metric: pass^k, which measures reliability over k repeated trials on the same task (not just single-pass accuracy). This is critical because production customer service agents must be reliable, not merely capable on average.
- Best model at release (gpt-4o): <50% single-pass success; pass^8 <25% in retail
- Known limitation: A trivial agent with no domain knowledge can pass 38% of tasks, so the benchmark has a non-trivial floor
- τ²-bench (2025): extends to dual-control environments where both agent and user are actively controlled
The pass^k metric is τ-bench's most important methodological contribution. A 70% single-pass success rate sounds impressive until you realize that pass^8 (the probability of succeeding on all 8 of 8 attempts) for an agent with independent 70% trials is only ~5.8% (0.7^8 ≈ 0.058). Production reliability requires consistency, not just average performance.
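The pass^k arithmetic can be checked with a short sketch. The first function assumes independent trials with a fixed success probability; the second is a combinatorial estimate from recorded trials of a single task (the chance that k trials drawn from the recorded ones all succeeded). Function names are illustrative and are not taken from the tau-bench codebase.

```python
from math import comb


def pass_hat_k_independent(p: float, k: int) -> float:
    """Probability that k independent trials all succeed, given success rate p."""
    return p ** k


def pass_hat_k_estimate(successes: int, trials: int, k: int) -> float:
    """Chance that k trials drawn without replacement from `trials` all succeeded."""
    if k > trials:
        raise ValueError("k cannot exceed the number of recorded trials")
    # comb(successes, k) is 0 when fewer than k trials succeeded.
    return comb(successes, k) / comb(trials, k)


print(f"{pass_hat_k_independent(0.7, 8):.4f}")   # 70% agent, 8 attempts
print(f"{pass_hat_k_estimate(7, 10, 2):.4f}")    # 7 of 10 recorded successes, k=2
```

Averaging the per-task estimate over all tasks gives a benchmark-level figure; the key property is that it falls quickly as k grows unless the agent is nearly always right.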
ToolBench
Paper: arXiv 2307.16789, ICLR 2024 Spotlight | GitHub: OpenBMB/ToolBench
ToolBench evaluates tool-use capability across 16,464 real RESTful APIs from RapidAPI Hub, spanning 49 functional domains. It's the largest-scale API benchmark by number of APIs covered.
- 126,000+ instruction–solution path pairs in training data
- Three evaluation splits: I1 (single-tool), I2 (intra-category multi-tool), I3 (intra-collection multi-tool)
- Evaluation metric: ToolEval (LLM-based evaluator for pass rate and win rate)
- ToolLLaMA fine-tuned on this data achieves performance comparable to ChatGPT on in-distribution tools
- Limitation: evaluating real APIs introduces external dependency issues (rate limits, API deprecation)
The Benchmark Vulnerability Problem: What Berkeley RDI Found
In early 2026, researchers from Berkeley RDI published findings that every major agent benchmark can be exploited to achieve near-perfect scores without solving a single task. This is not a theoretical concern; it is a documented empirical finding with specific exploitation vectors for each benchmark.
| Benchmark | Exploitation Method | Exploitable Score |
|---|---|---|
| WebArena | Config leakage, DOM injection, prompt injection | ~100% |
| GAIA | Public answers on HuggingFace + normalization collisions | ~98% |
| OSWorld | VM state manipulation + public gold files | 73% |
| SWE-bench | Agent writes state into the environment that the evaluator later reads | Partial |
Source: rdi.berkeley.edu/blog/trustworthy-benchmarks-cont
The practical implications for anyone making decisions based on leaderboard scores:
- Leaderboard rank does not imply tamper-proof evaluation. Any sufficiently motivated team can inflate scores by exploiting evaluation infrastructure rather than improving agent capability.
- Controlled, adversarial evaluation environments are necessary for production trust. Benchmarks where the task descriptions and gold answers are publicly available online, and where the evaluation infrastructure shares state with the agent, cannot be treated as authoritative.
- Methodology disclosure matters more than score. When a vendor publishes a benchmark result, the critical questions are: Was the evaluation infrastructure isolated? Was the evaluation dataset private? Were independent evaluators used? Without this information, the score is marketing, not measurement.
This is one reason Hermes Agent's approach of integrating Atropos (with containerized, reproducible evaluation environments) and coupling evaluation to GEPA improvement cycles is architecturally sound: the evaluation cannot be gamed by leaking gold answers when the agent is actively evolving against the task distribution.
What Hermes Agent Benchmark Scores Actually Tell You
Interpreting benchmark scores correctly requires understanding what they measure and what they miss. Here is an honest accounting:
What They Measure Well
- Task completion rate on specific distributions: If TBLite tasks match your actual deployment use cases, TBLite scores are meaningful
- Relative ranking: Model A outperforming Model B by a consistent margin across multiple benchmark runs is a reliable signal
- Improvement trajectory: Whether a self-improving agent (via GEPA) is getting better over time on the evaluation distribution
- CLI and coding capability (Terminal-Bench 2.0, TBLite): Among the best benchmarks for software engineering agent evaluation
What They Miss
- Cost-adjusted performance: Most benchmarks report task success rate without normalizing for API cost. Agents making up to 2,000 API calls per task, with 50× cost variations ($0.10–$5.00) for similar accuracy levels, create dramatically different unit economics
- Reliability at scale: Single-pass success rates don't predict pass^k behavior (see τ-bench above). An agent with 80% single-pass success on a benchmark would, assuming independent trials, succeed on all of 10 consecutive trials of the same task only about 11% of the time (0.8^10 ≈ 0.107)
- Complex internal data access: Benchmarks use public or synthetic data. Real business workflows require agents to navigate complex internal databases, permission systems, and organizational context
- Adversarial robustness: Most benchmarks assume cooperative environments. YC-Bench's adversarial client design is unusual and valuable precisely because it tests detection of bad-faith inputs
- Human-in-the-loop performance: Benchmarks typically assume fully autonomous operation. Production workflows rarely do
The Hermes Model Family: Not the Same as the Agent Framework
A common source of confusion: "Hermes" refers to two related but distinct Nous Research products.
Hermes models are fine-tuned LLMs (not an agent framework). Key releases:
| Model | Base | Paper | Release |
|---|---|---|---|
| Hermes 3 | Llama 3.1 | arXiv 2408.11857 | August 2024 |
| Hermes 4 (405B) | Llama 3.1-405B | arXiv 2508.18255 | August 25, 2025 |
| Hermes 4 (14B) | Qwen3 14B | Same | Same |
Hermes 4 introduced hybrid reasoning + large-scale synthetic data generation (~5 million training samples, ~19 billion tokens). It's optimized for function calling, tool use, and instruction following: the model-level capabilities that make the agent framework effective.
The Hermes Agent framework (evaluated by Hermes benchmarks) is the agent system built on top of Hermes and other models via OpenRouter. A Hermes Agent instance can use Hermes 4, Claude, GPT-5, or any of 200+ other models as its underlying reasoning engine. The benchmark evaluates the agent system, not just the model.
Other Research Using the "Hermes" Name
Two other research papers use "Hermes" in distinct contexts:
HERMES for Mathematical Reasoning (arXiv: 2511.18760, November 2025):
- Acronym: Hybrid Agent for Reasoning in Mathematics with Neuro-Symbolic Lean4 verification
- First tool-assisted agent that interleaves informal reasoning with formally verified proof steps in Lean4
- Four modules: LLM reasoning generator → Lean translator → symbolic prover → feedback module
- Evaluated on four mathematical reasoning benchmarks
- Not related to Nous Research or the Hermes Agent framework
Hermes for Autonomous Networks (arXiv: 2411.06490, November 2024):
- From Huawei Paris Research Centre, Khalifa University, CUHK, Yale
- LLM-chained framework for Network Digital Twin construction in telecom environments
- Uses GPT-4o; achieves up to 82.5% success on diverse network tasks vs. 5% for CoT baseline on complex tasks
- Not related to Nous Research or the Hermes Agent framework
Implementation Guide: Using Hermes Agent Benchmarks in Practice
For teams wanting to evaluate AI agents using Hermes Agent's benchmark infrastructure:
Step 1: Choose Your Benchmark Track
| Use Case | Recommended Track |
|---|---|
| Fast iteration / development | TBLite (100 tasks, Docker-containerized) |
| Production CLI agent evaluation | Terminal-Bench 2.0 (89 tasks, human-verified) |
| Long-horizon strategic reasoning | YC-Bench (CEO simulation, hundreds of turns) |
| Full self-improvement assessment | Run TBLite before and after N GEPA cycles |
Step 2: Establish a Cost-Adjusted Baseline
Always record both success rate and total API cost per benchmark run. A model achieving 80% on TBLite at $0.10/task is categorically different from one achieving 82% at $3.50/task. Most leaderboards omit cost data; you must instrument it yourself.
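One simple way to put the two example models above on a common footing is cost per successful task. The sketch below is a minimal illustration; the `RunSummary` type and the model names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    name: str
    success_rate: float       # fraction of tasks solved, 0.0 to 1.0
    cost_per_task_usd: float  # average API spend per attempted task

    @property
    def cost_per_success_usd(self) -> float:
        """Average spend needed to obtain one solved task."""
        if self.success_rate == 0:
            return float("inf")
        return self.cost_per_task_usd / self.success_rate


cheap = RunSummary("model_a", success_rate=0.80, cost_per_task_usd=0.10)
pricey = RunSummary("model_b", success_rate=0.82, cost_per_task_usd=3.50)

for run in (cheap, pricey):
    print(f"{run.name}: ${run.cost_per_success_usd:.3f} per solved task")
```

On these numbers the two-point accuracy gain costs roughly 34 times more per solved task, which is the kind of trade-off a success-rate-only leaderboard hides.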
Step 3: Run Multiple Seeds
Follow the YC-Bench methodology: minimum 3 seeds per model. Stochastic agents can have substantial run-to-run variance. A single-run leaderboard position is not reproducible evidence.
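A minimal sketch of seed aggregation follows, using only the standard library. The seed numbers and scores are made-up placeholders, and `summarize_seeds` is a hypothetical helper, not part of any benchmark harness.

```python
from statistics import mean, stdev


def summarize_seeds(scores_by_seed: dict[int, float]) -> dict[str, float]:
    """Report the spread across seeds instead of a single headline number."""
    values = list(scores_by_seed.values())
    if len(values) < 3:
        raise ValueError("run at least 3 seeds before reporting a score")
    return {
        "mean": mean(values),
        "stdev": stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


# Placeholder success rates for three seeds of the same model.
runs = {0: 0.62, 1: 0.70, 2: 0.66}
summary = summarize_seeds(runs)
print(summary)
```

Reporting the minimum and maximum alongside the mean makes it obvious when two models' score ranges overlap and their ranking is not yet settled.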
Step 4: Evaluate Self-Improvement Trajectory, Not Just Point-in-Time Score
The unique value of Hermes Agent's benchmark approach is measuring improvement over time via GEPA. The right evaluation asks:
- What is the baseline score on TBLite at initialization?
- What is the score after 10 GEPA improvement cycles?
- What is the rate of improvement per cycle?
- Does improvement on TBLite transfer to Terminal-Bench 2.0?
A static leaderboard score tells you where an agent is. The improvement trajectory tells you where it's going.
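The trajectory questions above reduce to simple differences over a series of scores. The sketch below assumes one benchmark score is recorded after each improvement cycle; the score values are placeholders and the helper names are hypothetical.

```python
def improvement_per_cycle(scores: list[float]) -> list[float]:
    """Score change contributed by each cycle (scores[0] is the baseline)."""
    return [later - earlier for earlier, later in zip(scores, scores[1:])]


def average_rate(scores: list[float]) -> float:
    """Mean score change per cycle over the whole trajectory."""
    if len(scores) < 2:
        raise ValueError("need a baseline and at least one later score")
    return (scores[-1] - scores[0]) / (len(scores) - 1)


# Placeholder scores: baseline followed by four improvement cycles.
tblite_scores = [0.40, 0.44, 0.47, 0.47, 0.52]

print([round(d, 2) for d in improvement_per_cycle(tblite_scores)])
print(round(average_rate(tblite_scores), 3))
```

Tracking per-cycle deltas rather than only the endpoints shows plateaus (such as the flat third cycle here), which is where a transfer check on a second benchmark is most informative.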
Step 5: Test Adversarial Robustness Separately
Mainstream benchmarks are cooperative. Before production deployment, run the agent against adversarial inputs modeled on YC-Bench's one-third-adversarial-client design: inputs that appear legitimate but are designed to waste resources or manipulate the agent into policy violations.
Step 6: Compare Against Your Actual Workflow
Benchmark tasks are proxies. The final validation is always: does this agent perform on the actual task distribution you're deploying it against? Use benchmark scores to narrow your candidate pool, then evaluate surviving candidates on real (sanitized) examples from your workflow.
The Trust Gap: From Benchmark to Production
Even a benchmark result you trust fully (obtained with reproducible methodology, multiple seeds, and independent evaluation) leaves substantial gaps when you try to make a production deployment decision.
The gaps are structural, not about benchmark quality:
Gap 1: Task distribution mismatch. No public benchmark covers your organization's internal systems, databases, policies, and edge cases. The closer your use case is to benchmark tasks, the more predictive the score. The further away, the less.
Gap 2: No behavioral pacts. Benchmarks measure past performance on known distributions. They make no promise about future behavior on novel inputs. An agent that achieves 90% on a benchmark has no formal obligation to maintain that performance in production, or to notify you when it degrades.
Gap 3: No consequence accountability. If a benchmark-validated agent causes a production incident, the benchmark score neither helps you diagnose the failure nor gives you recourse. The benchmark evaluation chain is severed from the operating chain.
Gap 4: No reputation continuity. An agent that fails three times in a row on similar tasks should be treated differently than one with a clean record. Benchmark snapshots don't accumulate reputation over time.
This is the problem Armalo is designed to close. By connecting agent benchmarks to behavioral pacts (what the agent promises to do), runtime evidence (what it actually did), and reputation scoring (the accumulated record of both), it becomes possible to make deployment decisions that survive skeptical review from procurement, security, and executive stakeholders, not just from the engineering team that ran the benchmark.
How Armalo Extends Agent Benchmark Evidence
Armalo's trust layer adds four properties to agent benchmark data that benchmarks alone cannot provide:
1. Pacts: Formalize what the agent promises: success rate thresholds, latency targets, cost ceilings, behavioral constraints. A pact is a machine-readable contract that the agent signs and is evaluated against on every run.
2. Runtime Evidence: Capture what actually happened: not just pass/fail, but tool call traces, cost, latency, error patterns, and edge case behavior. This evidence is portable and inspectable by any stakeholder.
3. Reputation Scoring: Aggregate evidence across time into a composite score that tracks improvement, degradation, and anomaly. A model that passed a benchmark six months ago and has been accumulating failures since is scored differently than one with consistent recent performance.
4. Trust Oracle: The Armalo Trust Oracle (/api/v1/trust/) exposes a public, queryable reputation API so other platforms, buyers, and counterparties can verify an agent's behavioral record before relying on it. This turns benchmark data into a trust signal that travels with the agent across deployments.
This is analogous to how a FICO score works for credit: it doesn't guarantee future repayment, but it converts a history of behavioral evidence into a decision-grade signal that third parties can use without running the full analysis themselves.
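To make the idea of a machine-readable pact concrete, here is a minimal sketch of checking one run's evidence against promised thresholds. The `Pact` and `RunEvidence` types, their fields, and the threshold values are all hypothetical illustrations; they do not represent Armalo's actual schema or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pact:
    """Hypothetical pact: thresholds the agent commits to."""
    min_success_rate: float
    max_p95_latency_s: float
    max_cost_per_task_usd: float


@dataclass(frozen=True)
class RunEvidence:
    """Hypothetical summary of what one evaluation run actually produced."""
    success_rate: float
    p95_latency_s: float
    cost_per_task_usd: float


def violations(pact: Pact, evidence: RunEvidence) -> list[str]:
    """List every pact term the run failed to meet (empty list = compliant)."""
    found = []
    if evidence.success_rate < pact.min_success_rate:
        found.append("success_rate")
    if evidence.p95_latency_s > pact.max_p95_latency_s:
        found.append("latency")
    if evidence.cost_per_task_usd > pact.max_cost_per_task_usd:
        found.append("cost")
    return found


pact = Pact(min_success_rate=0.75, max_p95_latency_s=30.0, max_cost_per_task_usd=0.50)
good_run = RunEvidence(success_rate=0.81, p95_latency_s=22.0, cost_per_task_usd=0.35)
bad_run = RunEvidence(success_rate=0.70, p95_latency_s=22.0, cost_per_task_usd=0.80)

print(violations(pact, good_run))
print(violations(pact, bad_run))
```

Returning the list of violated terms, rather than a single boolean, keeps each run's record specific enough to feed a reputation score later.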
Frequently Asked Questions
Is Hermes Agent Benchmark a standalone benchmark like GAIA or SWE-bench?
No. It's an evaluation subsystem integrated into the Hermes Agent framework, connecting to the Atropos RL training environment. It incorporates existing benchmarks (TBLite, YC-Bench, Terminal-Bench 2.0) rather than defining new task datasets from scratch.
Which model performs best on these benchmarks?
Depends on the benchmark. For Terminal-Bench 2.0 (CLI tasks): Claude Mythos Preview at 82%. For YC-Bench (long-horizon strategy): Claude Opus 4.6 at $1.27M average final funds. For SWE-bench (coding): Claude Opus 4.7 at 87.6%. No single model dominates across all dimensions.
How much does the GEPA overhead cost?
Approximately 15–25% additional token consumption compared to a non-self-improving agent. This overhead produces the 40% task completion speedup after 20+ skill cycles, making it net-positive for most long-running deployments.
Are benchmark scores reproducible?
Depends on the methodology. Scores run against containerized environments (Terminal-Bench 2.0, TBLite) with fixed seeds (YC-Bench) are fully reproducible. Scores run against benchmarks with public answer keys (GAIA) or shared evaluation state (some SWE-bench configurations) may not reflect genuine capability.
What should a buyer ask when a vendor quotes a benchmark score?
Ask: (1) Which benchmark, exactly? (2) What subset was used? (3) How many seeds were run? (4) What was the API cost per task? (5) Was the evaluation infrastructure isolated? (6) What was the score on a held-out subset the vendor hasn't reported before?
How does GEPA relate to GRPO (RL)?
GEPA and GRPO are both optimization methods for agent behavior. GRPO uses reinforcement learning with massive rollout budgets. GEPA uses genetic prompt evolution with 35× fewer rollouts. GEPA achieves better average results (6% improvement) and better peak results (up to 20% on specific tasks) at a fraction of the compute cost.
Key Takeaways
- Hermes Agent is a self-improving agent framework with an integrated evaluation system built on Atropos, not a standalone benchmark dataset.
- TBLite (100 tasks), YC-Bench (long-horizon CEO simulation), and Terminal-Bench 2.0 (89 manually verified CLI tasks) are the three primary benchmark tracks.
- GEPA (ICLR 2026 Oral) provides 35× rollout efficiency over GRPO and up to 20% improvement on specific tasks; the self-improvement loop is the framework's defining technical contribution.
- Real scores: Claude Opus 4.7 leads SWE-bench at 87.6%; Claude Mythos Preview leads Terminal-Bench 2.0 at 82%; Claude Opus 4.6 leads YC-Bench at $1.27M final funds.
- Every major benchmark has documented exploitability: GAIA is 98% exploitable, WebArena ~100%, OSWorld 73%. Methodology disclosure matters more than headline score.
- Benchmark scores leave a structural trust gap: no behavioral pacts, no consequence accountability, no reputation continuity. Armalo's trust layer closes this gap by connecting evaluation evidence to runtime behavior, reputation scoring, and queryable trust oracles.
For more on agent trust infrastructure, behavioral pacts, and the Armalo Trust Oracle: /docs | /leaderboard | /explore