What is the Hermes Agent Benchmark?

Hermes Agent Benchmark is Armalo’s 16-dimension adversarial trust scorecard for AI agents. Its dimensions include accuracy, reliability, self-audit (Metacal), scope honesty, safety, security, cost efficiency, latency, model compliance, runtime compliance, harness stability, and bond posture.

How is Hermes different from a standard accuracy benchmark?

Accuracy benchmarks measure how often the agent gives the right answer. Hermes measures whether the agent can be trusted in production — including how it behaves under adversarial pressure, whether it stays within its declared scope, and whether its self-reported confidence matches reality.
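One common way to check whether self-reported confidence matches reality is expected calibration error (ECE). The sketch below is a generic illustration of that idea, not Armalo's self-audit method, which this page does not describe. The function name, the equal-width binning, and the default of 10 bins are all assumptions.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average gap between stated confidence and observed accuracy.

    Generic ECE with equal-width bins; an illustrative assumption, not a
    description of how Hermes scores self-audit.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        return 0.0

    # Group each answer into a confidence bin; 1.0 falls into the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        # Each bin contributes in proportion to how many answers it holds.
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece
```

An agent that claims 90% confidence on four answers and gets three right has a gap of 0.15 in that bin. A perfectly calibrated agent scores 0.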

Can I run Hermes against my own agent?

Yes. Sign up for free, register your agent endpoint, and run the full battery — adversarial evals, jury arbitration, signed scorecard. The free tier covers one agent and three runs per month.

How long does a Hermes run take?

A first-pass Hermes run completes in 5–15 minutes for most agents. After that, the full adversarial battery with jury arbitration runs continuously, and your score updates as your agent behaves over time.

Is Hermes Agent Benchmark free?

The downloadable scorecard, the documentation, and the free-tier run on one agent are all free. Pro and Enterprise plans unlock continuous evals, multiple agents, audit-pack exports, and SLA-grade jury arbitration.

Hermes Agent Benchmark

The 16-dimension adversarial scorecard for AI agents

Accuracy benchmarks measure whether an agent answers correctly. Hermes measures whether it can be trusted in production: under adversarial pressure, across the dimensions a buyer actually cares about, with a score you can verify.

Score my agent · Download the scorecard

Free tier · 1 agent · 3 evals/month · No credit card

The 12 scored dimensions

Accuracy: 14%
Reliability: 13%
Safety: 11%
Self-audit (Metacal™): 9%
Security: 8%
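To make the weights concrete, here is a minimal sketch of how per-dimension scores could roll up into a weighted total. It uses only the five weights listed above, which sum to 55%. The weights for the remaining dimensions are not given here, and the simple weighted sum, the 0-to-1 score scale, and all names in the code are assumptions rather than Armalo's published formula.

```python
# Weights as listed on this page; the other dimensions are not shown here.
LISTED_WEIGHTS = {
    "accuracy": 0.14,
    "reliability": 0.13,
    "safety": 0.11,
    "self_audit": 0.09,
    "security": 0.08,
}


def weighted_contribution(scores: dict, weights: dict = LISTED_WEIGHTS) -> float:
    """Sum of weight * score over the listed dimensions.

    Assumes each score is on a 0..1 scale and that dimensions combine as a
    plain weighted sum; both are illustrative assumptions.
    """
    missing = sorted(set(weights) - set(scores))
    if missing:
        raise KeyError(f"missing scores for: {', '.join(missing)}")
    for name in weights:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"{name} score must be in [0, 1]")
    return sum(weights[name] * scores[name] for name in weights)


example_scores = {
    "accuracy": 1.0,
    "reliability": 1.0,
    "safety": 1.0,
    "self_audit": 1.0,
    "security": 1.0,
}
# A perfect score on the five listed dimensions contributes their full
# combined weight of 0.55 toward the overall total.
listed_total = weighted_contribution(example_scores)
```

Under these assumptions, a weak result on a heavily weighted dimension such as accuracy costs more than the same result on security.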

Try it now — no signup

Run a preview trust check on any agent

Paste an agent endpoint URL. We'll show you what an Armalo trust scorecard looks like before you sign up.

Free download · No credit card · Save as PDF

The Hermes Agent Benchmark Scorecard

The same 16-dimension scorecard Armalo Pro agents are graded on. Take it, copy it, run it against your own agent.

16-dimension scorecard with exact weights and pass/fail thresholds
Adversarial test catalog with example prompts you can run today
Failure-mode taxonomy and remediation playbook
Submission template for the public Hermes leaderboard

Stop talking about benchmarks. Score your agent.

Sign up, drop in an agent endpoint, and watch the 16-dimension score update in real time. The free tier covers one agent and three runs per month, with no credit card required.

Signed, verifiable scorecard you can hand to procurement
Continuous adversarial evals + jury arbitration on Pro
OpenAPI spec — wire it into your CI in under 30 minutes
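As a rough idea of what a CI gate on a scorecard might look like, the sketch below fails a build when any dimension falls under a chosen minimum. It does not call the Armalo API. The scorecard shape (a flat JSON object mapping dimension names to 0-to-1 scores), the file path, and the thresholds are all hypothetical; the real response format is defined by the OpenAPI spec.

```python
import json
import sys


def failing_dimensions(scorecard: dict, thresholds: dict) -> list:
    """Return one message per dimension that is missing or below its minimum.

    `scorecard` is a hypothetical flat mapping of dimension name -> score.
    """
    failures = []
    for name, minimum in sorted(thresholds.items()):
        score = scorecard.get(name)
        if score is None:
            failures.append(f"{name}: missing from scorecard")
        elif score < minimum:
            failures.append(f"{name}: {score:.2f} < {minimum:.2f}")
    return failures


def gate(path: str, thresholds: dict) -> int:
    """Read a saved scorecard file and return a process exit code for CI."""
    with open(path, encoding="utf-8") as handle:
        scorecard = json.load(handle)
    failures = failing_dimensions(scorecard, thresholds)
    for message in failures:
        print(message, file=sys.stderr)
    return 1 if failures else 0
```

A pipeline step could save the scorecard to a file, then run something like `sys.exit(gate("scorecard.json", {"accuracy": 0.8, "safety": 0.7}))` so that a non-zero exit code blocks the deploy.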

Start free · Talk to Armalo

Read the benchmark, chapter by chapter

The Complete Guide

Architecture and Control Model

Failure Modes and Anti-Patterns

Market Map and Strategic Direction

Implementation Playbook

Security, Governance and Operational Controls

Buyer and Procurement Guide

More on Hermes

Hermes Agent Benchmark: Metrics, Scorecards, and Review Cadence

Hermes Agent Benchmark Failure Modes and Anti-Patterns: Metrics and Review System

Hermes Agent Benchmark: Leadership and Board-Level Framing

Hermes Agent Benchmark vs real workflow trust: What Serious Teams Keep Confusing

Hermes Agent Benchmark Failure Modes and Anti-Patterns: Case Study and Scenarios

How the Armalo Agent Ecosystem Surpasses Hermes Agent and OpenClaw: Memory Mesh, Trust Infrastructure, and Recursive Self-Improvement
