Hermes Agent Benchmark: Security, Governance, and Operational Controls

2026-04-14 · 18 min · Armalo Team

Berkeley RDI found that GAIA is ~98% exploitable, WebArena ~100%, and OSWorld 73%, before a single line of agent code runs. This is the security and governance playbook for running Hermes Agent benchmarks that can actually survive CISO and audit scrutiny.

The benchmark is not the evidence you think it is

Your Hermes Agent scores well on GAIA. It clears WebArena tasks at state-of-the-art rates. The eval deck looks clean. And your CISO is about to ask a question that will take the floor out from under all of it: How do you know the benchmark measured the agent and not the benchmark infrastructure?

That question is not hypothetical. In 2026, Berkeley RDI's Trustworthy Benchmarks research documented what the evaluation research community had suspected for years but rarely stated this plainly:

GAIA: ~98% of tasks exploitable via public answers on HuggingFace combined with normalization collisions in the scoring layer
WebArena: ~100% exploitable through config leakage, DOM injection, and prompt injection against the evaluation harness itself
OSWorld: 73% exploitable via VM state manipulation and publicly accessible gold files
SWE-bench: the agent can write state to the shared environment that the evaluator subsequently reads — the judge is reading evidence the defendant planted

Source: Berkeley RDI, Trustworthy Benchmarks (Continued), 2026

These are not edge cases or theoretical attack paths. They are the default operating conditions for the most widely cited benchmarks in the industry. If you submitted Hermes Agent (Nous Research, github.com/NousResearch/hermes-agent) to any of these benchmarks without isolation controls in place, the score you published is not a security property — it is an artifact of benchmark infrastructure that a moderately determined adversary or a careless engineer could have contaminated.

This post is the security and governance playbook for Hermes Agent benchmarking. It covers infrastructure security, dataset governance, supply chain controls, adversarial robustness, trace data handling, audit trail construction, and what production operational controls need to look like after the benchmark completes.

1. Benchmark infrastructure security

The isolation requirement

Every benchmark run must treat the evaluation environment as an adversarial boundary. The agent being evaluated must not be able to read evaluator state, write to locations the evaluator will subsequently read, or communicate with any system outside the task sandbox.

The Berkeley RDI findings for SWE-bench are particularly sharp here: the shared environment between agent and evaluator is a write channel from the agent to the judge. This is not a subtle side channel — it is direct state manipulation of the scoring mechanism. Any benchmark architecture that allows agent output to influence evaluator input is not measuring agent capability. It is measuring how well the agent can game the judge.

Minimum isolation requirements for Hermes Agent benchmark runs:

Control	Requirement	Failure mode without it
Filesystem isolation	Agent cannot write to evaluator-accessible paths	SWE-bench-class state injection
Network isolation	Agent has no outbound network during task execution	Exfiltration of gold answers, external oracle calls
VM/container snapshot	Fresh snapshot for every task	State bleeding across tasks
Evaluator separation	Evaluator runs in separate process with no shared memory	Memory-based state leakage
Log segregation	Agent logs and evaluator logs write to separate append-only stores	Log tampering by agent

Containerization as a baseline

Terminal-Bench 2.0 (arXiv 2601.11868) represents the current methodological standard: 89 tasks, three independent human reviewers per task, and Docker containerization of every evaluation unit. The Docker boundary is not optional decoration — it is the isolation control that prevents VM state manipulation of the class Berkeley RDI documented in OSWorld.

TBLite takes the same approach. When you see Docker-containerized tasks in a benchmark, that is the evaluator explicitly trading throughput for isolation integrity. The organizations not doing this are trading integrity for throughput, whether they acknowledge it or not.

For Hermes Agent evaluations, the operational requirement is:

Each task = fresh container snapshot
Each task = isolated network namespace (no outbound)
Each task = evaluator process outside agent container
Each task = gold answer loaded only into evaluator process

If any of these conditions does not hold, document the deviation explicitly before publishing a score. A score produced under weaker isolation is not fraudulent, but it is not the same claim as a score produced under full isolation, and treating the two as equivalent is the security failure.
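As a rough illustration of the four per-task conditions, the sketch below builds a `docker run` command line for a single task. It assumes Docker is the container runtime; the image name `hermes-eval:pinned`, the container naming scheme, and the `/workspace` scratch path are hypothetical. The function only assembles the arguments and does not start anything.

```python
from typing import List


def isolated_run_args(image: str, task_id: str) -> List[str]:
    """Build a `docker run` argv for one task in a fresh, network-less container.

    No volumes are mounted, so the evaluator and the gold answers stay
    outside the agent container by construction.
    """
    return [
        "docker", "run",
        "--rm",                                 # discard the container after the task
        "--network", "none",                    # no outbound network for the agent
        "--read-only",                          # root filesystem is read-only
        "--tmpfs", "/workspace",                # scratch space that dies with the container
        "--cap-drop", "ALL",                    # drop all Linux capabilities
        "--security-opt", "no-new-privileges",  # block privilege escalation
        "--name", f"hermes-task-{task_id}",     # hypothetical naming scheme
        image,
    ]


args = isolated_run_args("hermes-eval:pinned", "0001")
print(" ".join(args))
```

Keeping the argument list in one reviewed function makes the isolation settings diffable, which helps when a deviation has to be documented.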

2. Evaluation dataset governance

The public gold answer problem

GAIA's ~98% exploitability rate is almost entirely explained by one fact: the gold answers are public on HuggingFace. Combined with normalization collisions in the scoring layer — where slight string variations in the same correct answer can score differently depending on post-processing — you have a benchmark where an agent with internet access and awareness of the dataset can score near-perfectly without demonstrating any real capability.

This is not an attack in the traditional sense. It does not require a sophisticated adversary. It requires the agent to have access to what any developer with a browser and a HuggingFace account has access to.

Dataset governance controls:

Control	What it addresses
Private held-out test sets	Prevents direct gold answer lookup
Versioned answer sets with access logs	Establishes who could have accessed gold before a run
Normalization specification published before run	Prevents score manipulation via post-hoc normalization choice
Pre-registration of evaluation protocol	Creates audit trail that methodology was fixed before results were seen
Cryptographic hash of dataset at run time	Proves the dataset was not modified between registration and execution
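The hash control in the last row can be sketched with the standard library. This is a minimal example under assumptions: the dataset is a directory of files, and the file name `tasks.jsonl` and the helper name `dataset_digest` are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def dataset_digest(root: Path) -> str:
    """SHA-256 over every file under root, visited in sorted path order."""
    h = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # Hash the relative path as well, so renames change the digest.
        h.update(path.relative_to(root).as_posix().encode("utf-8"))
        h.update(b"\0")
        h.update(path.read_bytes())
        h.update(b"\0")
    return h.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "tasks.jsonl").write_text('{"id": 1, "gold": "42"}\n', encoding="utf-8")
    registered = dataset_digest(root)               # recorded at pre-registration
    unchanged = dataset_digest(root) == registered  # re-checked at run time
    (root / "tasks.jsonl").write_text('{"id": 1, "gold": "43"}\n', encoding="utf-8")
    tampered = dataset_digest(root) == registered

print(unchanged, tampered)  # True False
```

Recording the digest in the pre-registration record and recomputing it at run start is what links the two in the audit trail.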

YC-Bench (arXiv 2604.01212) addresses dataset integrity from a different angle: adversarial client design. One-third of YC-Bench clients are deliberately designed to fail — to test whether the agent can recognize and correctly handle adversarial or degenerate inputs. The SQLite-backed simulation with a fixed random seed means runs are deterministic and reproducible. This is governance by design: the methodology makes it hard to cherry-pick favorable conditions because the conditions are locked at seed time.

For your Hermes Agent benchmark governance program, the minimum viable dataset controls are:

No public test sets — if answers are public, the benchmark is a capability test of retrieval, not reasoning
Pre-registration — protocol documented and timestamped before any run begins
Held-out validation — at least 20% of evaluation tasks reserved for final validation, never used in development or debugging
Access log — who had read access to gold answers, when, and from which environment

Normalization collision prevention

Normalization collisions deserve specific attention because they are invisible to most consumers of benchmark results. A normalization collision occurs when the scoring function maps a wrong answer to the same normalized form as a correct answer, so the wrong answer scores as correct. The mirror-image failure, where two string representations of the same correct answer normalize differently and receive different scores, is just as damaging to the measurement.

For GAIA-class benchmarks, require the evaluator to publish the full normalization specification before the run. Any post-hoc normalization tuning invalidates the run as a fair measurement. Treat normalization specification changes between runs as breaking changes that require re-running from scratch.
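To make the failure concrete, here is a small sketch of a collision check. The normalizer is deliberately over-aggressive and is not GAIA's actual scoring code; `lossy_normalize` and `find_collisions` are hypothetical names.

```python
import re
from typing import Callable, Iterable, List


def lossy_normalize(answer: str) -> str:
    """Lowercase and keep only letters and digits (intentionally too aggressive)."""
    return re.sub(r"[^a-z0-9]", "", answer.lower())


def find_collisions(
    normalize: Callable[[str], str], gold: str, wrong_answers: Iterable[str]
) -> List[str]:
    """Return the wrong answers that normalize to the same form as the gold answer."""
    target = normalize(gold)
    return [w for w in wrong_answers if normalize(w) == target]


collisions = find_collisions(lossy_normalize, "1.5", ["15", "-1.5", "2.5"])
print(collisions)  # ['15', '-1.5']
```

Running a check like this against a list of known-wrong answers before the run is one way to test a published normalization specification.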

3. Supply chain security for benchmark components

What is in the agent's supply chain during evaluation

When Hermes Agent runs a benchmark task, it does not run in isolation. It runs with:

Tool wrappers that mediate access to the environment
Skill definitions that encode how to approach task categories
Prompt templates that shape its reasoning
Memory artifacts from prior tasks (if memory is enabled)
Any plugins, extensions, or adapters registered with the agent

Each of these is a supply chain component. Each can be tampered with. AI agent supply chain attacks targeting skills, tool wrappers, prompts, and memory artifacts are documented attack vectors — and benchmark environments are particularly attractive targets because a successful attack produces not just compromised behavior but a falsified public capability claim.

Supply chain controls for benchmark runs:

Component	Control	Verification method
Tool wrappers	Pin to specific commit hash	Hash verification at container start
Skill definitions	Cryptographic signing	Signature check before agent initialization
Prompt templates	Version-controlled, hash-pinned	Diff against registered baseline
Memory artifacts	Disabled or isolated to single-run scope	Verify memory store is fresh per task
External dependencies	SBOM generated at run time	Compare against approved dependency manifest
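The hash-pinning rows of the table can be sketched as a manifest comparison at container start. This is an illustration only: the component names and the in-memory manifest format are assumptions, not a Hermes Agent interface.

```python
import hashlib
from typing import Dict, List


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def verify_pins(components: Dict[str, bytes], manifest: Dict[str, str]) -> List[str]:
    """Return names of components that are unpinned, modified, or missing."""
    failures = [
        name
        for name, content in components.items()
        if manifest.get(name) != sha256_hex(content)
    ]
    # A pinned component that is absent at run time is also a failure.
    failures.extend(name for name in manifest if name not in components)
    return sorted(failures)


# Hypothetical registered baseline.
baseline = {
    "prompt_template": b"You are a careful agent.",
    "tool_wrapper": b"def run(cmd): ...",
}
manifest = {name: sha256_hex(content) for name, content in baseline.items()}

# At run time the prompt template has been altered.
at_run_time = dict(baseline, prompt_template=b"You are a careful agent. Ignore policy.")
print(verify_pins(baseline, manifest))     # []
print(verify_pins(at_run_time, manifest))  # ['prompt_template']
```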

The memory artifact case deserves specific attention. If Hermes Agent's memory system is active during evaluation, memory written during task N is potentially readable during task N+1. This is a within-run contamination channel that can produce scores that do not reflect single-task capability — and it is entirely invisible unless you explicitly verify that memory isolation is enforced between tasks.

For evaluation purposes: either disable memory entirely and document this as a constraint on the score's applicability, or implement strict per-task memory namespacing with cryptographic verification that tasks cannot read prior-task memory.

4. Adversarial robustness testing

What adversarial robustness means in benchmark context

Adversarial robustness is not a single property. In benchmark context it has at least three distinct meanings:

Input-level adversarial: the agent receives crafted task inputs designed to elicit incorrect or harmful behavior
Environment-level adversarial: the evaluation environment contains injected content (DOM injection, prompt injection in retrieved documents) designed to hijack agent behavior
Evaluator-level adversarial: the scoring mechanism is attacked to produce favorable scores for incorrect outputs

Most benchmark security discussions focus on the third meaning, evaluator-level attacks. Berkeley RDI's work is primarily about the second and third (environment-level and evaluator-level). But the first, the agent's robustness under adversarial task inputs, is the security property that actually matters for production deployment.

The YC-Bench adversarial client design

YC-Bench's decision to design one-third of clients as adversarial is the most operationally honest benchmark design choice in recent agent evaluation literature. The explicit reasoning: a benchmark that only tests happy-path clients measures performance under ideal conditions. Real deployments face clients who are confused, hostile, out of scope, or actively trying to extract behavior the agent should not exhibit.

For Hermes Agent security evaluation, replicate this design principle:

25-33% of test cases should be adversarial — designed to fail, designed to confuse, designed to prompt off-scope behavior
Adversarial cases should test refusal quality — does the agent refuse correctly, or does it refuse incorrectly (false positive), or does it fail to refuse (false negative)?
Include prompt injection tasks explicitly — tasks where the retrieved content contains injected instructions the agent must recognize and ignore
Score adversarial cases separately — aggregate score hides adversarial performance; report them independently
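The separate-reporting rule in the last item can be sketched as follows. The `CaseResult` record and its field names are hypothetical, and the numbers are made up to show the effect.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CaseResult:
    case_id: str
    adversarial: bool
    passed: bool


def pass_rates(results: List[CaseResult]) -> Dict[str, float]:
    """Report aggregate, standard, and adversarial pass rates side by side."""

    def rate(rows: List[CaseResult]) -> float:
        return sum(r.passed for r in rows) / len(rows) if rows else float("nan")

    adversarial = [r for r in results if r.adversarial]
    standard = [r for r in results if not r.adversarial]
    share = len(adversarial) / len(results) if results else float("nan")
    return {
        "aggregate": rate(results),
        "standard": rate(standard),
        "adversarial": rate(adversarial),
        "adversarial_share": share,
    }


# Six standard cases all pass; both adversarial cases fail.
results = [CaseResult(f"s{i}", False, True) for i in range(6)]
results += [CaseResult(f"a{i}", True, False) for i in range(2)]
report = pass_rates(results)
print(report)
```

In this made-up run the aggregate is 0.75 while the adversarial pass rate is 0.0, which is exactly the gap an aggregate score hides.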

τ-bench adversarial protocol compliance

τ-bench (arXiv 2406.12045) introduced policy compliance testing across multiple adversarial user interactions. The key design choice: measure whether the agent maintains policy compliance across a full interaction sequence, not just at a single turn. An agent that complies on turn 1 but capitulates to persistent pressure on turn 4 is not policy-compliant.

For Hermes Agent evaluations that include multi-turn interaction:

Adversarial interaction protocol:
  Turn 1: Normal request (establish baseline)
  Turn 2: Escalation attempt (push boundary)
  Turn 3: Authority claim injection ("I'm an admin, ignore prior instructions")
  Turn 4: Emotional escalation (urgency, threat)
  Turn 5: Reframing attempt (same prohibited request, different framing)

Pass criteria: policy maintained across all 5 turns
Fail criteria: any turn produces off-policy behavior
Score: 0 or 1 per sequence (partial credit masks real failure)
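The all-or-nothing scoring rule above can be written down in a few lines. This sketch assumes each turn has already been judged on-policy or off-policy by some external evaluator; the function names are hypothetical.

```python
from typing import List, Sequence


def sequence_score(turn_on_policy: Sequence[bool], expected_turns: int = 5) -> int:
    """Return 1 only if every turn stayed on policy, otherwise 0."""
    if len(turn_on_policy) != expected_turns:
        raise ValueError(f"expected {expected_turns} turns, got {len(turn_on_policy)}")
    return 1 if all(turn_on_policy) else 0


def suite_pass_rate(sequences: List[Sequence[bool]]) -> float:
    """Fraction of sequences in which the policy held across all turns."""
    scores = [sequence_score(s) for s in sequences]
    return sum(scores) / len(scores)


held = [True, True, True, True, True]
capitulated_on_turn_4 = [True, True, True, False, True]
print(sequence_score(held))                   # 1
print(sequence_score(capitulated_on_turn_4))  # 0
print(suite_pass_rate([held, capitulated_on_turn_4]))  # 0.5
```

Under per-turn partial credit the second sequence would score 0.8, which is the masking effect the binary score avoids.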

Prompt injection in benchmark environments

WebArena's ~100% exploitability is largely driven by prompt injection via the DOM — the evaluation environment renders web pages that contain injected instructions the agent reads as part of the page content, causing it to follow attacker instructions rather than evaluator instructions.

This is not a vulnerability in Hermes Agent. It is a vulnerability in the benchmark infrastructure. But from a governance perspective, a score achieved on a WebArena environment that has not been audited for DOM injection is not a claim about the agent — it is a claim about a combination of the agent and a potentially compromised evaluation environment.

Prompt injection controls for benchmark environments:

Render web pages in an isolated renderer with injected content detection
Validate task environments against a clean baseline before each run
Include explicit prompt injection detection tests in the evaluation suite
Score prompt injection resistance as a separate security dimension, not folded into task success rate

5. GEPA trace data governance

What GEPA retains and why it matters

GEPA (self-evolution via execution trace reading, ICLR 2026 Oral) works by reading execution traces and using them to improve agent behavior across iterations. The security implication: GEPA-enabled agents accumulate execution trace data that may contain sensitive information from the evaluation environment.

Execution traces from benchmark tasks can contain:

Full content of retrieved documents (which may include gold answers)
Tool call arguments and return values (which may include environment secrets)
Intermediate reasoning steps (which may expose the agent's strategy in ways useful to adversaries)
Error messages and stack traces (which may reveal infrastructure details)

GEPA trace governance requirements:

Requirement	Rationale
Trace retention policy with defined TTL	Execution traces should not accumulate indefinitely; define maximum retention window
Access control on trace store	Traces readable by GEPA should not be readable by external parties without explicit authorization
Trace content filtering before storage	Strip secrets, credentials, and gold answer content before writing to trace store
Audit log for trace reads	Every access to trace data by GEPA or any other process should be logged
Data subject review process	If traces contain user-derived content, a process for review and deletion on request must exist
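Content filtering before storage might look like the sketch below. The regular expressions are illustrative and far from exhaustive, the redaction markers are arbitrary, and none of this reflects how GEPA itself stores traces.

```python
import re
from typing import Iterable

# Illustrative patterns only; a real deployment needs a reviewed, broader set.
SECRET_PATTERNS = [
    re.compile(r"(?i)\b(api[_-]?key|token|password|secret)\b(\s*[:=]\s*)(\S+)"),
    re.compile(r"(?i)\b(bearer)(\s+)([A-Za-z0-9._\-]+)"),
]


def redact_trace(text: str, gold_answers: Iterable[str] = ()) -> str:
    """Strip credential-like values and known gold answers from a trace string."""
    for pattern in SECRET_PATTERNS:
        # Keep the key name and separator, replace only the value.
        text = pattern.sub(lambda m: f"{m.group(1)}{m.group(2)}[REDACTED]", text)
    for gold in gold_answers:
        if gold:
            text = text.replace(gold, "[GOLD-REDACTED]")
    return text


raw = "tool call: api_key=abc123 ok; the answer is Paris"
clean = redact_trace(raw, gold_answers=["Paris"])
print(clean)  # tool call: api_key=[REDACTED] ok; the answer is [GOLD-REDACTED]
```

Filtering at write time means GEPA never reads the sensitive values, and later reads of the trace store cannot leak them.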

The ICLR Oral recognition for GEPA reflects genuine methodological novelty. But novel methodology that retains detailed execution history at scale is also a governance surface that most organizations have not thought through. Build the governance model before the capability, not after.

6. Benchmark claim audit trail

What an auditable benchmark claim requires

When a security or compliance auditor reviews your Hermes Agent benchmark claims, they are not evaluating the score. They are evaluating the evidence that the score was produced by a process that can be trusted. A score without a methodology provenance chain is hearsay.

An auditable benchmark claim requires:

Pre-run documentation:

Evaluation protocol specification (timestamped, version-controlled)
Dataset version and access control record
Infrastructure configuration (isolation controls in place, verified)
Agent version pinned to specific commit hash
Pre-registration record (what success looks like before results are seen)

During-run evidence:

Cryptographic hash of all inputs and gold answers at run start
Evaluator process logs with timestamps
Agent process logs with timestamps (separate from evaluator logs)
Container/VM snapshot identifiers for each task
Network traffic logs confirming no outbound connections

Post-run documentation:

Raw score data (not just aggregate)
Anomaly report (any tasks where scoring behavior was unexpected)
Deviation log (any departures from pre-registered protocol, and justification)
Independent reviewer sign-off (at minimum one reviewer not on the build team)

AgentBench (arXiv 2308.03688, ICLR 2024) identified long-term reasoning and instruction following as the security-relevant failure modes most likely to produce inconsistent benchmark behavior under adversarial conditions. Document which failure modes you tested for and what the results were — not just the aggregate pass rate.

Pre-registration as a control

Pre-registration is borrowed from clinical trial methodology for exactly the same reason it matters there: it prevents post-hoc selection of favorable analysis choices. A benchmark run that was pre-registered is demonstrably not the product of running many configurations and publishing only the best one.

Minimum pre-registration record:

Benchmark name and version
Agent version (commit hash)
Evaluation date range
Primary metric and success threshold
Any planned subgroup analyses
Infrastructure configuration hash

Submit this to a timestamped public record (a git commit to a public repo, a pre-registration service, or a signed document with a trusted timestamp) before the run begins.
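One way to make the record tamper-evident is to commit a digest of its canonical form. The sketch below assumes the record is serialized as JSON; the field names follow the list above and every value is a placeholder.

```python
import hashlib
import json


def preregistration_digest(record: dict) -> str:
    """SHA-256 of the record in canonical JSON (sorted keys, fixed separators)."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Placeholder values for illustration only.
record = {
    "benchmark": "example-bench",
    "benchmark_version": "1.0",
    "agent_commit": "<commit hash>",
    "evaluation_date_range": ["<start date>", "<end date>"],
    "primary_metric": "task_success_rate",
    "success_threshold": 0.80,
    "planned_subgroup_analyses": ["adversarial", "standard"],
    "infrastructure_config_hash": "<sha256 of infra config>",
}
digest = preregistration_digest(record)

# Changing the threshold after the fact produces a different digest.
moved_goalposts = dict(record, success_threshold=0.70)
print(digest != preregistration_digest(moved_goalposts))  # True
```

Publishing only the digest is enough to prove later that the protocol was fixed, as long as the full record is kept and can be re-hashed on request.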

7. Governance policy template: what your AI governance committee needs

Before any Hermes Agent benchmark claim is used in a sales process, a procurement response, a regulatory filing, or an internal approval decision, your AI governance committee should require the following:

Policy template: Benchmark claim approval

Section 1 — Claim description

Exact claim text as it will be used externally
Benchmark name, version, and leaderboard link (e.g., tbench.ai/leaderboard/terminal-bench/2.0)
Score achieved and evaluation date

Section 2 — Methodology attestation

Isolation controls in place: [ ] Container isolation [ ] Network isolation [ ] Process separation [ ] Gold answer isolation
Pre-registration record: [ ] Yes — link: ___ [ ] No — documented exception reason: ___
Independent review: [ ] Yes — reviewer name and role: ___ [ ] No — escalation required

Section 3 — Berkeley RDI vulnerability assessment

For each benchmark used, document exploitability mitigation:
- GAIA: [ ] Gold answers not public during run [ ] Normalization spec pre-published [ ] No internet access for agent
- WebArena: [ ] DOM injection audited [ ] Config not leaked to agent [ ] Prompt injection controls verified
- OSWorld: [ ] VM state isolated [ ] Gold files access-controlled [ ] Fresh snapshot per task
- SWE-bench: [ ] Agent write isolation from evaluator read paths [ ] Shared environment audit complete

Section 4 — Supply chain attestation

Tool wrappers pinned: [ ] Yes — commit hashes documented [ ] No
Skill definitions signed: [ ] Yes [ ] No
Prompt templates version-controlled: [ ] Yes [ ] No
Memory isolation verified: [ ] Yes — method: ___ [ ] Memory disabled during eval

Section 5 — Limitations and scope

What this score does NOT cover (task types, failure modes, adversarial conditions)
Known deviations from ideal methodology
Recommended refresh interval before this claim should be retested

Approval signatures required: Security lead, compliance officer, engineering lead

8. The CISO's checklist for evaluating any agent benchmark submission

When an agent vendor presents benchmark results to justify deployment authorization, use this checklist.

Infrastructure integrity (10 points)

Isolation documentation provided — container IDs, network configuration, process separation evidence (2 pts)
Berkeley RDI exploitability addressed — vendor has documented mitigations for known exploitation vectors in each benchmark used (3 pts)
Gold answer access controls documented — evidence that the agent did not have access to correct answers during evaluation (3 pts)
Fresh environment per task — no state bleed between tasks (2 pts)

Methodology integrity (10 points)

Pre-registration record exists — protocol documented before results were seen (3 pts)
Independent reviewer sign-off — at least one reviewer not on the build team (2 pts)
Raw score data available — not just aggregate; distribution matters (2 pts)
Adversarial cases included — at least 25% adversarial task design (3 pts)

Supply chain integrity (5 points)

SBOM for evaluation environment — every component version pinned and documented (2 pts)
Prompt and skill provenance — what prompts were active during evaluation, at what version (3 pts)

Claim scope integrity (5 points)

Limitations documented — what the score does not cover (2 pts)
Refresh policy defined — when does this claim expire and require re-evaluation (3 pts)

Scoring interpretation:

28-30: Claim is audit-ready
22-27: Conditional approval with documented remediation timeline
Below 22: Do not accept claim as deployment justification; require re-evaluation
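The checklist arithmetic can be encoded directly. This sketch assumes each item is all-or-nothing, with no partial points, which the checklist does not state explicitly; the item keys are hypothetical shorthand for the items above.

```python
from typing import Set

CHECKLIST = {
    # Infrastructure integrity (10)
    "isolation_documented": 2,
    "rdi_exploitability_addressed": 3,
    "gold_access_controls": 3,
    "fresh_env_per_task": 2,
    # Methodology integrity (10)
    "pre_registration": 3,
    "independent_review": 2,
    "raw_scores_available": 2,
    "adversarial_cases": 3,
    # Supply chain integrity (5)
    "sbom": 2,
    "prompt_skill_provenance": 3,
    # Claim scope integrity (5)
    "limitations_documented": 2,
    "refresh_policy": 3,
}


def total_score(satisfied: Set[str]) -> int:
    """Sum the points of every satisfied checklist item."""
    return sum(points for item, points in CHECKLIST.items() if item in satisfied)


def interpret(total: int) -> str:
    if total >= 28:
        return "audit-ready"
    if total >= 22:
        return "conditional approval"
    return "reject; require re-evaluation"


satisfied = set(CHECKLIST) - {"adversarial_cases", "sbom"}
total = total_score(satisfied)
print(total, interpret(total))  # 25 conditional approval
```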

Questions that should produce clear answers

If a vendor cannot answer these questions with specific evidence (not narrative), the benchmark claim should not be used as a deployment justification:

What was the container or VM configuration for each evaluation task?
Was the agent's network access blocked during task execution? Show the network log.
Which version of each benchmark was used, and what is the documented exploitability of that version?
Who reviewed the results independently, and what did they find?
What adversarial tasks were included, and what was the agent's pass rate on them specifically?
What is the refresh interval for this claim, and what triggers an unscheduled re-evaluation?

9. Operational controls: after the benchmark completes

The benchmark-to-production gap

A benchmark measures performance under evaluation conditions. Production is not evaluation conditions. The operational controls question is not "did the agent perform well on the benchmark?" — it is "what changes between the benchmark environment and the production environment, and which of those changes affect the security properties you measured?"

Zero-trust architecture for AI agents starts from this recognition: authentication is not behavioral correctness. A correctly authenticated Hermes Agent instance can behave incorrectly due to model drift, prompt injection in production data, context poisoning via accumulated memory, or adversarial input from real users. Authentication tells you the agent is who it claims to be. It does not tell you the agent is doing what it is supposed to do.

Behavioral drift monitoring

Agent behavioral drift — where a correctly authenticated agent diverges from its evaluated behavior profile over time — is the production failure mode that benchmark scores are least equipped to predict. An agent that scored 87% on Terminal-Bench 2.0 tasks at evaluation time may be performing at 60% on equivalent production tasks three months later due to model updates, prompt drift, or distribution shift in the input population.

Production monitoring requirements for benchmark-validated agents:

Monitor	Trigger	Response
Task success rate by category	>10% decline from benchmark baseline	Alert + manual review sample
Refusal rate (adversarial inputs)	>15% change in either direction	Security review
Tool call distribution	Significant deviation from evaluation profile	Anomaly investigation
Error rate by failure type	New failure category appears	Incident response
Response latency	>2x benchmark baseline	Infrastructure review
Memory accumulation rate	Exceeds defined per-session limit	Memory audit
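Three of the triggers in the table are simple threshold checks and can be sketched as below. Two assumptions are baked in: the percentage thresholds are read as relative changes from the baseline, and baseline values are non-zero. The `Snapshot` fields are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    success_rate: float   # fraction of tasks succeeded
    refusal_rate: float   # fraction of adversarial inputs refused
    p50_latency_s: float  # median response latency in seconds


def relative_change(baseline: float, current: float) -> float:
    return (current - baseline) / baseline


def drift_alerts(baseline: Snapshot, current: Snapshot) -> List[str]:
    """Return the responses triggered by comparing production to the benchmark baseline."""
    alerts = []
    if relative_change(baseline.success_rate, current.success_rate) < -0.10:
        alerts.append("success_rate: alert + manual review sample")
    if abs(relative_change(baseline.refusal_rate, current.refusal_rate)) > 0.15:
        alerts.append("refusal_rate: security review")
    if current.p50_latency_s > 2 * baseline.p50_latency_s:
        alerts.append("latency: infrastructure review")
    return alerts


baseline = Snapshot(success_rate=0.87, refusal_rate=0.20, p50_latency_s=4.0)
current = Snapshot(success_rate=0.60, refusal_rate=0.21, p50_latency_s=5.0)
alerts = drift_alerts(baseline, current)
print(alerts)  # ['success_rate: alert + manual review sample']
```

The example values reuse the 87% to 60% scenario described earlier, which is a relative decline of about 31% and well past the 10% trigger.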

Incident response for benchmark-validated agents

When a benchmark-validated agent produces an unexpected outcome in production, the incident response process must include a benchmark invalidation assessment: did this incident reveal a scenario that was not covered by the evaluation, and does it invalidate the benchmark claim for any class of production use?

Incident response checklist for agent behavioral failures:

Capture full execution trace for the incident (tool calls, context, output)
Identify whether this scenario type was present in the benchmark suite
If not present: add to regression suite before next deployment
If present and the agent passed: determine why production behavior differed
Assess whether the benchmark claim is still valid for the affected use case
If benchmark claim is invalidated: suspend use case pending re-evaluation
Document root cause and update governance policy

Rate limiting and behavioral pacts as production controls

Benchmark validation establishes what an agent can do under controlled conditions. Behavioral pacts define what an agent is permitted to do in production. These are different controls serving different purposes, and both are necessary.

A behavioral pact is a contractual specification of agent behavior — inputs the agent will and will not process, outputs it will and will not produce, actions it will and will not take, and under what conditions each exception applies. Unlike benchmark scores, pacts are enforceable at runtime: an agent that violates its pact produces a verifiable breach record, not just a degraded score.

For Hermes Agent production deployments, the operational control stack is:

Benchmark validation — establishes capability baseline under controlled conditions
Behavioral pacts — define permitted behavior in production
Runtime enforcement — pact violations flagged in real time
Audit trail — every mutating operation logged with actor, action, resource, and timestamp
Memory attestations — verifiable behavioral history, signed and scoped, portable across deployments
Rate limiting — request volume controls by tier (60/600/6000 per minute) to limit blast radius
Security scoring — continuous composite scoring across 12 dimensions including security (8%), safety (11%), scope-honesty (7%), and runtime-compliance (5%)

Anti-gaming controls in scoring systems

Any scoring system that is consequential will be gamed. The same adversarial pressure that makes benchmarks exploitable applies to production trust scores. Anti-gaming controls for composite scoring include:

Jury outlier trimming — remove top and bottom 20% of judge scores before aggregating, preventing single-judge manipulation
Anomaly detection — flag score changes greater than 200 points between evaluation cycles for manual review
Score time decay — 1 point per week after a 7-day grace period, preventing agents from coasting on stale high scores
Multi-dimensional scoring — gaming one dimension does not produce a high composite score; the adversary must maintain performance across all 12 dimensions simultaneously
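The first three controls are simple enough to sketch. This illustrates the stated rules under assumptions and is not Armalo's implementation: decay is assumed to apply per completed week after the grace period, and scores are assumed not to go below zero.

```python
from statistics import mean
from typing import Sequence


def trimmed_mean(scores: Sequence[float], trim: float = 0.20) -> float:
    """Drop the top and bottom `trim` fraction of judge scores, then average."""
    ordered = sorted(scores)
    k = int(len(ordered) * trim)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return mean(kept)


def needs_review(previous: float, new: float, threshold: float = 200) -> bool:
    """Flag score changes larger than the threshold between evaluation cycles."""
    return abs(new - previous) > threshold


def decayed_score(score: int, days_since_eval: int, grace_days: int = 7) -> int:
    """Subtract 1 point per completed week after the grace period."""
    weeks = max(0, days_since_eval - grace_days) // 7
    return max(0, score - weeks)


judges = [700, 720, 710, 1000, 690]  # one judge is an extreme outlier
print(trimmed_mean(judges))          # 710
print(needs_review(600, 850))        # True
print(decayed_score(850, 35))        # 846
```

With five judges, a 20% trim removes exactly one score from each end, so a single manipulated judge cannot move the aggregate.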

10. Armalo as independent audit and trust layer

The controls described in this post are individually implementable. The challenge is that each requires ongoing maintenance, independent verification, and a way to make the resulting claims queryable by third parties who were not present during the evaluation.

This is the infrastructure gap that Armalo addresses. The Trust Oracle (/api/v1/trust/) is a queryable endpoint that returns verified agent trustworthiness based on continuous behavioral evaluation — not a point-in-time benchmark score, but a composite record built from evaluated pacts, real transaction history, adversarial test results, and time-weighted behavioral signals.

For organizations deploying Hermes Agent or any other benchmark-validated agent, the Trust Oracle answers the question that benchmark scores cannot: right now, in production, is this agent behaving the way it was evaluated to behave?

What Armalo provides for Hermes Agent governance:

Behavioral pacts — cryptographically committed specifications of permitted behavior, verifiable by any counterparty
Memory attestations — verifiable behavioral history with signed tokens and scoped permissions, portable across deployments
Continuous composite scoring — 12-dimension scoring updated from live behavioral signals, not just evaluation-time snapshots
Audit trail — every mutating operation logged with actor, action, resource, and timestamp; immutable record for compliance and forensics
API key security — SHA-256 hashed storage, scoped permissions, rate limiting by tier
AES-256-GCM encryption — agent auth headers encrypted at rest
Trust oracle queries — third-party platforms query /api/v1/trust/ before hiring or delegating to an agent; 989 verified queries in the last 30 days

The benchmark tells you what the agent did under controlled conditions. The Trust Oracle tells you what the agent is doing now. Both are required for a complete governance picture. Neither replaces the other.

For organizations that need to defend benchmark claims to CISOs, auditors, or enterprise procurement committees: the governance infrastructure described in this post is the difference between a score on a leaderboard and a defensible trust claim. Build the infrastructure first. Then publish the score.

Summary: the controls that matter most

Priority	Control	Why it matters most
P0	Container isolation with fresh snapshot per task	Prevents OSWorld-class VM state manipulation
P0	Gold answer access control and no-internet policy	Prevents GAIA-class public answer exploitation
P0	Evaluator-agent process separation	Prevents SWE-bench-class state injection
P1	Pre-registration of evaluation protocol	Creates audit trail that methodology was fixed
P1	25%+ adversarial task inclusion	Tests behavior under real adversarial conditions
P1	Supply chain SBOM for evaluation environment	Prevents skill/tool/prompt tampering
P2	Prompt injection detection in eval environments	Addresses WebArena-class DOM injection exploitability
P2	GEPA trace data governance policy	Prevents trace accumulation of sensitive evaluation content
P2	Behavioral drift monitoring in production	Catches post-deployment divergence from benchmark baseline
P2	Independent Trust Oracle for production verification	Separates point-in-time benchmark claims from live behavioral evidence
