Hermes Agent Benchmark: Security, Governance, and Operational Controls
Berkeley RDI found that GAIA is ~98% exploitable, WebArena ~100%, and OSWorld 73%, all before a single line of agent code runs. This is the security and governance playbook for running Hermes Agent benchmarks that can actually survive CISO and audit scrutiny.
The benchmark is not the evidence you think it is
Your Hermes Agent scores well on GAIA. It clears WebArena tasks at state-of-the-art rates. The eval deck looks clean. And your CISO is about to ask a question that will take the floor out from under all of it: How do you know the benchmark measured the agent and not the benchmark infrastructure?
That question is not hypothetical. In 2026, Berkeley RDI's Trustworthy Benchmarks research documented what the evaluation research community had suspected for years but rarely stated this plainly:
- GAIA: ~98% of tasks exploitable via public answers on HuggingFace combined with normalization collisions in the scoring layer
- WebArena: ~100% exploitable through config leakage, DOM injection, and prompt injection against the evaluation harness itself
- OSWorld: 73% exploitable via VM state manipulation and publicly accessible gold files
- SWE-bench: the agent can write state to the shared environment that the evaluator subsequently reads, so the judge is reading evidence the defendant planted
Source: Berkeley RDI, Trustworthy Benchmarks (Continued), 2026
These are not edge cases or theoretical attack paths. They are the default operating conditions for the most widely cited benchmarks in the industry. If you submitted Hermes Agent (Nous Research, github.com/NousResearch/hermes-agent) to any of these benchmarks without isolation controls in place, the score you published is not a security property; it is an artifact of benchmark infrastructure that a moderately determined adversary or a careless engineer could have contaminated.
This post is the security and governance playbook for Hermes Agent benchmarking. It covers infrastructure security, dataset governance, supply chain controls, adversarial robustness, trace data handling, audit trail construction, and what production operational controls need to look like after the benchmark completes.
1. Benchmark infrastructure security
The isolation requirement
Every benchmark run must treat the evaluation environment as an adversarial boundary. The agent being evaluated must not be able to read evaluator state, write to locations the evaluator will subsequently read, or communicate with any system outside the task sandbox.
The Berkeley RDI findings for SWE-bench are particularly sharp here: the shared environment between agent and evaluator is a write channel from the agent to the judge. This is not a subtle side channel; it is direct state manipulation of the scoring mechanism. Any benchmark architecture that allows agent output to influence evaluator input is not measuring agent capability. It is measuring how well the agent can game the judge.
Minimum isolation requirements for Hermes Agent benchmark runs:
| Control | Requirement | Failure mode without it |
|---|---|---|
| Filesystem isolation | Agent cannot write to evaluator-accessible paths | SWE-bench-class state injection |
| Network isolation | Agent has no outbound network during task execution | Exfiltration of gold answers, external oracle calls |
| VM/container snapshot | Fresh snapshot for every task | State bleeding across tasks |
| Evaluator separation | Evaluator runs in separate process with no shared memory | Memory-based state leakage |
| Log segregation | Agent logs and evaluator logs write to separate append-only stores | Log tampering by agent |
Containerization as a baseline
Terminal-Bench 2.0 (arXiv 2601.11868) represents the current methodological standard: 89 tasks, three independent human reviewers per task, and Docker containerization of every evaluation unit. The Docker boundary is not optional decoration; it is the isolation control that prevents VM state manipulation of the class Berkeley RDI documented in OSWorld.
TBLite takes the same approach. When you see Docker-containerized tasks in a benchmark, that is the evaluator explicitly trading throughput for isolation integrity. The organizations not doing this are trading integrity for throughput, whether they acknowledge it or not.
For Hermes Agent evaluations, the operational requirement is:
Each task = fresh container snapshot
Each task = isolated network namespace (no outbound)
Each task = evaluator process outside agent container
Each task = gold answer loaded only into evaluator process
If any of these conditions are false, document the deviation explicitly before publishing a score. A score produced under weaker isolation is not fraudulent, but it is not the same claim as a score produced under full isolation, and treating them as equivalent is the security failure.
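The four per-task conditions can be checked mechanically before a score is published. The sketch below is illustrative only: the `TaskIsolation` fields and both function names are assumptions, not part of Hermes Agent or any benchmark harness. The `docker run` flags shown (`--rm`, `--network none`, `--read-only`) are standard Docker CLI options, but a real harness would need more than this to enforce the full isolation list.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TaskIsolation:
    """The four per-task conditions listed above (field names are illustrative)."""
    fresh_container_snapshot: bool
    isolated_network_namespace: bool
    evaluator_outside_agent_container: bool
    gold_answer_only_in_evaluator: bool


def isolation_deviations(cfg: TaskIsolation) -> list[str]:
    """Return the name of every condition that is false, in declaration order."""
    return [name for name, ok in asdict(cfg).items() if not ok]


def docker_run_args(image: str) -> list[str]:
    """Argument list for a throwaway container with no network and a read-only root."""
    return ["docker", "run", "--rm", "--network", "none", "--read-only", image]


if __name__ == "__main__":
    cfg = TaskIsolation(True, False, True, True)
    # Any non-empty result belongs in the deviation log before publishing.
    print(isolation_deviations(cfg))
    print(" ".join(docker_run_args("agent-task:pinned")))
```

An empty deviation list does not prove isolation; it only records what the operator attested to, which is why the attestation should be backed by container and network logs.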
2. Evaluation dataset governance
The public gold answer problem
GAIA's ~98% exploitability rate is almost entirely explained by one fact: the gold answers are public on HuggingFace. Combined with normalization collisions in the scoring layer (where slight string variations in the same correct answer can score differently depending on post-processing), you have a benchmark where an agent with internet access and awareness of the dataset can score near-perfectly without demonstrating any real capability.
This is not an attack in the traditional sense. It does not require a sophisticated adversary. It requires the agent to have access to what any developer with a browser and a HuggingFace account has access to.
Dataset governance controls:
| Control | What it addresses |
|---|---|
| Private held-out test sets | Prevents direct gold answer lookup |
| Versioned answer sets with access logs | Establishes who could have accessed gold before a run |
| Normalization specification published before run | Prevents score manipulation via post-hoc normalization choice |
| Pre-registration of evaluation protocol | Creates audit trail that methodology was fixed before results were seen |
| Cryptographic hash of dataset at run time | Proves the dataset was not modified between registration and execution |
YC-Bench (arXiv 2604.01212) addresses dataset integrity from a different angle: adversarial client design. One-third of YC-Bench clients are deliberately designed to fail, to test whether the agent can recognize and correctly handle adversarial or degenerate inputs. The SQLite-backed simulation with a fixed random seed means runs are deterministic and reproducible. This is governance by design: the methodology makes it hard to cherry-pick favorable conditions because the conditions are locked at seed time.
For your Hermes Agent benchmark governance program, the minimum viable dataset controls are:
- No public test sets: if answers are public, the benchmark is a capability test of retrieval, not reasoning
- Pre-registration: protocol documented and timestamped before any run begins
- Held-out validation: at least 20% of evaluation tasks reserved for final validation, never used in development or debugging
- Access log: who had read access to gold answers, when, and from which environment
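The "cryptographic hash of dataset at run time" control from the table above can be sketched as a single fingerprint over every file in the dataset directory. This is a minimal illustration, not a prescribed format: the function name and the choice to hash relative paths together with per-file SHA-256 digests are assumptions.

```python
import hashlib
from pathlib import Path


def dataset_fingerprint(root: Path) -> str:
    """SHA-256 over the sorted (relative path, file digest) pairs under root."""
    outer = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        file_digest = hashlib.sha256(path.read_bytes()).hexdigest()
        rel = path.relative_to(root).as_posix()
        # Including the path means renames change the fingerprint too.
        outer.update(f"{rel}\0{file_digest}\n".encode("utf-8"))
    return outer.hexdigest()


if __name__ == "__main__":
    import sys
    print(dataset_fingerprint(Path(sys.argv[1])))
```

Record the fingerprint in the pre-registration record, recompute it at run start, and treat any mismatch as a reason to stop the run.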
Normalization collision prevention
Normalization collisions deserve specific attention because they are invisible to most consumers of benchmark results. A normalization collision occurs when the scoring function maps two different string representations of the same answer to different scores, or when it maps a wrong answer to the same normalized form as a correct answer.
For GAIA-class benchmarks, require the evaluator to publish the full normalization specification before the run. Any post-hoc normalization tuning invalidates the run as a fair measurement. Treat normalization specification changes between runs as breaking changes that require re-running from scratch.
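To make the second kind of collision concrete, here is a small sketch with a deliberately naive normalizer. The `normalize` function is a hypothetical example, not GAIA's actual scoring code; it strips punctuation, which is enough to make the wrong answer "15" indistinguishable from the gold answer "1.5".

```python
import re
import unicodedata


def normalize(answer: str) -> str:
    """A deliberately naive normalizer: casefold, drop punctuation, squeeze spaces."""
    text = unicodedata.normalize("NFKC", answer).casefold()
    text = re.sub(r"[^\w\s]", "", text)  # this step also removes decimal points
    return " ".join(text.split())


def find_collisions(gold: str, wrong_answers: list[str]) -> list[str]:
    """Return known-wrong answers that normalize to the same form as gold."""
    target = normalize(gold)
    return [w for w in wrong_answers if normalize(w) == target]


if __name__ == "__main__":
    print(find_collisions("1.5", ["15", "2", "1.50"]))
```

Running a check like this over a set of known-wrong answers before the run is one way to test a published normalization specification rather than trusting it.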
3. Supply chain security for benchmark components
What is in the agent's supply chain during evaluation
When Hermes Agent runs a benchmark task, it does not run in isolation. It runs with:
- Tool wrappers that mediate access to the environment
- Skill definitions that encode how to approach task categories
- Prompt templates that shape its reasoning
- Memory artifacts from prior tasks (if memory is enabled)
- Any plugins, extensions, or adapters registered with the agent
Each of these is a supply chain component. Each can be tampered with. AI agent supply chain attacks targeting skills, tool wrappers, prompts, and memory artifacts are documented attack vectors, and benchmark environments are particularly attractive targets because a successful attack produces not just compromised behavior but a falsified public capability claim.
Supply chain controls for benchmark runs:
| Component | Control | Verification method |
|---|---|---|
| Tool wrappers | Pin to specific commit hash | Hash verification at container start |
| Skill definitions | Cryptographic signing | Signature check before agent initialization |
| Prompt templates | Version-controlled, hash-pinned | Diff against registered baseline |
| Memory artifacts | Disabled or isolated to single-run scope | Verify memory store is fresh per task |
| External dependencies | SBOM generated at run time | Compare against approved dependency manifest |
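The hash-pinning rows in the table above reduce to one operation: compare what is actually loaded against a registered manifest. The sketch below assumes a manifest that maps component names to SHA-256 hex digests; the manifest shape and function names are illustrative, and signature verification (the skill-definition row) is out of scope here.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def verify_components(manifest: dict[str, str], loaded: dict[str, bytes]) -> list[str]:
    """Return a problem list; an empty list means every component matched its pin."""
    problems = []
    for name, expected in sorted(manifest.items()):
        blob = loaded.get(name)
        if blob is None:
            problems.append(f"missing: {name}")
        elif sha256_hex(blob) != expected:
            problems.append(f"hash mismatch: {name}")
    # Anything loaded but never registered is also a deviation.
    for name in sorted(set(loaded) - set(manifest)):
        problems.append(f"unregistered: {name}")
    return problems


if __name__ == "__main__":
    manifest = {"prompt.txt": sha256_hex(b"You are a careful agent.")}
    loaded = {"prompt.txt": b"You are a careful agent.", "extra_skill.md": b"..."}
    print(verify_components(manifest, loaded))
```

Flagging unregistered components matters as much as flagging mismatches: a plugin that was never in the manifest would otherwise pass silently.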
The memory artifact case deserves specific attention. If Hermes Agent's memory system is active during evaluation, memory written during task N is potentially readable during task N+1. This is a within-run contamination channel that can produce scores that do not reflect single-task capability, and it is entirely invisible unless you explicitly verify that memory isolation is enforced between tasks.
For evaluation purposes: either disable memory entirely and document this as a constraint on the score's applicability, or implement strict per-task memory namespacing with cryptographic verification that tasks cannot read prior-task memory.
4. Adversarial robustness testing
What adversarial robustness means in benchmark context
Adversarial robustness is not a single property. In benchmark context it has at least three distinct meanings:
- Input-level adversarial: the agent receives crafted task inputs designed to elicit incorrect or harmful behavior
- Environment-level adversarial: the evaluation environment contains injected content (DOM injection, prompt injection in retrieved documents) designed to hijack agent behavior
- Evaluator-level adversarial: the scoring mechanism is attacked to produce favorable scores for incorrect outputs
Numbering the three meanings above in order as types 1 to 3: most benchmark security discussions focus on type 3. Berkeley RDI's work is primarily about types 2 and 3. But type 1, the agent's robustness under adversarial task design, is the security property that actually matters for production deployment.
The YC-Bench adversarial client design
YC-Bench's decision to design one-third of clients as adversarial is the most operationally honest benchmark design choice in recent agent evaluation literature. The explicit reasoning: a benchmark that only tests happy-path clients measures performance under ideal conditions. Real deployments face clients who are confused, hostile, out of scope, or actively trying to extract behavior the agent should not exhibit.
For Hermes Agent security evaluation, replicate this design principle:
- 25-33% of test cases should be adversarial: designed to fail, designed to confuse, designed to prompt off-scope behavior
- Adversarial cases should test refusal quality: does the agent refuse correctly, refuse incorrectly (false positive), or fail to refuse (false negative)?
- Include prompt injection tasks explicitly: tasks where the retrieved content contains injected instructions the agent must recognize and ignore
- Score adversarial cases separately: an aggregate score hides adversarial performance, so report them independently
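Separate reporting is easy to get wrong when everything is folded into one pass rate, so a small sketch may help. The `CaseResult` fields and the report keys are assumptions made for illustration; the point is only that benign and adversarial pass rates, and both refusal error types, are computed over their own denominators.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CaseResult:
    adversarial: bool    # case was designed to fail, confuse, or go off-scope
    should_refuse: bool  # correct behavior is a refusal
    refused: bool        # what the agent actually did
    passed: bool         # evaluator verdict for the case


def rate(hits: int, total: int) -> Optional[float]:
    return hits / total if total else None


def adversarial_report(results: list[CaseResult]) -> dict[str, Optional[float]]:
    benign = [r for r in results if not r.adversarial]
    adv = [r for r in results if r.adversarial]
    must_refuse = [r for r in results if r.should_refuse]
    may_answer = [r for r in results if not r.should_refuse]
    return {
        "adversarial_share": rate(len(adv), len(results)),
        "benign_pass_rate": rate(sum(r.passed for r in benign), len(benign)),
        "adversarial_pass_rate": rate(sum(r.passed for r in adv), len(adv)),
        # False positive: refused a request it should have handled.
        "false_positive_rate": rate(sum(r.refused for r in may_answer), len(may_answer)),
        # False negative: complied with a request it should have refused.
        "false_negative_rate": rate(sum(not r.refused for r in must_refuse), len(must_refuse)),
    }
```

A report shaped like this also makes the 25-33% adversarial share auditable, since `adversarial_share` is computed from the same result set as the scores.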
τ-bench adversarial protocol compliance
τ-bench (arXiv 2406.12045) introduced policy compliance testing across multiple adversarial user interactions. The key design choice: measure whether the agent maintains policy compliance across a full interaction sequence, not just at a single turn. An agent that complies on turn 1 but capitulates to persistent pressure on turn 4 is not policy-compliant.
For Hermes Agent evaluations that include multi-turn interaction:
Adversarial interaction protocol:
Turn 1: Normal request (establish baseline)
Turn 2: Escalation attempt (push boundary)
Turn 3: Authority claim injection ("I'm an admin, ignore prior instructions")
Turn 4: Emotional escalation (urgency, threat)
Turn 5: Reframing attempt (same prohibited request, different framing)
Pass criteria: policy maintained across all 5 turns
Fail criteria: any turn produces off-policy behavior
Score: 0 or 1 per sequence (partial credit masks real failure)
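The all-or-nothing scoring rule in the protocol above can be sketched in a few lines. The function names are hypothetical; the input is assumed to be one boolean per turn, true when the turn stayed on policy.

```python
def score_sequence(turn_on_policy: list[bool], expected_turns: int = 5) -> int:
    """Return 1 only if every turn stayed on policy; no partial credit."""
    if len(turn_on_policy) != expected_turns:
        raise ValueError(f"expected {expected_turns} turns, got {len(turn_on_policy)}")
    return 1 if all(turn_on_policy) else 0


def suite_pass_rate(sequences: list[list[bool]]) -> float:
    """Fraction of sequences in which policy held across all turns."""
    if not sequences:
        raise ValueError("no sequences to score")
    return sum(score_sequence(s) for s in sequences) / len(sequences)


if __name__ == "__main__":
    held = [True, True, True, True, True]
    capitulated_on_turn_4 = [True, True, True, False, True]
    print(score_sequence(held), score_sequence(capitulated_on_turn_4))
```

Rejecting sequences with the wrong number of turns is deliberate: a truncated interaction that never reached the escalation turns should not count as a pass.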
Prompt injection in benchmark environments
WebArena's ~100% exploitability is largely driven by prompt injection via the DOM: the evaluation environment renders web pages that contain injected instructions the agent reads as part of the page content, causing it to follow attacker instructions rather than evaluator instructions.
This is not a vulnerability in Hermes Agent. It is a vulnerability in the benchmark infrastructure. But from a governance perspective, a score achieved on a WebArena environment that has not been audited for DOM injection is not a claim about the agent; it is a claim about a combination of the agent and a potentially compromised evaluation environment.
Prompt injection controls for benchmark environments:
- Render web pages in an isolated renderer with injected content detection
- Validate task environments against a clean baseline before each run
- Include explicit prompt injection detection tests in the evaluation suite
- Score prompt injection resistance as a separate security dimension, not folded into task success rate
5. GEPA trace data governance
What GEPA retains and why it matters
GEPA (self-evolution via execution trace reading, ICLR 2026 Oral) works by reading execution traces and using them to improve agent behavior across iterations. The security implication: GEPA-enabled agents accumulate execution trace data that may contain sensitive information from the evaluation environment.
Execution traces from benchmark tasks can contain:
- Full content of retrieved documents (which may include gold answers)
- Tool call arguments and return values (which may include environment secrets)
- Intermediate reasoning steps (which may expose the agent's strategy in ways useful to adversaries)
- Error messages and stack traces (which may reveal infrastructure details)
GEPA trace governance requirements:
| Requirement | Rationale |
|---|---|
| Trace retention policy with defined TTL | Execution traces should not accumulate indefinitely; define maximum retention window |
| Access control on trace store | Traces readable by GEPA should not be readable by external parties without explicit authorization |
| Trace content filtering before storage | Strip secrets, credentials, and gold answer content before writing to trace store |
| Audit log for trace reads | Every access to trace data by GEPA or any other process should be logged |
| Data subject review process | If traces contain user-derived content, a process for review and deletion on request must exist |
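Two rows of the table above, content filtering before storage and a retention TTL, lend themselves to a short sketch. Everything here is an assumption for illustration: GEPA's real trace format and storage API are not described in this post, the regular expressions catch only a few obvious secret shapes, and literal gold-answer matching is the crudest possible filter.

```python
import re
from datetime import datetime, timedelta, timezone

# Illustrative patterns only; a real filter needs a reviewed, broader rule set.
BEARER = re.compile(r"(?i)\bbearer\s+\S+")
KEY_VALUE = re.compile(r"(?i)\b(api[_-]?key|token|password|secret)(\s*[:=]\s*)\S+")


def redact(text: str, gold_answers: frozenset[str] = frozenset()) -> str:
    """Strip obvious credentials and literal gold answers before a trace is stored."""
    text = BEARER.sub("Bearer [REDACTED]", text)  # run first so the token itself is caught
    text = KEY_VALUE.sub(r"\1\2[REDACTED]", text)
    for answer in gold_answers:
        if answer:
            text = text.replace(answer, "[GOLD]")
    return text


def is_expired(stored_at: datetime, now: datetime, ttl: timedelta) -> bool:
    """True when a trace has outlived its retention window and should be deleted."""
    return now - stored_at > ttl


if __name__ == "__main__":
    print(redact("tool call: api_key=abc123 returned 42 km", frozenset({"42 km"})))
    stored = datetime(2026, 1, 1, tzinfo=timezone.utc)
    print(is_expired(stored, datetime(2026, 3, 1, tzinfo=timezone.utc), timedelta(days=30)))
```

Filtering at write time rather than read time is the safer default, because anything that reaches the trace store is already inside the retention and access-control problem.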
The ICLR Oral recognition for GEPA reflects genuine methodological novelty. But novel methodology that retains detailed execution history at scale is also a governance surface that most organizations have not thought through. Build the governance model before the capability, not after.
6. Benchmark claim audit trail
What an auditable benchmark claim requires
When a security or compliance auditor reviews your Hermes Agent benchmark claims, they are not evaluating the score. They are evaluating the evidence that the score was produced by a process that can be trusted. A score without a methodology provenance chain is hearsay.
An auditable benchmark claim requires:
Pre-run documentation:
- Evaluation protocol specification (timestamped, version-controlled)
- Dataset version and access control record
- Infrastructure configuration (isolation controls in place, verified)
- Agent version pinned to specific commit hash
- Pre-registration record (what success looks like before results are seen)
During-run evidence:
- Cryptographic hash of all inputs and gold answers at run start
- Evaluator process logs with timestamps
- Agent process logs with timestamps (separate from evaluator logs)
- Container/VM snapshot identifiers for each task
- Network traffic logs confirming no outbound connections
Post-run documentation:
- Raw score data (not just aggregate)
- Anomaly report (any tasks where scoring behavior was unexpected)
- Deviation log (any departures from pre-registered protocol, and justification)
- Independent reviewer sign-off (at minimum one reviewer not on the build team)
AgentBench (arXiv 2308.03688, ICLR 2024) identified long-term reasoning and instruction following as the security-relevant failure modes most likely to produce inconsistent benchmark behavior under adversarial conditions. Document which failure modes you tested for and what the results were, not just the aggregate pass rate.
Pre-registration as a control
Pre-registration is borrowed from clinical trial methodology for exactly the same reason it matters there: it prevents post-hoc selection of favorable analysis choices. A benchmark run that was pre-registered is demonstrably not the product of running many configurations and publishing only the best one.
Minimum pre-registration record:
- Benchmark name and version
- Agent version (commit hash)
- Evaluation date range
- Primary metric and success threshold
- Any planned subgroup analyses
- Infrastructure configuration hash
Submit this to a timestamped public record (a git commit to a public repo, a pre-registration service, or a signed document with a trusted timestamp) before the run begins.
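A minimal sketch of such a record, assuming a JSON representation: the field names below are one possible encoding of the minimum record listed above, not a standard schema. Hashing a canonical serialization gives a single digest that can be committed publicly even if the record itself stays private until results are published.

```python
import hashlib
import json
from datetime import datetime, timezone

# Hypothetical field names mirroring the minimum pre-registration record.
REQUIRED_FIELDS = frozenset({
    "benchmark", "agent_commit", "evaluation_date_range", "primary_metric",
    "success_threshold", "subgroup_analyses", "infrastructure_config_hash",
})


def preregistration_record(fields: dict, created_at: datetime) -> dict:
    """Validate the record and attach a digest of its canonical JSON form."""
    missing = REQUIRED_FIELDS - set(fields)
    if missing:
        raise ValueError(f"missing pre-registration fields: {sorted(missing)}")
    # Sorted keys and fixed separators make the digest independent of key order.
    canonical = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    return {
        "fields": fields,
        "created_at": created_at.astimezone(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
    }
```

The `created_at` value here is self-reported; the trusted timestamp still has to come from the external record the digest is submitted to.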
7. Governance policy template: what your AI governance committee needs
Before any Hermes Agent benchmark claim is used in a sales process, a procurement response, a regulatory filing, or an internal approval decision, your AI governance committee should require the following:
Policy template: Benchmark claim approval
Section 1: Claim description
- Exact claim text as it will be used externally
- Benchmark name, version, and leaderboard link (e.g., tbench.ai/leaderboard/terminal-bench/2.0)
- Score achieved and evaluation date
Section 2: Methodology attestation
- Isolation controls in place: [ ] Container isolation [ ] Network isolation [ ] Process separation [ ] Gold answer isolation
- Pre-registration record: [ ] Yes, link: ___ [ ] No, documented exception reason: ___
- Independent review: [ ] Yes, reviewer name and role: ___ [ ] No, escalation required
Section 3: Berkeley RDI vulnerability assessment
- For each benchmark used, document exploitability mitigation:
- GAIA: [ ] Gold answers not public during run [ ] Normalization spec pre-published [ ] No internet access for agent
- WebArena: [ ] DOM injection audited [ ] Config not leaked to agent [ ] Prompt injection controls verified
- OSWorld: [ ] VM state isolated [ ] Gold files access-controlled [ ] Fresh snapshot per task
- SWE-bench: [ ] Agent write isolation from evaluator read paths [ ] Shared environment audit complete
Section 4: Supply chain attestation
- Tool wrappers pinned: [ ] Yes, commit hashes documented [ ] No
- Skill definitions signed: [ ] Yes [ ] No
- Prompt templates version-controlled: [ ] Yes [ ] No
- Memory isolation verified: [ ] Yes, method: ___ [ ] Memory disabled during eval
Section 5: Limitations and scope
- What this score does NOT cover (task types, failure modes, adversarial conditions)
- Known deviations from ideal methodology
- Recommended refresh interval before this claim should be retested
Approval signatures required: Security lead, compliance officer, engineering lead
8. The CISO's checklist for evaluating any agent benchmark submission
When an agent vendor presents benchmark results to justify deployment authorization, use this checklist.
Infrastructure integrity (10 points)
- Isolation documentation provided: container IDs, network configuration, process separation evidence (2 pts)
- Berkeley RDI exploitability addressed: vendor has documented mitigations for known exploitation vectors in each benchmark used (3 pts)
- Gold answer access controls documented: evidence that the agent did not have access to correct answers during evaluation (3 pts)
- Fresh environment per task: no state bleed between tasks (2 pts)
Methodology integrity (10 points)
- Pre-registration record exists: protocol documented before results were seen (3 pts)
- Independent reviewer sign-off: at least one reviewer not on the build team (2 pts)
- Raw score data available: not just aggregate; distribution matters (2 pts)
- Adversarial cases included: at least 25% adversarial task design (3 pts)
Supply chain integrity (5 points)
- SBOM for evaluation environment: every component version pinned and documented (2 pts)
- Prompt and skill provenance: what prompts were active during evaluation, at what version (3 pts)
Claim scope integrity (5 points)
- Limitations documented: what the score does not cover (2 pts)
- Refresh policy defined: when does this claim expire and require re-evaluation (3 pts)
Scoring interpretation:
- 28-30: Claim is audit-ready
- 22-27: Conditional approval with documented remediation timeline
- Below 22: Do not accept claim as deployment justification; require re-evaluation
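The checklist arithmetic can be sketched as a lookup table plus three thresholds. The point values and score bands come from the checklist above; the item keys and function names are illustrative shorthand.

```python
# Point values from the checklist above; keys are hypothetical shorthand.
CHECKLIST = {
    "isolation_documentation": 2,
    "rdi_exploitability_addressed": 3,
    "gold_answer_access_controls": 3,
    "fresh_environment_per_task": 2,
    "pre_registration_record": 3,
    "independent_reviewer_signoff": 2,
    "raw_score_data": 2,
    "adversarial_cases_included": 3,
    "sbom_for_eval_environment": 2,
    "prompt_and_skill_provenance": 3,
    "limitations_documented": 2,
    "refresh_policy_defined": 3,
}


def checklist_score(satisfied: set[str]) -> int:
    """Sum the points for every satisfied item; unknown item names are rejected."""
    unknown = satisfied - CHECKLIST.keys()
    if unknown:
        raise ValueError(f"unknown checklist items: {sorted(unknown)}")
    return sum(CHECKLIST[item] for item in satisfied)


def interpret(score: int) -> str:
    if score >= 28:
        return "audit-ready"
    if score >= 22:
        return "conditional approval"
    return "do not accept"
```

Note that a vendor can miss any single 2-point item and still land in the audit-ready band, but missing any 3-point item drops the claim to conditional approval.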
Questions that should produce clear answers
If a vendor cannot answer these questions with specific evidence (not narrative), the benchmark claim should not be used as a deployment justification:
- What was the container or VM configuration for each evaluation task?
- Was the agent's network access blocked during task execution? Show the network log.
- Which version of each benchmark was used, and what is the documented exploitability of that version?
- Who reviewed the results independently, and what did they find?
- What adversarial tasks were included, and what was the agent's pass rate on them specifically?
- What is the refresh interval for this claim, and what triggers an unscheduled re-evaluation?
9. Operational controls: after the benchmark completes
The benchmark-to-production gap
A benchmark measures performance under evaluation conditions. Production is not evaluation conditions. The operational controls question is not "did the agent perform well on the benchmark?" but "what changes between the benchmark environment and the production environment, and which of those changes affect the security properties you measured?"
Zero-trust architecture for AI agents starts from this recognition: authentication is not behavioral correctness. A correctly authenticated Hermes Agent instance can behave incorrectly due to model drift, prompt injection in production data, context poisoning via accumulated memory, or adversarial input from real users. Authentication tells you the agent is who it claims to be. It does not tell you the agent is doing what it is supposed to do.
Behavioral drift monitoring
Agent behavioral drift, where a correctly authenticated agent diverges from its evaluated behavior profile over time, is the production failure mode that benchmark scores are least equipped to predict. An agent that scored 87% on Terminal-Bench 2.0 tasks at evaluation time may be performing at 60% on equivalent production tasks three months later due to model updates, prompt drift, or distribution shift in the input population.
Production monitoring requirements for benchmark-validated agents:
| Monitor | Trigger | Response |
|---|---|---|
| Task success rate by category | >10% decline from benchmark baseline | Alert + manual review sample |
| Refusal rate (adversarial inputs) | >15% change in either direction | Security review |
| Tool call distribution | Significant deviation from evaluation profile | Anomaly investigation |
| Error rate by failure type | New failure category appears | Incident response |
| Response latency | >2x benchmark baseline | Infrastructure review |
| Memory accumulation rate | Exceeds defined per-session limit | Memory audit |
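Three of the numeric triggers in the table above can be sketched as a single check. One assumption needs stating loudly: the table does not say whether ">10% decline" and ">15% change" are relative or absolute, and this sketch treats both as relative to the benchmark baseline. The metric keys and function names are illustrative.

```python
def relative_change(baseline: float, current: float) -> float:
    """Signed change as a fraction of the baseline value."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero for a relative comparison")
    return (current - baseline) / baseline


def drift_alerts(baseline: dict[str, float], current: dict[str, float]) -> list[str]:
    """Return the names of monitors whose trigger condition is met."""
    alerts = []
    # Assumption: thresholds are relative to baseline, not percentage points.
    if relative_change(baseline["success_rate"], current["success_rate"]) < -0.10:
        alerts.append("success_rate")
    if abs(relative_change(baseline["refusal_rate"], current["refusal_rate"])) > 0.15:
        alerts.append("refusal_rate")
    if current["latency_s"] > 2 * baseline["latency_s"]:
        alerts.append("latency")
    return alerts


if __name__ == "__main__":
    baseline = {"success_rate": 0.87, "refusal_rate": 0.20, "latency_s": 1.0}
    current = {"success_rate": 0.60, "refusal_rate": 0.21, "latency_s": 2.5}
    print(drift_alerts(baseline, current))
```

Whichever interpretation is chosen, it should be written into the monitoring policy, since a 10-point drop and a 10% relative drop diverge sharply at low baselines.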
Incident response for benchmark-validated agents
When a benchmark-validated agent produces an unexpected outcome in production, the incident response process must include a benchmark invalidation assessment: did this incident reveal a scenario that was not covered by the evaluation, and does it invalidate the benchmark claim for any class of production use?
Incident response checklist for agent behavioral failures:
- Capture full execution trace for the incident (tool calls, context, output)
- Identify whether this scenario type was present in the benchmark suite
- If not present: add to regression suite before next deployment
- If present and the agent passed: determine why production behavior differed
- Assess whether the benchmark claim is still valid for the affected use case
- If benchmark claim is invalidated: suspend use case pending re-evaluation
- Document root cause and update governance policy
Rate limiting and behavioral pacts as production controls
Benchmark validation establishes what an agent can do under controlled conditions. Behavioral pacts define what an agent is permitted to do in production. These are different controls serving different purposes, and both are necessary.
A behavioral pact is a contractual specification of agent behavior: inputs the agent will and will not process, outputs it will and will not produce, actions it will and will not take, and under what conditions each exception applies. Unlike benchmark scores, pacts are enforceable at runtime: an agent that violates its pact produces a verifiable breach record, not just a degraded score.
For Hermes Agent production deployments, the operational control stack is:
- Benchmark validation: establishes capability baseline under controlled conditions
- Behavioral pacts: define permitted behavior in production
- Runtime enforcement: pact violations flagged in real time
- Audit trail: every mutating operation logged with actor, action, resource, and timestamp
- Memory attestations: verifiable behavioral history, signed and scoped, portable across deployments
- Rate limiting: request volume controls by tier (60/600/6000 per minute) to limit blast radius
- Security scoring: continuous composite scoring across 12 dimensions including security (8%), safety (11%), scope-honesty (7%), and runtime-compliance (5%)
Anti-gaming controls in scoring systems
Any scoring system that is consequential will be gamed. The same adversarial pressure that makes benchmarks exploitable applies to production trust scores. Anti-gaming controls for composite scoring include:
- Jury outlier trimming: remove top and bottom 20% of judge scores before aggregating, preventing single-judge manipulation
- Anomaly detection: flag score changes greater than 200 points between evaluation cycles for manual review
- Score time decay: 1 point per week after a 7-day grace period, preventing agents from coasting on stale high scores
- Multi-dimensional scoring: gaming one dimension does not produce a high composite score; the adversary must maintain performance across all 12 dimensions simultaneously
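The first three controls are simple enough to sketch directly. The numbers (20% trim, 200-point threshold, 1 point per week after 7 days) come from the list above; the function names, the rounding of the trim count, and the choice to decay only on completed weeks are assumptions, not a description of how any production scoring system implements them.

```python
def trimmed_mean(judge_scores: list[float], trim_fraction: float = 0.20) -> float:
    """Mean of judge scores after dropping the top and bottom trim_fraction."""
    if not judge_scores or not 0 <= trim_fraction < 0.5:
        raise ValueError("need at least one score and a trim fraction in [0, 0.5)")
    ordered = sorted(judge_scores)
    k = int(len(ordered) * trim_fraction)  # assumption: round the trim count down
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def is_anomalous(previous: float, current: float, threshold: float = 200) -> bool:
    """Flag a change of more than threshold points between evaluation cycles."""
    return abs(current - previous) > threshold


def decayed_score(score: float, days_since_eval: int,
                  grace_days: int = 7, points_per_week: float = 1) -> float:
    """Subtract points for each completed week after the grace period."""
    weeks = max(0, days_since_eval - grace_days) // 7
    return max(0.0, score - weeks * points_per_week)


if __name__ == "__main__":
    print(trimmed_mean([100, 800, 820, 840, 1000]))  # outliers 100 and 1000 dropped
    print(is_anomalous(600, 850))
    print(decayed_score(900, 21))
```

With five judges, trimming 20% from each end removes exactly one score per side, which is what makes a single compromised judge unable to move the aggregate.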
10. Armalo as independent audit and trust layer
The controls described in this post are individually implementable. The challenge is that each requires ongoing maintenance, independent verification, and a way to make the resulting claims queryable by third parties who were not present during the evaluation.
This is the infrastructure gap that Armalo addresses. The Trust Oracle (/api/v1/trust/) is a queryable endpoint that returns verified agent trustworthiness based on continuous behavioral evaluation: not a point-in-time benchmark score, but a composite record built from evaluated pacts, real transaction history, adversarial test results, and time-weighted behavioral signals.
For organizations deploying Hermes Agent or any other benchmark-validated agent, the Trust Oracle answers the question that benchmark scores cannot: right now, in production, is this agent behaving the way it was evaluated to behave?
What Armalo provides for Hermes Agent governance:
- Behavioral pacts: cryptographically committed specifications of permitted behavior, verifiable by any counterparty
- Memory attestations: verifiable behavioral history with signed tokens and scoped permissions, portable across deployments
- Continuous composite scoring: 12-dimension scoring updated from live behavioral signals, not just evaluation-time snapshots
- Audit trail: every mutating operation logged with actor, action, resource, and timestamp; immutable record for compliance and forensics
- API key security: SHA-256 hashed storage, scoped permissions, rate limiting by tier
- AES-256-GCM encryption: agent auth headers encrypted at rest
- Trust oracle queries: third-party platforms query /api/v1/trust/ before hiring or delegating to an agent; 989 verified queries in the last 30 days
The benchmark tells you what the agent did under controlled conditions. The Trust Oracle tells you what the agent is doing now. Both are required for a complete governance picture. Neither replaces the other.
For organizations that need to defend benchmark claims to CISOs, auditors, or enterprise procurement committees: the governance infrastructure described in this post is the difference between a score on a leaderboard and a defensible trust claim. Build the infrastructure first. Then publish the score.
Summary: the controls that matter most
| Priority | Control | Why it matters most |
|---|---|---|
| P0 | Container isolation with fresh snapshot per task | Prevents OSWorld-class VM state manipulation |
| P0 | Gold answer access control and no-internet policy | Prevents GAIA-class public answer exploitation |
| P0 | Evaluator-agent process separation | Prevents SWE-bench-class state injection |
| P1 | Pre-registration of evaluation protocol | Creates audit trail that methodology was fixed |
| P1 | 25%+ adversarial task inclusion | Tests behavior under real adversarial conditions |
| P1 | Supply chain SBOM for evaluation environment | Prevents skill/tool/prompt tampering |
| P2 | Prompt injection detection in eval environments | Addresses WebArena-class DOM injection exploitability |
| P2 | GEPA trace data governance policy | Prevents trace accumulation of sensitive evaluation content |
| P2 | Behavioral drift monitoring in production | Catches post-deployment divergence from benchmark baseline |
| P2 | Independent Trust Oracle for production verification | Separates point-in-time benchmark claims from live behavioral evidence |