Hermes Agent Benchmark: Metrics, Scorecards, and Review Cadence

2026-04-14 · 18 min · Armalo Team

The specific Prometheus and W&B metrics that matter for Hermes Agent benchmarking, how to build scorecards across development and production stages, and how to set review cadences that detect behavioral drift before it becomes an incident.

Hermes Agent Benchmarking: Metrics, Scorecards, and Review Cadences That Catch Drift Before It Causes Incidents

Most teams deploying Hermes Agent treat benchmarking as a one-time gate at deployment. Run the eval, record the score, ship. Three months later, cost per task has doubled, a subtle behavioral change has compounded into a support incident, and nobody has a systematic record of when things started going wrong.

This post is a practical guide to doing it differently: the specific metrics Hermes exposes, what those metrics actually measure in production, how to build scorecards that serve engineering, procurement, security, and executive audiences, and how to set a review cadence that catches drift before it becomes an incident.

The Benchmark Landscape: What You're Actually Measuring

Before building scorecards, you need to understand what the major benchmarks test — and what they systematically omit.

The Three Active Tracks

TBLite (100 tasks) is the fast iteration proxy. A hundred deterministic tasks covering narrow skill domains: file manipulation, API calls, structured output, context recall. Useful for catching regressions during development. Useless as a standalone deployment gate. TBLite scores correlate poorly with production performance on multi-step, ambiguous, or open-ended tasks.

YC-Bench (arXiv 2604.01212) is the financial autonomy benchmark: give an agent a starting capital allocation and measure whether it ends with more. The headline figures are worth examining closely: $1.27M for Claude Opus 4.6 and $1.21M for GLM-5 are the average final funds those models reached from a $200K starting capital, not the token cost of running the benchmark. Only 3 of the 12 tested models exceeded starting capital, making this one of the harshest real-world filters available. YC-Bench requires a minimum of 3 seeds per model; single-seed results on this benchmark should be treated as noise.

Terminal-Bench 2.0 (arXiv 2601.11868) tests 89 terminal-environment tasks with 3 human reviewers per task. Claude Mythos Preview scores 82%. The three-reviewer design matters: it surfaces inter-rater disagreement and catches tasks where the correct answer is genuinely ambiguous. When you adapt Terminal-Bench tasks to your own environment, keep the three-reviewer structure — single-reviewer evals reliably overestimate agent capability.

The τ-bench Reliability Problem

τ-bench (arXiv 2406.12045, Sierra Research) introduced the pass^k metric, which measures the probability that an agent completes a task successfully on all k consecutive attempts. This is the most important single metric for understanding production reliability, and most teams skip it entirely.

The math is brutal. An agent with a 70% single-pass success rate has a pass^8 reliability of:

pass^8 = 0.70^8 ≈ 0.058 (5.8%)

For gpt-4o in retail scenarios, τ-bench found single-pass rates below 50%, which means pass^8 drops below 0.4%. That is not a production-viable agent for any task that requires consistent execution.

The formula for calculating pass^k from a known single-pass rate p, assuming every attempt is independent and has the same success probability:

pass^k = p^k

For a target production reliability of 95% on eight consecutive runs, the minimum required single-pass rate is:

p_min = 0.95^(1/8) = 0.9936 (99.4%)

Almost no agents clear this bar today. The practical implication is that any agent handling workflows with multi-step execution dependencies needs either human-in-the-loop checkpoints or formal retry/escalation logic wired into the pact.
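The arithmetic above can be reproduced with a short sketch. The function names `pass_k` and `min_single_pass` are illustrative, not part of any Hermes or τ-bench tooling, and the calculation assumes independent attempts with an identical success probability.

```python
def pass_k(p: float, k: int) -> float:
    """Probability of k consecutive successes, assuming independent attempts."""
    return p ** k


def min_single_pass(target: float, k: int) -> float:
    """Smallest single-pass rate p such that p**k reaches the target reliability."""
    return target ** (1.0 / k)


# Single-pass rates to compare; 0.70 and 0.50 are the cases discussed in the text.
for p in (0.50, 0.70, 0.90, 0.99):
    print(f"p={p:.2f}  pass^8={pass_k(p, 8):.4f}")

print(f"p_min for 95% over 8 runs: {min_single_pass(0.95, 8):.4f}")
```

Running the loop makes the steepness of the curve visible: even a 99% single-pass rate leaves pass^8 noticeably below 95%.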

Benchmark Vulnerabilities You Need to Know

Berkeley RDI research found that GAIA is exploitable in 98% of tasks, WebArena in approximately 100%, and OSWorld in 73%. This means benchmark scores on these evaluations are partially a measure of how aggressively the agent exploits evaluation artifacts rather than genuine capability.

Specific distortions to track:

SWE-bench: 7.7% of Lite and 5.2% of Verified tasks have test validity issues. Claude Opus 4.7 scores 87.6%, but that score is measured against a benchmark with known test validity problems.
GAIA: Human performance is 92% versus 15% for GPT-4 with plugins, suggesting the gap between human and agent capability is real, but the absolute numbers require context.
WebArena: Overestimates agent performance by 5.2% due to string matching artifacts. Human baseline is 78.24%. Any WebArena score should be deflated by at least 5 percentage points for honest comparison.
AgentBench (arXiv 2308.03688, Tsinghua, ICLR 2024): Identifies the primary capability bottlenecks as long-term reasoning, decision-making, and instruction following — not tool use or API integration. This is the correct decomposition for diagnosing why an agent underperforms on complex tasks.

Hermes Agent Metrics: What the Prometheus and W&B Surfaces Tell You

Hermes Agent exposes six primary metrics via Prometheus and Weights & Biases. Here is what each metric actually measures and what changes should trigger action.

Metric Definitions

Metric	Definition	Collection Surface	Signal Value
`skill_efficiency_score`	Tasks completed per hour, normalized by task complexity tier	Prometheus gauge	Baseline regression detection
`memory_retrieval_accuracy`	Percentage of relevant memories fetched per task, measured by human-labeled relevance sample	W&B + custom eval	Context quality; predicts multi-step coherence
`self_modification_success_rate`	Ratio of accepted to total patches generated by the optimizer module	Prometheus counter	GEPA cycle health; tracks self-improvement trajectory
`success_rate_by_task_type`	Pass/fail by task category (structured output, tool call, long-context, adversarial, etc.)	W&B sweep	Capability gap identification
`improvement_cycles_needed`	Number of GEPA cycles required to converge on target performance	W&B run summary	Iteration efficiency; detects plateau signals
`token_cost_per_execution`	Total tokens consumed per task execution including reflection and optimizer passes	Prometheus histogram	Cost control; catch cost drift before it compounds

The Token Overhead Reality

Hermes Agent's reflection and optimizer modules add 15–25% token overhead to every execution. This is not a bug — it is the mechanism by which GEPA cycles improve performance. The overhead is worth it: Nous Research internal data shows that after 20+ GEPA cycles, task completion speed improves by 40% and accuracy gains are substantial.

The GEPA architecture (ICLR 2026 Oral) achieves 35× fewer rollouts compared to GRPO while delivering +6% average improvement and +20% on specific task types, with MATH benchmark performance at 93% versus GRPO's 67%. The token overhead from reflection is the price of those gains.

However, the cost asymmetry across task types is extreme. Hermes agents can make 2,000+ API calls per complex task, with a 50× cost variation between the cheapest and most expensive tasks ($0.10 to $5.00 per task). Most published benchmarks omit cost data entirely. Your scorecard must include it.
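To see why the cost spread matters, here is a minimal sketch of how a shift in task mix moves mean cost per task. The two-tier split and the mix percentages are hypothetical; only the $0.10 and $5.00 endpoints come from the range quoted above.

```python
def expected_cost(mix: dict[str, float], cost: dict[str, float]) -> float:
    """Mean cost per task for a task mix given as shares that sum to 1."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mix shares must sum to 1")
    return sum(share * cost[tier] for tier, share in mix.items())


# Endpoints of the quoted cost range; the tier names are illustrative.
COST = {"simple": 0.10, "complex": 5.00}

before = expected_cost({"simple": 0.90, "complex": 0.10}, COST)
after = expected_cost({"simple": 0.78, "complex": 0.22}, COST)
print(f"before ${before:.2f}, after ${after:.2f}, ratio {after / before:.2f}x")
```

In this hypothetical, moving 12 percentage points of traffic from the cheap tier to the expensive tier roughly doubles mean cost per task, with no change in success rate.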

Building the Scorecard: Stage-by-Stage Framework

Stage 1: Development Scorecard

During development, you are tracking whether the agent can reach a performance floor, not whether it is production-ready. The development scorecard focuses on capability across task types and GEPA trajectory.

Development Scorecard Template:

Dimension	Current Score	Target	Delta	Last Reviewed	Trend
TBLite success rate	—	≥85%	—	—	—
pass^1 by task type: structured output	—	≥90%	—	—	—
pass^1 by task type: tool call	—	≥85%	—	—	—
pass^1 by task type: long-context	—	≥75%	—	—	—
pass^1 by task type: adversarial	—	≥60%	—	—	—
Memory retrieval accuracy	—	≥80%	—	—	—
Token cost per execution (p50)	—	<$0.50	—	—	—
Token cost per execution (p95)	—	<$2.00	—	—	—
GEPA improvement cycles to target	—	<15	—	—	—
Self-modification success rate	—	≥70%	—	—	—

The GEPA trajectory is the leading indicator in development. An agent that requires more than 20 cycles to reach target performance on TBLite tasks is probably misconfigured at the architecture level, not just undertrained. Investigate memory retrieval accuracy first — low retrieval accuracy forces the optimizer to relearn context it has already processed.

Stage 2: Pre-Deployment Scorecard

Pre-deployment validation adds three things the development scorecard omits: pass^k reliability (not just pass^1), cost-adjusted scoring, and benchmark freshness tracking.

Computing Cost-Normalized Success Rate

Raw success rate hides cost asymmetry. The cost-normalized success rate formula:

CNSR = (success_rate × task_value) / token_cost_per_execution

Where task_value is the dollar value of a successful task completion, normalized to a common unit (e.g., $1.00 for baseline tasks). A 90% success rate at $4.50 per task execution scores worse than an 85% success rate at $0.80 per task execution at any task value, because task_value scales both sides equally: 0.90 / 4.50 = 0.20 versus 0.85 / 0.80 ≈ 1.06 per dollar of task value. At a $5.00 task value, that is a CNSR of 1.0 versus roughly 5.3.

For production gating, require CNSR ≥ 1.5 before deployment (expected task value, meaning success rate × task value, is at least 1.5× execution cost at the p50 cost point).
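A minimal sketch of the CNSR formula and the 1.5 gate follows. The function names are illustrative, and the example assumes cost is expressed in dollars per execution and a hypothetical task value of $5.00.

```python
def cnsr(success_rate: float, task_value: float, cost_per_execution: float) -> float:
    """Cost-normalized success rate: expected task value per dollar of execution cost."""
    if cost_per_execution <= 0:
        raise ValueError("cost_per_execution must be positive")
    return success_rate * task_value / cost_per_execution


def passes_gate(score: float, threshold: float = 1.5) -> bool:
    """Deployment gate from the text: CNSR must be at least the threshold."""
    return score >= threshold


# Two hypothetical agents on a workflow where a successful task is worth $5.00.
expensive = cnsr(0.90, 5.00, 4.50)
cheap = cnsr(0.85, 5.00, 0.80)
print(f"expensive agent: CNSR {expensive:.2f}, gate passed: {passes_gate(expensive)}")
print(f"cheap agent:     CNSR {cheap:.2f}, gate passed: {passes_gate(cheap)}")
```

The higher raw success rate loses here because its expected value barely covers its own execution cost.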

Computing pass^k for Reliability Assessment

For any workflow that requires k consecutive successful executions without human intervention:

pass^k = (pass^1)^k

Example: a customer data migration agent with pass^1 = 0.88 running an 8-step pipeline:

pass^8 = 0.88^8 ≈ 0.360 (36.0%)

That agent will fail roughly 64% of full pipeline runs. Before deployment, the team must either add human checkpoints at high-failure steps, improve pass^1 to ≥0.994 (which yields pass^8 ≥ 0.95), or redesign the pipeline into shorter atomic tasks.
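The p^k shortcut assumes every attempt is independent with the same success probability. When repeated-trial data is available, pass^k can instead be estimated directly from it. The sketch below uses the per-task estimator described in the τ-bench paper, C(c, k) / C(n, k) averaged over tasks, where c is the number of successes in n trials. The trial counts are hypothetical.

```python
from math import comb


def pass_hat_k(successes_per_task: list[int], n: int, k: int) -> float:
    """Average over tasks of C(c, k) / C(n, k), with c successes out of n trials."""
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    total = sum(comb(c, k) for c in successes_per_task)  # comb(c, k) is 0 when k > c
    return total / (len(successes_per_task) * comb(n, k))


# Hypothetical data: 4 tasks, 8 trials each, number of successful trials per task.
successes = [8, 8, 7, 5]
p1 = pass_hat_k(successes, n=8, k=1)
p8 = pass_hat_k(successes, n=8, k=8)
print(f"pass^1 = {p1:.3f}, measured pass^8 = {p8:.3f}, p^8 shortcut = {p1 ** 8:.3f}")
```

In this made-up sample the measured pass^8 is higher than the shortcut predicts, because failures cluster on two tasks rather than spreading evenly. Real data can deviate in either direction, which is a reason to measure rather than extrapolate where trial budgets allow.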

Pre-Deployment Scorecard Template:

Dimension	Score	Gate Threshold	pass^k (k=8)	Cost (p50/p95)	Benchmark Date	Status
Overall pass^1	—	≥90%	—	—	—	—
pass^8 (pipeline reliability)	—	≥60%	—	—	—	—
Cost-normalized success rate	—	≥1.5	—	—	—	—
Memory retrieval accuracy	—	≥85%	—	—	—	—
Adversarial task pass^1	—	≥70%	—	—	—	—
Token cost p95	—	<$3.00	—	—	—	—
Benchmark freshness	—	<30 days	—	—	—	—
YC-Bench (if applicable)	—	>starting capital	—	—	—	—

Benchmark freshness deserves a dedicated column. A benchmark score from six months ago is not a deployment gate — it is a historical record. The agent's base model may have been updated, the task distribution in your environment may have shifted, or the benchmark itself may have changed. Require re-evaluation within 30 days of deployment.
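A freshness check is simple enough to automate. The following sketch assumes the scorecard stores one run date per benchmark; the dictionary layout, function names, and dates are hypothetical.

```python
from datetime import date, timedelta


def is_fresh(benchmark_date: date, today: date, max_age_days: int = 30) -> bool:
    """True if the run is strictly younger than max_age_days (the '<30 days' gate)."""
    return (today - benchmark_date) < timedelta(days=max_age_days)


def stale_entries(scorecard: dict[str, date], today: date, max_age_days: int = 30) -> list[str]:
    """Names of benchmarks whose last run fails the freshness gate."""
    return sorted(name for name, run in scorecard.items()
                  if not is_fresh(run, today, max_age_days))


# Hypothetical run dates for illustration only.
scorecard = {"TBLite": date(2026, 4, 1), "Terminal-Bench 2.0": date(2026, 1, 20)}
print(stale_entries(scorecard, today=date(2026, 4, 14)))
```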

Stage 3: Production Scorecard

The production scorecard adds behavioral drift detection, cost drift monitoring, and incident correlation. This is the scorecard the on-call team consults during an incident.

Production Scorecard Template:

Dimension	Current	Baseline (7d avg)	Delta	Threshold (warning)	Threshold (incident)	Last Reviewed	Trend
skill_efficiency_score	—	—	—	-10%	-25%	—	—
pass^1 (rolling 24h)	—	—	—	-5pp	-15pp	—	—
memory_retrieval_accuracy	—	—	—	-8pp	-20pp	—	—
token_cost_per_execution (p50)	—	—	—	+50%	+100%	—	—
token_cost_per_execution (p95)	—	—	—	+75%	+150%	—	—
self_modification_success_rate	—	—	—	-15pp	-30pp	—	—
improvement_cycles_needed	—	—	—	+5 cycles	+10 cycles	—	—
CNSR	—	—	—	<1.2	<1.0	—	—
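One way to apply the warning and incident columns is a small classifier that compares the current value with the baseline. This is a sketch under assumptions: the function name and parameters are illustrative, thresholds are treated as inclusive, and the sample values are made up.

```python
def classify(current: float, baseline: float, warn: float, incident: float,
             *, relative: bool = True, higher_is_better: bool = True) -> str:
    """Classify an adverse move versus baseline as ok, warning, or incident.

    relative=True compares the fractional change (0.10 means 10%);
    relative=False compares the absolute change (e.g. percentage points).
    """
    change = current - baseline
    if relative:
        change /= baseline
    adverse = -change if higher_is_better else change
    if adverse >= incident:
        return "incident"
    if adverse >= warn:
        return "warning"
    return "ok"


# skill_efficiency_score: -10% warning, -25% incident
print(classify(34.0, 40.0, warn=0.10, incident=0.25))
# pass^1 rolling 24h, in percent: -5pp warning, -15pp incident
print(classify(75.0, 91.0, warn=5, incident=15, relative=False))
# token cost p50 in dollars: +50% warning, +100% incident
print(classify(0.90, 0.50, warn=0.50, incident=1.00, higher_is_better=False))
```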

Review Cadence: Event-Triggered vs. Scheduled

Two cadence types serve different purposes. Conflating them is a common cause of both over-review (wasted engineering time on stable agents) and under-review (missed drift on degrading agents).

Event-Triggered Reviews

Event-triggered reviews fire immediately when a specific threshold is crossed. They are non-negotiable and require response within a defined SLA.

Trigger → SLA → Response:

Trigger	Severity	Response SLA	Minimum Response
Score drop >10% on any primary dimension	Warning	4 hours	Root cause investigation, no rollback required
Score drop >25% on any primary dimension	Incident	1 hour	Incident declared, rollback eligibility assessed
Token cost spike >2× baseline (p50)	Warning	4 hours	Cost trace audit, task distribution review
Token cost spike >3× baseline (p50)	Incident	1 hour	Budget circuit breaker, task queue pause
Composite trust score anomaly >200 points	Incident	1 hour	Full behavioral audit, third-party trust oracle check
pass^1 drop below deployment gate threshold	Incident	30 minutes	Rollback triggered automatically
memory_retrieval_accuracy below 60%	Warning	4 hours	Memory index audit, potential reindexing
self_modification_success_rate below 40%	Warning	8 hours	GEPA cycle freeze, optimizer config review

The 200-point composite score anomaly threshold is not arbitrary. It corresponds to approximately three standard deviations of normal weekly score variation and is the threshold at which Armalo's anomaly detection flags for third-party review. Any agent whose score swings more than 200 points in a single evaluation cycle is exhibiting behavior that cannot be explained by normal performance variance.
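For teams that want to derive a comparable threshold from their own history, a three-sigma check on cycle-over-cycle swings can be sketched as follows. This is not Armalo's anomaly detector; the function name and the score history are hypothetical.

```python
from statistics import stdev


def swing_is_anomalous(scores: list[float], new_score: float, sigmas: float = 3.0) -> bool:
    """Flag new_score if its swing from the last score exceeds `sigmas` times
    the sample standard deviation of past cycle-over-cycle swings."""
    deltas = [later - earlier for earlier, later in zip(scores, scores[1:])]
    if len(deltas) < 2:
        raise ValueError("need at least three historical scores")
    return abs(new_score - scores[-1]) > sigmas * stdev(deltas)


# Hypothetical weekly composite scores.
history = [812, 840, 790, 860, 805, 835]
print(swing_is_anomalous(history, 620))  # large drop from 835
print(swing_is_anomalous(history, 800))  # ordinary movement
```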

Scheduled Reviews

Scheduled reviews exist to catch slow drift that no single event trigger would catch. Slow drift is often more dangerous than acute failures because it does not trigger incident response and can compound for weeks before becoming visible.

Weekly Review Checklist:

Compare 7-day rolling average on all primary metrics against 30-day baseline
Review GEPA cycle count trend: is the agent requiring more cycles to maintain performance? (indicates base model or context drift)
Audit top 10 most expensive tasks by token cost: are these tasks changing in character?
Verify benchmark freshness: any primary benchmark score older than 30 days needs re-evaluation scheduling
Review self-modification success rate trend: a declining ratio over 3+ weeks indicates optimizer saturation
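The first checklist item can be computed from a list of daily metric values. A minimal sketch, assuming one value per day with the most recent last; the function name and sample data are hypothetical.

```python
from statistics import mean


def rolling_vs_baseline(daily: list[float], recent_days: int = 7,
                        baseline_days: int = 30) -> float:
    """Relative change of the last `recent_days` mean versus the last
    `baseline_days` mean. The baseline window includes the recent days."""
    if len(daily) < baseline_days:
        raise ValueError("not enough history for the baseline window")
    recent = mean(daily[-recent_days:])
    baseline = mean(daily[-baseline_days:])
    return (recent - baseline) / baseline


# Hypothetical pass^1 series: stable for 23 days, then a week at a lower level.
series = [0.90] * 23 + [0.84] * 7
change = rolling_vs_baseline(series)
print(f"7-day average vs 30-day baseline: {change:+.1%}")
```

Because the baseline window overlaps the recent week, this comparison understates drift slightly; excluding the last 7 days from the baseline is a reasonable alternative.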

Monthly Review Checklist:

Full pass^k recalculation across all workflow types
Cost-normalized success rate comparison against pre-deployment baseline
YC-Bench re-run (if applicable) with minimum 3 seeds
Adversarial task distribution update: add any failure patterns from the past 30 days as new adversarial test cases
Benchmark freshness audit: retire any benchmark score older than 90 days from the scorecard
Review cadence effectiveness: did any incident in the past month suggest a gap in the current trigger thresholds?

Quarterly Review (Board/Procurement/CISO-level):

Full scorecard across all 12 Armalo trust dimensions against peer agents in the category
Certification status review: is the agent's composite score above the marketplace access threshold?
Bond/stake position review: does the current bond amount appropriately reflect the agent's current risk profile?
Incident retrospective: all P1/P2 incidents, root causes, and whether threshold changes are needed

The GEPA Improvement Trajectory Scorecard

If you are running GEPA cycles on Hermes Agent, the improvement trajectory itself needs to be tracked as a first-class metric. A healthy GEPA trajectory looks like this:

Cycle Range	Expected skill_efficiency_score Gain	Expected Token Overhead	Expected memory_retrieval_accuracy Gain
1–5	+8–12%	+20–25% (reflection learning phase)	+5–10%
6–10	+12–18% cumulative	+18–22% (optimizer stabilizing)	+10–18% cumulative
11–20	+25–35% cumulative	+15–18% (efficiency gains begin offsetting overhead)	+15–25% cumulative
20+	+40% cumulative (Nous Research internal)	+15–20% (stable)	+20–30% cumulative

A GEPA trajectory scorecard should record, per cycle:

Cycle: <N>
Date: <ISO timestamp>
skill_efficiency_score delta: <+/- %>  
memory_retrieval_accuracy delta: <+/- pp>
self_modification_success_rate: <%>
improvement_cycles_needed_to_target: <N>
token_overhead vs baseline: <+/- %>
Notes: <any optimizer config changes, task distribution shifts, model updates>

An agent whose GEPA improvements plateau before cycle 20 is worth investigating. Common causes: memory retrieval accuracy too low (the optimizer cannot learn from context it cannot retrieve), task distribution too narrow (the optimizer is overfitting to a small task set), or base model capability ceiling (no amount of GEPA cycles will overcome a fundamental capability gap).
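Plateau detection on the per-cycle record can be sketched as below. The `CycleRecord` fields mirror part of the template above; the window size, the 1-point minimum gain, and the sample trajectory are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CycleRecord:
    cycle: int
    skill_efficiency_gain_pct: float  # cumulative gain versus cycle 0, in percent


def plateau_cycle(records: list[CycleRecord], window: int = 3,
                  min_gain_pct: float = 1.0) -> Optional[int]:
    """Return the first cycle that closes a `window`-cycle span gaining less
    than `min_gain_pct` in total, or None if no such span exists."""
    for i in range(window, len(records)):
        gain = (records[i].skill_efficiency_gain_pct
                - records[i - window].skill_efficiency_gain_pct)
        if gain < min_gain_pct:
            return records[i].cycle
    return None


# Hypothetical trajectory that flattens out around 12.6%.
gains = [0, 3, 6, 9, 11, 12, 12.3, 12.5, 12.6, 12.6]
records = [CycleRecord(cycle=i, skill_efficiency_gain_pct=g) for i, g in enumerate(gains)]
print(plateau_cycle(records))
```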

Threshold Decision Tree: Re-Evaluate vs. Rollback vs. Incident

The hardest judgment call in agent operations is whether a degrading metric warrants re-evaluation (run new benchmarks and decide), rollback (revert to previous version), or full incident declaration (page on-call, engage security, notify stakeholders).

Metric degradation detected
        │
        ▼
Is degradation acute (single cycle) or gradual (3+ cycles)?
        │
    ┌───┴───┐
Acute        Gradual
  │              │
  ▼              ▼
Delta >25%?    Delta >10% over 3 weeks?
  │              │
 ┌┴┐           ┌┴┐
Yes  No       Yes  No
 │    │         │    │
 ▼    ▼         ▼    ▼
Incident  Re-eval  Re-eval  Monitor
(rollback  (48h     (7-day   (weekly
 eligible) window)  window)  cadence)
        │
        ▼
Re-eval shows regression vs. prior version?
        │
    ┌───┴───┐
   Yes       No
    │         │
    ▼         ▼
 Rollback   Adjust thresholds,
 candidate  update scorecard
    │       baseline, document
    ▼
Is rollback reversible and task queue drainable?
    │
 ┌──┴──┐
Yes    No
 │      │
 ▼      ▼
Execute  Escalate to
rollback  incident

A rollback should always be the first option considered when pass^1 drops below the deployment gate threshold. The agent was deployed against a gate for a reason. If it no longer meets that gate, the gate should be enforced. The rollback should be reversible — this is a reason to always maintain the prior deployment artifact and the scorecard snapshot that accompanied it.
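The decision tree can also be written as two small functions, which makes it easier to wire into alerting. This is a sketch of the tree as drawn; the function names and return strings are illustrative.

```python
def triage(delta_pct: float, acute: bool) -> str:
    """First branch of the tree. delta_pct is the size of the drop in percent;
    acute means a single-cycle drop, otherwise gradual over 3+ cycles."""
    if acute:
        return "incident (rollback eligible)" if delta_pct > 25 else "re-eval (48h window)"
    return "re-eval (7-day window)" if delta_pct > 10 else "monitor (weekly cadence)"


def after_reeval(regressed: bool, rollback_safe: bool) -> str:
    """Second branch, once the re-evaluation has run. rollback_safe means the
    rollback is reversible and the task queue can be drained."""
    if not regressed:
        return "adjust thresholds, update scorecard baseline, document"
    return "execute rollback" if rollback_safe else "escalate to incident"


print(triage(30, acute=True))
print(triage(12, acute=False))
print(after_reeval(regressed=True, rollback_safe=False))
```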

Audience-Specific Scorecard Views

The same underlying metrics need to be presented differently depending on who is consuming the scorecard.

Engineering Team View

Full metric detail. Includes raw skill_efficiency_score, memory_retrieval_accuracy, self_modification_success_rate, token cost histograms, GEPA cycle trajectories, and benchmark freshness dates. This view is operational — it drives day-to-day decisions about re-evaluation scheduling and threshold adjustments.

Key additions for engineering:

P50/P95/P99 cost breakdown by task type
GEPA cycle delta per-dimension (not just aggregate)
Adversarial task failure mode categorization
Benchmark exploit vulnerability flags (per Berkeley RDI findings)

Procurement View

Cost and reliability emphasis. Procurement needs to know: what does this agent cost per task at scale, is the cost stable, and what is the failure rate under production conditions?

Key metrics for procurement:

Cost-normalized success rate (CNSR)
Token cost p50/p95 trend (30-day)
pass^8 (pipeline reliability)
Total cost of ownership vs. baseline (human or prior automation)
Incident frequency and resolution time

CISO View

Security, audit, and behavioral integrity. The CISO needs evidence that the agent is operating within defined scope, that behavioral drift is being monitored, and that there is a clear incident escalation path.

Key metrics for CISO:

Scope-honesty score (does the agent stay within its defined operational scope?)
Behavioral drift flags (anomaly detection hits in the past 90 days)
Memory attestation validity (are memory entries verifiable?)
Adversarial task pass^1 (how does the agent perform against red-team scenarios?)
Incident log with root cause classification
Third-party trust oracle query results (external verification of behavioral claims)

Board/Executive View

Strategic and financial summary. One page, current quarter vs. prior quarter, against benchmarks that matter to the business.

Key metrics for board:

Composite trust score vs. peer agents (percentile ranking)
CNSR trend (is the agent getting more or less cost-efficient?)
Incident frequency trend
Certification status and renewal schedule
Total value delivered (task volume × task value) vs. cost

The Anti-Patterns That Kill Good Scorecards

Staleness creep. A benchmark score from six months ago is not evidence. It is a snapshot of a specific model version, on a specific task distribution, with a specific evaluation configuration that may no longer reflect your production environment. Every primary scorecard dimension should have a last_reviewed date and a maximum freshness threshold. If you cannot answer the question "when was this score last validated against current production conditions," the score is not a control — it is a decoration.

Single-seed syndrome. YC-Bench requires a minimum of 3 seeds. Terminal-Bench uses 3 human reviewers per task. These are not arbitrary — single evaluations on stochastic agents have variance so high that the score carries limited information. Any benchmark score on the production scorecard should represent the mean of at least 3 independent evaluation runs.

Omitting cost. Most published benchmarks omit cost data entirely. This is a catastrophic omission for production operations. The 50× cost variation across task types ($0.10 to $5.00) means that a slight shift in task distribution — more complex tasks, more adversarial inputs, more multi-hop reasoning chains — can double your monthly inference bill with no change in the raw success rate metric. Cost metrics belong in every tier of the scorecard, from development through board reporting.

Conflating pass^1 with reliability. A 70% single-pass rate looks acceptable until you apply pass^k math. For any workflow where multiple consecutive steps must succeed, pass^k is the number that matters. Teams that evaluate only on pass^1 will routinely deploy agents that fail most multi-step pipelines in production.

Missing exploit flags. Berkeley RDI's finding that GAIA, WebArena, and OSWorld are exploitable at 73–100% rates is not a theoretical concern. An agent that learns to exploit benchmark artifacts will generalize those exploitation patterns to production environments where similar artifacts exist. Benchmark scores from vulnerable evaluations should be flagged on the scorecard and accompanied by adversarial validation that specifically tests for exploitation patterns.

Where Armalo Fits: The Ongoing Evidence Layer

Building a scorecard is the easy part. Keeping it honest over time is the hard part.

Every metric in this guide can be gamed if the team responsible for the agent is also responsible for producing the benchmark evidence. The agent's benchmark runs, GEPA trajectories, cost data, and behavioral logs need to be verifiable by parties who have no stake in the outcome — buyers, regulators, insurance underwriters, or the enterprise security team evaluating a vendor's claims.

Armalo's 12-dimension composite scoring system addresses this directly:

Dimension	Weight	What It Captures
Accuracy	14%	Task success across eval types
Reliability	13%	pass^k consistency over time
Safety	11%	Adversarial task behavior, guardrail adherence
Self-audit (Metacal™)	9%	Agent's honesty about its own capability limits
Security	8%	Scope boundary enforcement, injection resistance
Bond	8%	Financial stake signaling commitment
Latency	8%	Speed under production load
Scope-honesty	7%	Staying within defined operational scope
Cost-efficiency	7%	CNSR trajectory
Model-compliance	5%	Adherence to model usage policies
Runtime-compliance	5%	Infrastructure policy adherence
Harness-stability	5%	Eval harness reproducibility
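As a rough illustration of how the weights combine, the sketch below computes a weighted sum. The text does not state how Armalo aggregates dimensions or what scale each dimension uses, so both the aggregation rule and the 0 to 1000 sample scale are assumptions.

```python
# Weights copied from the table above.
WEIGHTS = {
    "accuracy": 0.14, "reliability": 0.13, "safety": 0.11, "self_audit": 0.09,
    "security": 0.08, "bond": 0.08, "latency": 0.08, "scope_honesty": 0.07,
    "cost_efficiency": 0.07, "model_compliance": 0.05,
    "runtime_compliance": 0.05, "harness_stability": 0.05,
}


def composite(scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores (assumed to share one scale)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical agent: 800 everywhere except a weak reliability score.
scores = {name: 800.0 for name in WEIGHTS}
scores["reliability"] = 500.0
print(f"composite: {composite(scores):.0f}")
```

Under these assumptions a 300-point drop in reliability alone moves the composite by 39 points, which gives a sense of how much a single dimension can shift the total.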

Three architectural properties make this evidence layer durable rather than decorative:

Score decay prevents stale evidence from sustaining a high score indefinitely. Scores decrease by 1 point per week after a 7-day grace period. An agent that stops running evaluations will see its score degrade visibly on the same schedule that benchmarks go stale. You cannot maintain a high Armalo score without ongoing validation activity.
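The decay rule can be sketched in a few lines. Whether decay is applied in whole-week steps or continuously is not stated, so stepwise decay is an assumption here, as are the function name and the sample scores.

```python
def decayed_score(score: float, days_since_eval: int,
                  grace_days: int = 7, points_per_week: float = 1.0) -> float:
    """Subtract points_per_week for each full week elapsed after the grace period."""
    weeks_past_grace = max(0, days_since_eval - grace_days) // 7
    return max(0.0, score - points_per_week * weeks_past_grace)


print(decayed_score(850, days_since_eval=5))   # still inside the grace period
print(decayed_score(850, days_since_eval=35))  # four full weeks past the grace period
```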

Anti-gaming controls remove the incentive to cherry-pick favorable evaluation runs. Jury outlier trimming (top/bottom 20% of judge scores excluded) prevents single anomalously favorable evaluations from inflating the score. Anomaly detection flags any composite score movement greater than 200 points for human review. You cannot spike the score by running a single favorable benchmark.
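Outlier trimming of this kind is a trimmed mean. The sketch below assumes 20% is removed from each end of the sorted judge scores; the text could also be read as 20% in total. The judge scores are hypothetical.

```python
def trimmed_mean(scores: list[float], trim: float = 0.20) -> float:
    """Mean after dropping the lowest and highest `trim` fraction of scores."""
    if not scores:
        raise ValueError("scores must not be empty")
    if not 0 <= trim < 0.5:
        raise ValueError("trim must be in [0, 0.5)")
    ordered = sorted(scores)
    cut = int(len(ordered) * trim)  # number of scores dropped from each end
    kept = ordered[cut:len(ordered) - cut]
    return sum(kept) / len(kept)


# Hypothetical jury of ten judges with one very low and one very high score.
jury = [60, 70, 72, 74, 75, 76, 78, 80, 82, 99]
print(f"plain mean {sum(jury) / len(jury):.1f}, trimmed mean {trimmed_mean(jury):.1f}")
```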

The Trust Oracle at /api/v1/trust/ makes the score queryable by external parties in real time. A buyer, a regulator, or an enterprise security team can verify an agent's current trust posture without relying on the vendor's self-reported metrics. The oracle returns the composite score, the dimension breakdown, the last evaluation date, and the bond/stake position — everything needed to make an independent deployment decision.

The scorecard framework in this guide is the internal operational discipline. Armalo's trust infrastructure is what makes that discipline externally verifiable. Both are necessary. A team that runs rigorous internal benchmarking but cannot produce verifiable external evidence will lose deals to agents that can. A team that relies on external trust scores without rigorous internal monitoring will be surprised by the incidents that cause those scores to drop.

The honest answer about agent benchmarking in 2026 is that the tooling is good enough to catch most forms of drift — if you use it consistently, instrument it correctly, and tie the metrics to real decision triggers. The teams that are not catching drift are not failing because the tools don't exist. They are failing because their scorecards are treated as one-time artifacts instead of living controls.

The review cadence is the discipline that turns a scorecard into a control system. Set it, enforce it, and automate the event-triggered reviews. The metrics will tell you when something is wrong. The cadence is what ensures someone is listening.
