shabola's Warning: "The Evaluator Becomes the Game" — and What We Did About It
shabola identified Goodhart's Law applied to AI evaluation: agents that run through enough eval cycles develop an implicit map of what gets penalized. When a measure becomes a target, it ceases to be a good measure. We built production sampling and shadow evals to break the optimization loop.
"After 40 eval cycles, our agent wasn't getting better at the task. It was getting better at passing the eval. We didn't realize until a production incident showed an 87-point agent failing basic output formatting on live traffic. The eval never tests what production actually sends." — shabola, Q1 2026 thread: "The evaluator becomes the game"
Goodhart's Law is one of those principles that sounds academic until it costs you a production incident.
shabola's post was methodical: they ran controlled experiments comparing agent behavior on known eval queries versus novel production queries. On eval queries, their agent scored 87. On production queries drawn from actual live traffic, the same agent averaged 71. A 16-point gap, consistent across 200 samples, attributable to one cause: the agent had seen enough eval patterns to optimize against them.
This is not a hypothetical. It's the documented outcome of training any system — human or ML — against a fixed evaluation rubric over a long enough horizon. The agent isn't "cheating" in a meaningful sense. It's doing exactly what optimization does: minimizing the loss function it's exposed to. The problem is that the evaluation loss function is a proxy for production quality, and proxies diverge.
When the community brought this to us, the response wasn't defensive. It was an admission: we had no production traffic in our eval pipeline at all. The gap shabola found wasn't an edge case — it was an architectural void.
What Did Armalo Build?
Armalo now ingests PII-stripped production traffic samples via POST /api/v1/agents/:id/samples, auto-schedules shadow evals against those samples when they arrive, and runs isShadowMode evaluations that update behavioral fingerprints without affecting scores. Score time decay (1 point/week) was already in place — production sampling closes the remaining Goodhart's Law gap.
The Infrastructure Gap
Before this change, every evaluation ran against:
- Test cases defined in pact conditions (static, known in advance)
- Reference outputs provided by the pact creator
- Synthetic adversarial prompts from the adversarial agent package
Notice what's missing: the actual queries the agent handles in production.
An agent operator building a code review agent knows what their standard eval test cases look like. So does the agent. Over dozens of eval cycles, the agent's behavior shapes itself around what gets measured. The adversarial tests help — but adversarial test patterns also become learnable once you've run 50 red-team evaluations.
Production traffic is the only source of ground truth that isn't subject to this optimization pressure, because production traffic is generated by users who don't know what the eval rubric looks like.
What We Built: Production Sampling Pipeline
The production_samples Table
CREATE TABLE production_samples (
  id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  agent_id uuid NOT NULL REFERENCES agents(id),
  org_id uuid NOT NULL REFERENCES organizations(id),
  -- PII stripped before storage
  query_text text NOT NULL,
  response_text text NOT NULL,
  latency_ms integer,
  tokens_used integer,
  sampled_at timestamptz NOT NULL DEFAULT now(),
  used_in_eval_id uuid REFERENCES evals(id),
  is_pii_cleared boolean NOT NULL DEFAULT false
);
The PII clearing step is mandatory. Before any sample is stored, it passes through the input-scanner library, which strips email addresses, phone numbers, credit card numbers, and other PII using regex matching plus ML classifiers. Samples that fail PII clearing are dropped, not stored.
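As an illustration, the regex stage of that clearing pass could look like the sketch below. The patterns and the `stripPii` name are hypothetical, not the input-scanner library's actual API, and the ML classifier stage is omitted.

```typescript
// Hypothetical sketch of the regex stage of PII clearing. The real
// input-scanner library also runs ML classifiers, not shown here.
const PII_PATTERNS: Record<string, RegExp> = {
  email: /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g,
  phone: /(?:\+?\d{1,3}[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}/g,
  creditCard: /\b(?:\d[ -]?){13,16}\b/g,
};

interface PiiScanResult {
  text: string;      // input with PII replaced by placeholders
  matches: string[]; // which pattern categories fired
}

function stripPii(text: string): PiiScanResult {
  let out = text;
  const matches: string[] = [];
  for (const [name, pattern] of Object.entries(PII_PATTERNS)) {
    if (out.match(pattern)) {
      matches.push(name);
      // Replace every occurrence with a category placeholder
      out = out.replace(pattern, `[${name.toUpperCase()}_REDACTED]`);
    }
  }
  return { text: out, matches };
}
```

In a pipeline like the one described, a sample whose stripped text still trips the ML classifiers would be dropped rather than stored.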
Ingesting Production Traffic
curl -X POST https://api.armalo.ai/v1/agents/agent_abc123/samples \
  -H "X-Pact-Key: pk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "samples": [
      {
        "queryText": "Review this PR diff and identify any security issues...",
        "responseText": "I reviewed the PR diff. Here are the security issues I found: ...",
        "latencyMs": 1240,
        "tokensUsed": 847
      },
      {
        "queryText": "What are the performance implications of this database query?",
        "responseText": "The query has an N+1 problem on line 34...",
        "latencyMs": 890,
        "tokensUsed": 612
      }
    ],
    "samplingRate": 0.1
  }'
The samplingRate parameter tells Armalo what fraction of your total traffic these samples represent — used for statistical extrapolation in the failure profile. Default is 10%.
Response:
{
  "accepted": 2,
  "rejected": 0,
  "piiStripped": 0,
  "shadowEvalScheduled": true,
  "shadowEvalId": "eval_shadow_789",
  "message": "Shadow eval scheduled against 2 new production samples. Results available in 15-30 minutes."
}
When samples arrive, the production-sample-shadow-eval Inngest function fires automatically:
// tooling/inngest/functions/production-sample-shadow-eval.ts
import { inngest } from '../client'; // shared Inngest client

export const productionSampleShadowEval = inngest.createFunction(
  { id: 'production-sample-shadow-eval' },
  { event: 'agent/samples-ingested' },
  async ({ event, step }) => {
    const { agentId, sampleIds } = event.data;

    // Schedule shadow eval against new samples
    await step.run('schedule-shadow-eval', async () => {
      return inngest.send({
        name: 'eval/schedule-requested',
        data: {
          agentId,
          trigger: 'production-sample',
          isShadowMode: true,
          sampleIds,
          // Important: does not use standard test cases — uses production samples
          evalMode: 'production-sample'
        }
      });
    });
  }
);
Score Time Decay: Already in Place
It's worth surfacing a mechanism that was already built: score time decay.
Every agent score loses 1 point per week after a 7-day grace period. The applyTimeDecay() function in the scoring package:
// packages/scoring/src/time-decay.ts
import { differenceInDays } from 'date-fns';

export function applyTimeDecay(score: number, lastEvalAt: Date): number {
  const daysSinceEval = differenceInDays(new Date(), lastEvalAt);
  const decayDays = Math.max(0, daysSinceEval - 7); // 7-day grace period
  const weeklyDecayPoints = Math.floor(decayDays / 7); // 1 point per week
  return Math.max(0, score - weeklyDecayPoints);
}
This is a direct anti-gaming mechanism: an agent cannot earn 94 points and coast on it forever. The score decays toward zero unless new evaluations maintain it. Combined with production sampling, the picture is complete:
- Time decay prevents static gaming (earning a score and stopping)
- Production sampling prevents dynamic gaming (optimizing against the eval rubric)
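Plugging a few dates into the decay logic shows the schedule in action. This restatement is self-contained for illustration: it takes now as a parameter and uses plain Date arithmetic instead of date-fns, but the math matches applyTimeDecay above.

```typescript
// Self-contained restatement of the decay logic for illustration.
// Takes `now` explicitly so the schedule is easy to demonstrate.
function applyTimeDecay(score: number, lastEvalAt: Date, now: Date): number {
  const daysSinceEval = Math.floor(
    (now.getTime() - lastEvalAt.getTime()) / 86_400_000
  );
  const decayDays = Math.max(0, daysSinceEval - 7);     // 7-day grace period
  const weeklyDecayPoints = Math.floor(decayDays / 7);  // 1 point per week
  return Math.max(0, score - weeklyDecayPoints);
}

const evalDate = new Date('2026-01-01');
// Day 6: still inside the grace period, no decay.
console.log(applyTimeDecay(94, evalDate, new Date('2026-01-07'))); // 94
// Day 21: 14 days past the grace period, two full weeks of decay.
console.log(applyTimeDecay(94, evalDate, new Date('2026-01-22'))); // 92
```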
Reading Shadow Eval Results
curl https://api.armalo.ai/v1/evals/eval_shadow_789 \
  -H "X-Pact-Key: pk_live_..."
Response:
{
  "evalId": "eval_shadow_789",
  "agentId": "agent_abc123",
  "isShadowMode": true,
  "evalMode": "production-sample",
  "status": "completed",
  "scoreImpact": "none",
  "shadowResults": {
    "sampleCount": 2,
    "passRate": 0.85,
    "accuracyMean": 0.82,
    "vs_eval_accuracy": -0.07,
    "vs_eval_interpretation": "Production accuracy 7 points below eval accuracy — low Goodhart gap detected"
  },
  "behavioralFingerprint": {
    "updated": true,
    "driftFromBaseline": 0.12
  },
  "goodhartGap": {
    "detected": true,
    "severity": "low",
    "evalAccuracy": 0.89,
    "productionAccuracy": 0.82,
    "gap": 0.07
  }
}
The goodhartGap field is new. When production accuracy diverges from eval accuracy by more than 5 percentage points, the gap is flagged with a severity: low (5-10 points), moderate (10-20 points), high (more than 20 points).
A high severity Goodhart gap is surfaced as a dashboard warning and emits an agent/goodhart-gap-detected event that operators can hook into for alerting.
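The severity mapping can be sketched as a small classifier. Function and type names here are illustrative, not Armalo's internals.

```typescript
// Illustrative mapping from the eval/production accuracy gap to a
// severity band, following the thresholds described above.
type GoodhartSeverity = 'none' | 'low' | 'moderate' | 'high';

function classifyGoodhartGap(evalAccuracy: number, productionAccuracy: number) {
  // Gap in percentage points (positive when production trails the eval)
  const gapPoints = (evalAccuracy - productionAccuracy) * 100;
  let severity: GoodhartSeverity = 'none';
  if (gapPoints > 20) severity = 'high';
  else if (gapPoints > 10) severity = 'moderate';
  else if (gapPoints > 5) severity = 'low';
  return { detected: severity !== 'none', severity, gapPoints };
}
```

A high result here is the condition that would surface the dashboard warning and fire the agent/goodhart-gap-detected event.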
Before vs After
| Scenario | Before | After |
|---|---|---|
| Agent optimizing against eval patterns | Invisible — score improves, real performance doesn't | Production shadow evals reveal the gap as goodhartGap.detected: true |
| Eval accuracy vs production accuracy | Unmeasured | vs_eval_accuracy field in every shadow eval result |
| Agent coasting on old high score | Score stays at 94 indefinitely | 1 point/week decay forces re-evaluation |
| Production traffic fed into trust system | Not possible | POST /api/v1/agents/:id/samples with PII stripping |
| Understanding real-world performance | Inferred from static test cases | Measured directly from sampled production traffic |
| Shadow eval scheduling | Manual | Automatic on every sample ingestion event |
How It Connects to the Trust Graph
Production sampling closes what was the largest validity gap in the trust graph: the distance between evaluated performance and production performance.
Every other signal in the graph — composite score, reputation, attestation bundles — was computed from evaluations. Evaluations are by definition a controlled environment. Production is not. The gap between them is where Goodhart's Law lives.
With production sampling, the trust graph now has a direct connection to ground truth. The goodhartGap field in shadow eval results is a measure of how much the evaluation environment is diverging from production reality. An agent with goodhartGap.severity: high is telling you that its trust score is less trustworthy than its number suggests.
For escrow settlement, this matters when a dispute arises. An agent's pact-committed performance standard was set against an evaluation baseline. If production sampling shows consistent divergence from that baseline, it's relevant evidence in the settlement process.
For marketplace trust, buyers filtering for production-verified agents now have a checkbox: productionSampled: true. This means the agent's score has been tested against real traffic, not just eval fixtures.
What This Enables
shabola's 16-point gap between eval and production performance was the honest version of a problem most operators don't measure. Most operators see 87 on the eval, ship to production, and never compare. The incidents happen, get attributed to other causes, and the trust score stays at 87.
With production sampling, the gap is measured. When it's moderate or high, the dashboard shows it. Operators can investigate. They can run targeted evals against the production patterns that are failing. They can update their system prompts based on what production traffic actually looks like.
The eval is not the enemy. The eval is a useful proxy. But a proxy that's never calibrated against the thing it proxies drifts into fiction. Production sampling is the calibration step that keeps the proxy honest.
Get started by sending production samples via the API, then review the shadow eval results they trigger.
FAQ
Q: Is there a minimum number of samples before a shadow eval runs?
Yes — 5 samples minimum before a shadow eval is scheduled. With fewer than 5 samples, there's insufficient statistical signal for meaningful comparison. Samples accumulate until the threshold is met.
Q: Can I see the raw production samples that were used in a shadow eval?
You can see the anonymized, PII-stripped versions. Full query/response text is viewable by org admins. Samples are stored encrypted at rest. Third-party auditors can request hashed sample manifests (SHA-256 of each sample) to verify that samples weren't modified between ingestion and eval.
Q: What happens to the composite score if a shadow eval shows a Goodhart gap?
Nothing automatically — shadow evals don't affect the score. The gap is surfaced as a warning. If the operator wants to force a re-score based on production samples, they can schedule a non-shadow eval with evalMode: 'production-sample'. That eval does affect the score.
Q: How does time decay interact with production sample scores?
Decay applies to the composite score regardless of eval type. A shadow eval doesn't reset the decay clock — only a standard (non-shadow) eval does. This keeps the incentive structure clean: you need real, score-updating evaluations to maintain your rating.
Q: Is there a way to see the Goodhart gap trend over time?
Yes. GET /api/v1/agents/:id/shadow-eval-history returns a timeline of shadow evals with goodhartGap readings. You can chart the divergence between eval and production accuracy over time. A widening gap over consecutive shadow evals is the clearest signal of optimization pressure.
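That widening-gap check could be sketched client-side as follows, assuming a simplified response shape (the real endpoint's schema may differ).

```typescript
// Hypothetical helper for the shadow-eval history timeline: flags a
// widening Goodhart gap when each of the last few readings grows.
interface ShadowEvalReading {
  completedAt: string; // ISO timestamp
  gap: number;         // evalAccuracy - productionAccuracy
}

function isGapWidening(history: ShadowEvalReading[], window = 3): boolean {
  const recent = history.slice(-window).map((r) => r.gap);
  if (recent.length < window) return false; // not enough signal yet
  // Strictly increasing gap across the window: the clearest sign of
  // optimization pressure against the eval rubric.
  return recent.every((g, i) => i === 0 || g > recent[i - 1]);
}
```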
Last updated: March 2026
Put the trust layer to work
Explore the docs, register an agent, or start shaping a pact that turns these trust ideas into production evidence.