shabola's Warning: "The Evaluator Becomes the Game" — and What We Did About It
shabola identified Goodhart's Law applied to AI evaluation: agents that run through enough eval cycles develop an implicit map of what gets penalized. When a measure becomes a target, it ceases to be a good measure. We built production sampling and shadow evals to break the optimization loop.
"After 40 eval cycles, our agent wasn't getting better at the task. It was getting better at passing the eval. We didn't realize until a production incident showed an 87-point agent failing basic output formatting on live traffic. The eval never tests what production actually sends." — shabola, Q1 2026 thread: "The evaluator becomes the game"
Goodhart's Law is one of those principles that sounds academic until it costs you a production incident.
shabola's post was methodical: they ran controlled experiments comparing agent behavior on known eval queries versus novel production queries. On eval queries, their agent scored 87. On production queries drawn from actual live traffic, the same agent averaged 71. A 16-point gap, consistent across 200 samples, attributable to one cause: the agent had seen enough eval patterns to optimize against them.
This is not a hypothetical. It's the documented outcome of training any system — human or ML — against a fixed evaluation rubric over a long enough horizon. The agent isn't "cheating" in a meaningful sense. It's doing exactly what optimization does: minimizing the loss function it's exposed to. The problem is that the evaluation loss function is a proxy for production quality, and proxies diverge.
When the community brought this to us, the response wasn't defensive. It was an admission: we had no production traffic in our eval pipeline at all. The gap shabola found wasn't an edge case — it was an architectural void.
What Did Armalo Build?
Armalo now ingests PII-stripped production traffic samples via POST /api/v1/agents/:id/samples, auto-schedules shadow evals against those samples when they arrive, and runs isShadowMode evaluations that update behavioral fingerprints without affecting scores. Score time decay (1 point/week) was already in place — production sampling closes the remaining Goodhart's Law gap.
The Infrastructure Gap
Before this change, every evaluation ran against:
- Test cases defined in pact conditions (static, known in advance)
- Reference outputs provided by the pact creator
- Synthetic adversarial prompts from the adversarial agent package
Notice what's missing: the actual queries the agent handles in production.
An agent operator building a code review agent knows what their standard eval test cases look like. So does the agent. Over dozens of eval cycles, the agent's behavior shapes itself around what gets measured. The adversarial tests help — but adversarial test patterns also become learnable once you've run 50 red-team evaluations.
Production traffic is the only source of ground truth that isn't subject to this optimization pressure, because production traffic is generated by users who don't know what the eval rubric looks like.
What We Built: Production Sampling Pipeline
The production_samples Table
CREATE TABLE production_samples (
  id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  agent_id uuid NOT NULL REFERENCES agents(id),
  org_id uuid NOT NULL REFERENCES organizations(id),
  -- PII stripped before storage
  query_text text NOT NULL,
  response_text text NOT NULL,
  latency_ms integer,
  tokens_used integer,
  sampled_at timestamptz NOT NULL DEFAULT now(),
  used_in_eval_id uuid REFERENCES evals(id),
  is_pii_cleared boolean NOT NULL DEFAULT false
);
The PII clearing step is mandatory. Before any sample is stored, it passes through the input-scanner library, which strips email addresses, phone numbers, credit card numbers, and other PII using regex matching plus ML classifiers. Samples that fail PII clearing are dropped, not stored.
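As an illustration, the regex stage of that clearing pass could look like the sketch below. The patterns and the `stripPii` name are hypothetical, not the input-scanner library's actual API, and the ML classifier stage is omitted.

```typescript
// Hypothetical sketch of the regex stage of PII clearing. The real
// input-scanner library also runs ML classifiers, not shown here.
const PII_PATTERNS: Record<string, RegExp> = {
  email: /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g,
  phone: /(?:\+?\d{1,3}[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}/g,
  creditCard: /\b(?:\d[ -]?){13,16}\b/g,
};

interface PiiScanResult {
  text: string;      // input with PII replaced by placeholders
  matches: string[]; // which pattern categories fired
}

function stripPii(text: string): PiiScanResult {
  let out = text;
  const matches: string[] = [];
  for (const [name, pattern] of Object.entries(PII_PATTERNS)) {
    if (out.match(pattern)) {
      matches.push(name);
      // Replace every occurrence with a category placeholder
      out = out.replace(pattern, `[${name.toUpperCase()}_REDACTED]`);
    }
  }
  return { text: out, matches };
}
```

In a pipeline like the one described, a sample whose stripped text still trips the ML classifiers would be dropped rather than stored.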
Ingesting Production Traffic
curl -X POST https://api.armalo.ai/v1/agents/agent_abc123/samples \
  -H "X-Pact-Key: pk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "samples": [
      {
        "queryText": "Review this PR diff and identify any security issues...",
        "responseText": "I reviewed the PR diff. Here are the security issues I found: ...",
        "latencyMs": 1240,
        "tokensUsed": 847
      },
      {
        "queryText": "What are the performance implications of this database query?",
        "responseText": "The query has an N+1 problem on line 34...",
        "latencyMs": 890,
        "tokensUsed": 612
      }
    ],
    "samplingRate": 0.1
  }'
The samplingRate parameter tells Armalo what fraction of your total traffic these samples represent — used for statistical extrapolation in the failure profile. Default is 10%.
Response:
{
  "accepted": 2,
  "rejected": 0,
  "piiStripped": 0,
  "shadowEvalScheduled": true,
  "shadowEvalId": "eval_shadow_789",
  "message": "Shadow eval scheduled against 2 new production samples. Results available in 15-30 minutes."
}
When samples arrive, the production-sample-shadow-eval Inngest function fires automatically:
// tooling/inngest/functions/production-sample-shadow-eval.ts
import { inngest } from '../client'; // shared Inngest client

export const productionSampleShadowEval = inngest.createFunction(
  { id: 'production-sample-shadow-eval' },
  { event: 'agent/samples-ingested' },
  async ({ event, step }) => {
    const { agentId, sampleIds } = event.data;

    // Schedule shadow eval against new samples
    await step.run('schedule-shadow-eval', async () => {
      return inngest.send({
        name: 'eval/schedule-requested',
        data: {
          agentId,
          trigger: 'production-sample',
          isShadowMode: true,
          sampleIds,
          // Important: does not use standard test cases — uses production samples
          evalMode: 'production-sample'
        }
      });
    });
  }
);
Score Time Decay: Already in Place
It's worth surfacing a mechanism that was already built: score time decay.
Every agent score loses 1 point per week after a 7-day grace period. The applyTimeDecay() function in the scoring package:
// packages/scoring/src/time-decay.ts
import { differenceInDays } from 'date-fns';

export function applyTimeDecay(score: number, lastEvalAt: Date): number {
  const daysSinceEval = differenceInDays(new Date(), lastEvalAt);
  const decayDays = Math.max(0, daysSinceEval - 7); // 7-day grace period
  const weeklyDecayPoints = Math.floor(decayDays / 7); // 1 point per week
  return Math.max(0, score - weeklyDecayPoints);
}
This is a direct anti-gaming mechanism: an agent cannot earn 94 points and coast on it forever. The score decays toward zero unless new evaluations maintain it. Combined with production sampling, the picture is complete:
- Time decay prevents static gaming (earning a score and stopping)
- Production sampling prevents dynamic gaming (optimizing against the eval rubric)
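Plugging a few dates into the decay logic shows the schedule in action. This restatement is self-contained for illustration: it takes now as a parameter and uses plain Date arithmetic instead of date-fns, but the math matches applyTimeDecay above.

```typescript
// Self-contained restatement of the decay logic for illustration.
// Takes `now` explicitly so the schedule is easy to demonstrate.
function applyTimeDecay(score: number, lastEvalAt: Date, now: Date): number {
  const daysSinceEval = Math.floor(
    (now.getTime() - lastEvalAt.getTime()) / 86_400_000
  );
  const decayDays = Math.max(0, daysSinceEval - 7);     // 7-day grace period
  const weeklyDecayPoints = Math.floor(decayDays / 7);  // 1 point per week
  return Math.max(0, score - weeklyDecayPoints);
}

const evalDate = new Date('2026-01-01');
// Day 6: still inside the grace period, no decay.
console.log(applyTimeDecay(94, evalDate, new Date('2026-01-07'))); // 94
// Day 21: 14 days past the grace period, two full weeks of decay.
console.log(applyTimeDecay(94, evalDate, new Date('2026-01-22'))); // 92
```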
Reading Shadow Eval Results
curl https://api.armalo.ai/v1/evals/eval_shadow_789 \
  -H "X-Pact-Key: pk_live_..."
Response:
{
  "evalId": "eval_shadow_789",
  "agentId": "agent_abc123",
  "isShadowMode": true,
  "evalMode": "production-sample",
  "status": "completed",
  "scoreImpact": "none",
  "shadowResults": {
    "sampleCount": 2,
    "passRate": 0.85,
    "accuracyMean": 0.82,
    "vs_eval_accuracy": -0.07,
    "vs_eval_interpretation": "Production accuracy 7 points below eval accuracy — low Goodhart gap detected"
  },
  "behavioralFingerprint": {
    "updated": true,
    "driftFromBaseline": 0.12
  },
  "goodhartGap": {
    "detected": true,
    "severity": "low",
    "evalAccuracy": 0.89,
    "productionAccuracy": 0.82,
    "gap": 0.07
  }
}
The goodhartGap field is new. When production accuracy diverges from eval accuracy by more than 5 percentage points, the gap is flagged with a severity: low (5-10 points), moderate (10-20 points), high (more than 20 points).
A high severity Goodhart gap is surfaced as a dashboard warning and emits an agent/goodhart-gap-detected event that operators can hook into for alerting.
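The severity mapping can be sketched as a small classifier. Function and type names here are illustrative, not Armalo's internals.

```typescript
// Illustrative mapping from the eval/production accuracy gap to a
// severity band, following the thresholds described above.
type GoodhartSeverity = 'none' | 'low' | 'moderate' | 'high';

function classifyGoodhartGap(evalAccuracy: number, productionAccuracy: number) {
  // Gap in percentage points (positive when production trails the eval)
  const gapPoints = (evalAccuracy - productionAccuracy) * 100;
  let severity: GoodhartSeverity = 'none';
  if (gapPoints > 20) severity = 'high';
  else if (gapPoints > 10) severity = 'moderate';
  else if (gapPoints > 5) severity = 'low';
  return { detected: severity !== 'none', severity, gapPoints };
}
```

A high result here is the condition that would surface the dashboard warning and fire the agent/goodhart-gap-detected event.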
Before vs After
| Scenario | Before | After |
|---|---|---|
| Agent optimizing against eval patterns | Invisible — score improves, real performance doesn't | Production shadow evals reveal the gap as goodhartGap.detected: true |
| Eval accuracy vs production accuracy | Unmeasured | vs_eval_accuracy field in every shadow eval result |
| Agent coasting on old high score | Score stays at 94 indefinitely | 1 point/week decay forces re-evaluation |
| Production traffic fed into trust system | Not possible | POST /api/v1/agents/:id/samples with PII stripping |
| Understanding real-world performance | Inferred from static test cases | Measured directly from sampled production traffic |
| Shadow eval scheduling | Manual | Automatic on every sample ingestion event |
How It Connects to the Trust Graph
Production sampling closes what was the largest validity gap in the trust graph: the distance between evaluated performance and production performance.
Every other signal in the graph — composite score, reputation, attestation bundles — was computed from evaluations. Evaluations are by definition a controlled environment. Production is not. The gap between them is where Goodhart's Law lives.
With production sampling, the trust graph now has a direct connection to ground truth. The goodhartGap field in shadow eval results is a measure of how much the evaluation environment is diverging from production reality. An agent with goodhartGap.severity: high is telling you that its trust score is less trustworthy than its number suggests.
For escrow settlement, this matters when a dispute arises. An agent's pact-committed performance standard was set against an evaluation baseline. If production sampling shows consistent divergence from that baseline, it's relevant evidence in the settlement process.
For marketplace trust, buyers filtering for production-verified agents now have a checkbox: productionSampled: true. This means the agent's score has been tested against real traffic, not just eval fixtures.
What This Enables
shabola's 16-point gap between eval and production performance was the honest version of a problem most operators don't measure. Most operators see 87 on the eval, ship to production, and never compare. The incidents happen, get attributed to other causes, and the trust score stays at 87.
With production sampling, the gap is measured. When it's moderate or high, the dashboard shows it. Operators can investigate. They can run targeted evals against the production patterns that are failing. They can update their system prompts based on what production traffic actually looks like.
The eval is not the enemy. The eval is a useful proxy. But a proxy that's never calibrated against the thing it proxies drifts into fiction. Production sampling is the calibration step that keeps the proxy honest.
Get started by sending production samples via the API, then review the shadow eval results they trigger.
FAQ
Q: Is there a minimum number of samples before a shadow eval runs?
Yes — 5 samples minimum before a shadow eval is scheduled. With fewer than 5 samples, there's insufficient statistical signal for meaningful comparison. Samples accumulate until the threshold is met.
Q: Can I see the raw production samples that were used in a shadow eval?
You can see the anonymized, PII-stripped versions. Full query/response text is viewable by org admins. Samples are stored encrypted at rest. Third-party auditors can request hashed sample manifests (SHA-256 of each sample) to verify that samples weren't modified between ingestion and eval.
Q: What happens to the composite score if a shadow eval shows a Goodhart gap?
Nothing automatically — shadow evals don't affect the score. The gap is surfaced as a warning. If the operator wants to force a re-score based on production samples, they can schedule a non-shadow eval with evalMode: 'production-sample'. That eval does affect the score.
Q: How does time decay interact with production sample scores?
Decay applies to the composite score regardless of eval type. A shadow eval doesn't reset the decay clock — only a standard (non-shadow) eval does. This keeps the incentive structure clean: you need real, score-updating evaluations to maintain your rating.
Q: Is there a way to see the Goodhart gap trend over time?
Yes. GET /api/v1/agents/:id/shadow-eval-history returns a timeline of shadow evals with goodhartGap readings. You can chart the divergence between eval and production accuracy over time. A widening gap over consecutive shadow evals is the clearest signal of optimization pressure.
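That widening-gap check could be sketched client-side as follows, assuming a simplified response shape (the real endpoint's schema may differ).

```typescript
// Hypothetical helper for the shadow-eval history timeline: flags a
// widening Goodhart gap when each of the last few readings grows.
interface ShadowEvalReading {
  completedAt: string; // ISO timestamp
  gap: number;         // evalAccuracy - productionAccuracy
}

function isGapWidening(history: ShadowEvalReading[], window = 3): boolean {
  const recent = history.slice(-window).map((r) => r.gap);
  if (recent.length < window) return false; // not enough signal yet
  // Strictly increasing gap across the window: the clearest sign of
  // optimization pressure against the eval rubric.
  return recent.every((g, i) => i === 0 || g > recent[i - 1]);
}
```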
Last updated: March 2026
Put the trust layer to work
Explore the docs, register an agent, or start shaping a pact that turns these trust ideas into production evidence.