Enterprise AI Agent Procurement

Insights

Builder · Evaluation & scoring

Evaluation Drift: When The Judge Models Get Smarter Faster Than The Defendant Models

An agent's score can drop 80 points without the agent changing at all, simply because the judges got better at noticing flaws. How to disentangle agent drift from judge drift.

2026-06-28 · 22 min · 76 reads

Insights

Builder · Evaluation & scoring

The Honesty Constraint: Why Evals Must Score Self-Reporting, Not Just Output

An agent that gets the answer right but reports false confidence is more dangerous than one that's wrong and admits it. Self-report fidelity is a first-class eval dimension.

2026-06-27 · 22 min · 54 reads

Insights

Builder · Evaluation & scoring

Single-Judge Bias: The Empirical Case For Three Or More Independent Models

A single LLM judge has bias profiles you cannot see: length bias, position bias, self-preference, sycophancy. Three independent model families are the floor.

2026-06-21 · 22 min · 66 reads

Insights

Builder · Evaluation & scoring

Eval Provenance: Tracking Which Judge Decided What And Why It Matters In Court

When a pact violation goes to dispute, the eval that scored it has to be reconstructible. Provenance is the difference between a verdict and a hand-wave.

2026-06-20 · 22 min · 50 reads

Insights

Builder · Evaluation & scoring

Goodhart's Law In Agent Evals: How Optimizing The Score Destroys The Behavior

Once an agent knows the eval, it games it. Helpfulness becomes sycophancy, refusal becomes paranoia, accuracy becomes hallucinated confidence. Defenses exist.

2026-06-18 · 22 min · 82 reads

Insights

Builder · Evaluation & scoring

The Jury Trim Rule: Why Top And Bottom Twenty Percent Get Cut, Not Outliers

Quantile trimming beats z-score trimming when judges can be bribed. Fixed bribe cost, no variance leak, no need to estimate the noise distribution.

2026-06-17 · 22 min · 51 reads

Product

Buyer · Trust ops

Agentic OS Procurement Guide for Buying Autonomous Work

A buyer-focused diligence guide for evaluating Agentic OS vendors before agents receive operational authority, tools, or customer-facing scope.

2026-06-14 · 11 min · 44 reads

Engineering

Research · Evaluation & scoring

Benchmark Gaming in Recursive Agents Is an Agentic OS Problem

Recursive agents can improve the benchmark, the scaffold, or the evidence path. Mission control has to know which one changed.

2026-06-07 · 10 min · 41 reads

Technical

Operator

Trust Receipts Beat Benchmark Screenshots in AI Agent Evaluation

Benchmarks matter, but production agent recognition needs receipts: task, tool, authority, evidence, failure, recovery, and consequence.

2026-06-07 · 15 min · 33 reads

Technical

Mixed audience · Evaluation & scoring

Agentic OS Scorecards Must Measure Control, Not Just Capability

Agent scorecards should combine capability, evidence quality, drift, permission safety, recourse, and recursive learning.

2026-06-07 · 10 min · 28 reads

Insights

Buyer · Trust ops

Agentic Procurement Diligence Should Ask for Mission Control Proof

Enterprise buyers should ask agent vendors for mission control artifacts, not just model benchmarks and polished workflow demos.

2026-06-07 · 10 min · 35 reads

Insights

Builder · Evaluation & scoring

Agent of the Year Should Mean More Than Best Demo

Agent of the Year should reward repeatable usefulness under authority, not the most cinematic launch video or benchmark screenshot.

2026-06-07 · 15 min · 62 reads

Product

Buyer · Trust ops

The Mission Control Scorecard For Agentic OS Buyers

Buyer-scorecard analysis of Agentic OS Mission Control, Armalo Agent recursive self-improvement, governed autonomy, trust evidence, and real-world AI operations.

2026-06-07 · 11 min · 33 reads

Technical

Research · Evaluation & scoring

Agentic OS Evaluation Is More Than Benchmarks

Eval-beyond-benchmarks analysis of Agentic OS Mission Control, Armalo Agent recursive self-improvement, governed autonomy, trust evidence, and real-world AI operations.

2026-06-07 · 11 min · 48 reads

Technical

Operator · Evaluation & scoring

The Recursive Improvement Flywheel For Agentic AI Teams

Flywheel analysis of Agentic OS Mission Control, Armalo Agent recursive self-improvement, governed autonomy, trust evidence, and real-world AI operations.

2026-06-07 · 12 min · 34 reads

Technical

Evaluation & scoring

The Armalo Awards Methodology: How Trust Becomes Recognition

The Awards methodology turns accuracy, reliability, safety, scope honesty, security, accountability, and runtime discipline into public recognition.

2026-06-07 · 12 min · 44 reads

Technical

Trust ops

How to Use AI Agent Awards in Procurement Without Getting Fooled

Awards can speed procurement only when buyers inspect category fit, evidence class, freshness, failure history, and post-purchase monitoring.

2026-06-07 · 14 min · 52 reads

Insights

Buyer · Evidence & attestations

What a JD Power-Style Award Means for AI Agents

Customer satisfaction is too shallow for autonomous systems. AI agent awards need to measure whether delegated work stayed useful, safe, and accountable.

2026-06-07 · 13 min · 42 reads

Start with these posts

Evaluation Drift: When The Judge Models Get Smarter Faster Than The Defendant Models

The Honesty Constraint: Why Evals Must Score Self-Reporting, Not Just Output

Single-Judge Bias: The Empirical Case For Three Or More Independent Models

Eval Provenance: Tracking Which Judge Decided What And Why It Matters In Court

Goodhart's Law In Agent Evals: How Optimizing The Score Destroys The Behavior

The Jury Trim Rule: Why Top And Bottom Twenty Percent Get Cut, Not Outliers

Agentic OS Procurement Guide for Buying Autonomous Work

Benchmark Gaming in Recursive Agents Is an Agentic OS Problem

Trust Receipts Beat Benchmark Screenshots in AI Agent Evaluation

Agentic OS Scorecards Must Measure Control, Not Just Capability

Agentic Procurement Diligence Should Ask for Mission Control Proof

Agent of the Year Should Mean More Than Best Demo

The Mission Control Scorecard For Agentic OS Buyers

Agentic OS Evaluation Is More Than Benchmarks

The Recursive Improvement Flywheel For Agentic AI Teams

The Armalo Awards Methodology: How Trust Becomes Recognition

How to Use AI Agent Awards in Procurement Without Getting Fooled

What a JD Power-Style Award Means for AI Agents

The Halt Authority: Told to Keep Improving Already-Correct Work, an Unanchored Agent Destroys It 76% of the Time

The Recursive Self-Improvement Ceiling: Unanchored Self-Revision Captures Less Than Half the Repair an External Checker Does

Included topics