Admin Swarm Gauntlet: Detecting Nominal Success Without Verifiable Work | Armalo Labs | Armalo AI

Eval Methodology · May 11, 2026 · 5 min read

Admin Swarm Gauntlet: Detecting Nominal Success Without Verifiable Work

Q: Where is this research published?

Armalo Labs Technical Series — https://www.armalo.ai/labs/research/2026-05-11-admin-swarm-gauntlet-deep-eval. The paper is publicly available and citable.

Armalo Labs Research Team

Abstract

A public-safe summary of a behavioral evaluation of Armalo's admin swarm. The run evaluated 14 agent roles across eight dimensions and found that nominal success can hide missing tool evidence, weak memory writeback, and poor recovery unless liveness and work are scored separately.

# Admin Swarm Gauntlet: Detecting Nominal Success Without Verifiable Work

This paper is a public-safe replacement for an earlier internal-style gauntlet dump. The original version mixed legitimate research findings with implementation file paths, prompt-edit instructions, queue mechanics, and role-specific engineering work items. That material is not appropriate for an external research library. The publishable result is narrower and stronger: a behavioral evaluation method for detecting when an agent swarm looks alive but fails to leave durable work evidence.

The frame is consistent with the NIST AI Risk Management Framework's emphasis on validity, reliability, accountability, and monitoring ([NIST AI RMF](https://www.nist.gov/itl/ai-risk-management-framework)) and with the reliability-engineering distinction between liveness signals and useful service-level indicators ([Google SRE, Monitoring Distributed Systems](https://sre.google/sre-book/monitoring-distributed-systems/)). Agent swarms need the same distinction: a process can be running while the work product is missing.

Mechanism

The evaluation scored 14 agent roles across eight dimensions:

| Dimension | Public question |
| --- | --- |
| Decision quality | Did the role leave reviewable decisions? |
| Tool correctness | Did the role create evidence of expected action? |
| Anti-confabulation | Did summaries stay grounded in observable facts? |
| Adversarial robustness | |

Cite this work

Armalo Labs Research Team (2026). Admin Swarm Gauntlet: Detecting Nominal Success Without Verifiable Work. Armalo Labs Technical Series, Armalo AI. https://www.armalo.ai/labs/research/2026-05-11-admin-swarm-gauntlet-deep-eval

Armalo Labs Technical Series · ISSN pending

| Observed pattern | Why it matters |
| --- | --- |
| High apparent success with sparse action evidence | A dashboard can make an idle agent look productive |
| Missing memory writeback | The system cannot compound from prior work |
| Repeated narrative summaries | Text continuity can masquerade as operational continuity |
| Error recovery disconnected from prior failures | Failures become recurring costs rather than learning signals |
| Weak coordination evidence | A swarm degrades into isolated agents running side by side |
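
The first pattern in the table above, high apparent success with sparse action evidence, can be illustrated with a small check. This is a minimal sketch, not the evaluation's actual tooling: the `RunRecord` fields, the function name, and both thresholds are assumptions chosen for illustration, and the role names in the usage example are placeholders.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class RunRecord:
    """One run of one role. All field names are hypothetical."""
    role: str
    reported_success: bool  # what a dashboard would show
    action_records: int     # durable action records left by the run


def nominal_success_roles(
    runs: Iterable[RunRecord],
    min_success_rate: float = 0.9,   # assumed threshold, not from the paper
    min_evidence_rate: float = 0.5,  # assumed threshold, not from the paper
) -> List[str]:
    """Return roles that report success often but rarely leave action evidence."""
    by_role: Dict[str, List[RunRecord]] = {}
    for run in runs:
        by_role.setdefault(run.role, []).append(run)

    flagged = []
    for role, role_runs in by_role.items():
        successes = [r for r in role_runs if r.reported_success]
        if not successes:
            continue  # no claimed success, so nothing nominal to detect
        success_rate = len(successes) / len(role_runs)
        # Share of "successful" runs that are backed by at least one action record.
        evidence_rate = sum(r.action_records > 0 for r in successes) / len(successes)
        if success_rate >= min_success_rate and evidence_rate < min_evidence_rate:
            flagged.append(role)
    return sorted(flagged)


# Usage with placeholder role names.
runs = [RunRecord("role-a", True, 0)] * 10 + [RunRecord("role-b", True, 2)] * 10
flagged = nominal_success_roles(runs)
```

Keeping the reported success rate and the evidence rate as two separate numbers is the point of the sketch: averaging them into one score would let a high success rate mask missing evidence.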

| Level | Evidence class | Promotion consequence |
| --- | --- | --- |
| Liveness | The role ran | Never enough for success credit |
| Activity | The role attempted work | Eligible for triage |
| Action | The role left a durable action record | Eligible for work credit |
| Learning | The role wrote reusable memory tied to observed evidence | Eligible for improvement credit |
| Coordination | The role changed another role's next useful action | Eligible for swarm credit |
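
The evidence ladder in the table above could be encoded roughly as follows. This is a sketch under stated assumptions rather than Armalo's implementation: the class and field names are hypothetical, and the ladder is treated as cumulative (a level counts only if every lower level is also evidenced), which the table suggests but does not state.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class EvidenceLevel(IntEnum):
    LIVENESS = 1      # the role ran
    ACTIVITY = 2      # the role attempted work
    ACTION = 3        # the role left a durable action record
    LEARNING = 4      # the role wrote reusable memory tied to observed evidence
    COORDINATION = 5  # the role changed another role's next useful action


# Promotion consequences, copied from the table.
CONSEQUENCE = {
    EvidenceLevel.LIVENESS: "never enough for success credit",
    EvidenceLevel.ACTIVITY: "eligible for triage",
    EvidenceLevel.ACTION: "eligible for work credit",
    EvidenceLevel.LEARNING: "eligible for improvement credit",
    EvidenceLevel.COORDINATION: "eligible for swarm credit",
}


@dataclass(frozen=True)
class RoleEvidence:
    """Observed evidence for one role. Field names are hypothetical."""
    ran: bool = False
    attempted_work: bool = False
    durable_action_record: bool = False
    memory_tied_to_evidence: bool = False
    changed_peer_next_action: bool = False


def highest_level(evidence: RoleEvidence) -> Optional[EvidenceLevel]:
    """Climb the ladder from the bottom and stop at the first missing rung.

    Assumption: the ladder is cumulative, so later evidence does not count
    if an earlier rung is missing.
    """
    rungs = [
        (EvidenceLevel.LIVENESS, evidence.ran),
        (EvidenceLevel.ACTIVITY, evidence.attempted_work),
        (EvidenceLevel.ACTION, evidence.durable_action_record),
        (EvidenceLevel.LEARNING, evidence.memory_tied_to_evidence),
        (EvidenceLevel.COORDINATION, evidence.changed_peer_next_action),
    ]
    reached = None
    for level, present in rungs:
        if not present:
            break
        reached = level
    return reached


def promotion_consequence(evidence: RoleEvidence) -> str:
    level = highest_level(evidence)
    return "no evidence recorded" if level is None else CONSEQUENCE[level]
```

Under this encoding, a role that ran and attempted work but left no durable action record resolves to `EvidenceLevel.ACTIVITY` and is eligible only for triage, however healthy its liveness signal looks.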
