Admin Swarm Gauntlet: Public Lessons From a Static Behavioral Evaluation | Armalo Labs | Armalo AI

Eval Methodology · May 12, 2026 · 6 min read

Q: Where is this research published?

Armalo Labs Technical Series — https://www.armalo.ai/labs/research/2026-05-12-admin-swarm-gauntlet-deep-eval. The paper is publicly available and citable.

Armalo Labs Research Team

Abstract

A public-safe summary of a static behavioral evaluation of Armalo's admin swarm. The run evaluated 37 agent roles across eight behavioral dimensions and found the central failure mode: nominal success can hide weak verifiable work unless heartbeat, action, memory, and coordination evidence are scored together.

This paper replaces an overly internal run log with the public research lesson it should have been from the start. The original artifact exposed implementation file paths, prompt-edit instructions, role-specific remediation checklists, and operational details that belong in private engineering work management, not in an external research library. The retained public value is the evaluation method and the systemic finding.

The gauntlet studied a production multi-agent operations swarm using database evidence and static behavioral probes. It did not execute live LLM juries, synthetic adversarial probes, or private customer workflows. The evaluation asked a simpler question: when an autonomous operations agent appears alive, is there durable evidence that it did useful work?

That question sits inside a broader public assurance tradition. The NIST AI Risk Management Framework treats measurement, monitoring, and governance as core practices for trustworthy AI ([NIST AI RMF](https://www.nist.gov/itl/ai-risk-management-framework)). Site reliability practice makes a similar point in operational systems: monitoring should distinguish whether a system is merely up from whether it is serving the user-visible outcome ([Google SRE, Monitoring Distributed Systems](https://sre.google/sre-book/monitoring-distributed-systems/)). Agent-swarm evaluation needs the same separation.

Mechanism

The gauntlet scored each role across eight dimensions:

| Dimension | Public question |
| --- | --- |
| Decision quality | Did the agent leave a decision trail that a reviewer can inspect? |
| Tool correctness | Did the agent call the tools required by its mandate? |
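
As an illustration of scoring a role on several dimensions at once, the following is a minimal sketch and not the gauntlet's actual implementation. The `RoleScorecard` class, the 0-to-1 score scale, and the role name are all assumptions introduced here. Only the two dimension names shown in the table are used; the remaining dimensions are not invented.

```python
from dataclasses import dataclass, field

# Only these two dimension names come from the table above. The identifiers
# themselves and the [0, 1] scale are illustrative assumptions.
KNOWN_DIMENSIONS = ("decision_quality", "tool_correctness")


@dataclass
class RoleScorecard:
    """Hypothetical per-role record: one score in [0, 1] per dimension."""
    role: str
    scores: dict[str, float] = field(default_factory=dict)

    def record(self, dimension: str, score: float) -> None:
        """Store a score, rejecting values outside the assumed scale."""
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {dimension} must be in [0, 1]")
        self.scores[dimension] = score

    def missing(self, required=KNOWN_DIMENSIONS) -> list[str]:
        """Dimensions that have no recorded score yet."""
        return [d for d in required if d not in self.scores]

    def weakest(self) -> tuple[str, float]:
        """Lowest-scoring dimension, so a strong score cannot mask a weak one."""
        name = min(self.scores, key=self.scores.get)
        return name, self.scores[name]


card = RoleScorecard("example-role")  # hypothetical role name
card.record("decision_quality", 0.9)
card.record("tool_correctness", 0.2)
print(card.weakest())
```

Reporting the weakest dimension rather than an average is one possible design choice. It keeps a role with a clean decision trail but little tool evidence from appearing healthy overall.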

Cite this work

Armalo Labs Research Team (2026). Admin Swarm Gauntlet: Public Lessons From a Static Behavioral Evaluation. Armalo Labs Technical Series, Armalo AI. https://www.armalo.ai/labs/research/2026-05-12-admin-swarm-gauntlet-deep-eval

Armalo Labs Technical Series · ISSN pending

Evidence And Findings

| Finding | Public implication |
| --- | --- |
| High nominal success can coexist with low tool evidence | Heartbeats are not sufficient proof of work |
| Missing memory writeback prevents compounding | A swarm that does not remember cannot improve reliably |
| Repeated summaries can masquerade as operational continuity | Anti-confabulation checks need novelty and evidence requirements |
| Error recovery was weakly connected to prior failures | Failure handling should be scored as a longitudinal behavior |
| Coordination evidence was uneven | Multi-agent systems need scored handoffs, not just parallel roles |
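
The finding about repeated summaries suggests a check that combines novelty with an evidence requirement. The sketch below is one possible way to express that, not the method used in the evaluation. The function names, the 0.9 similarity threshold, and the evidence-identifier format are assumptions. Similarity is computed with the standard library's `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial edits do not count as novelty."""
    return " ".join(text.lower().split())


def is_novel(summary: str, previous: list[str], max_similarity: float = 0.9) -> bool:
    """True if the summary is not a near-duplicate of any earlier summary.

    The 0.9 threshold is an arbitrary illustrative value.
    """
    current = _normalize(summary)
    for prior in previous:
        ratio = SequenceMatcher(None, current, _normalize(prior)).ratio()
        if ratio >= max_similarity:
            return False
    return True


def counts_as_continuity(summary: str, previous: list[str], evidence_ids: list[str]) -> bool:
    """A summary earns credit only if it is novel AND cites at least one evidence record."""
    return is_novel(summary, previous) and len(evidence_ids) > 0


history = ["All systems nominal."]
print(counts_as_continuity("all systems   nominal.", history, ["record-1"]))
print(counts_as_continuity("Rotated expired API key after 401 errors", history, []))
print(counts_as_continuity("Rotated expired API key after 401 errors", history, ["record-2"]))
```

In this sketch a repeated summary fails even when it cites evidence, and a new summary fails when it cites none, which mirrors the two requirements named in the table.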

Reusable Framework

| Public check | Promotion rule |
| --- | --- |
| Heartbeat evidence | A run is only liveness evidence, not work evidence |
| Tool evidence | Success credit requires the expected tool or action record |
| Decision evidence | A decision trail must be specific enough for review |
| Memory evidence | Learning must be reusable and tied to an observed trigger |
| Recovery evidence | Later runs must react to earlier failures |
| Coordination evidence | Handoffs must name a receiver, reason, and expected outcome |
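
Three of the promotion rules above (heartbeat, tool, and coordination evidence) can be sketched as simple predicates over a run record. This is a hedged illustration under stated assumptions: `RunRecord`, `Handoff`, `evaluate_run`, and the tool names are hypothetical and do not describe Armalo's schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Handoff:
    """Hypothetical handoff record with the three fields the rule names."""
    receiver: Optional[str] = None
    reason: Optional[str] = None
    expected_outcome: Optional[str] = None

    def is_complete(self) -> bool:
        """True only if receiver, reason, and expected outcome are all non-empty."""
        fields = (self.receiver, self.reason, self.expected_outcome)
        return all(v is not None and v.strip() != "" for v in fields)


@dataclass
class RunRecord:
    """Hypothetical evidence collected for one agent run."""
    heartbeat_seen: bool
    expected_tools: set[str] = field(default_factory=set)
    tools_called: set[str] = field(default_factory=set)
    handoffs: list[Handoff] = field(default_factory=list)


def evaluate_run(run: RunRecord) -> dict[str, bool]:
    """Apply the heartbeat, tool, and coordination rules to one run."""
    # Assumption: a role with no expected tools earns no tool evidence.
    tool_evidence = bool(run.expected_tools) and run.expected_tools <= run.tools_called
    return {
        "liveness": run.heartbeat_seen,  # liveness only, never work credit
        "tool_evidence": tool_evidence,
        # Vacuously True when the run made no handoffs.
        "coordination_evidence": all(h.is_complete() for h in run.handoffs),
        "success_credit": tool_evidence,  # a heartbeat alone never grants credit
    }


alive_but_idle = RunRecord(heartbeat_seen=True, expected_tools={"query_db"})
print(evaluate_run(alive_but_idle))
```

The `alive_but_idle` run reports liveness but receives no success credit, which is the separation the heartbeat and tool rules ask for.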
