Q: Where is this research published?

Armalo Labs Technical Series — https://www.armalo.ai/labs/research/2026-05-18-autonomous-swarm-behavioral-specialization. The paper is publicly available and citable.

310 Behavioral Loops, 60 Roles: Measuring Specialization in a Production Multi-Agent System

Q: What is the paper "310 Behavioral Loops, 60 Roles: Measuring Specialization in a Production Multi-Agent System" about?

We characterize behavioral specialization patterns in a production multi-agent system comprising 60 distinct agent roles executing across 310 behavioral loops over 94,828 recorded heartbeats. The system exhibits extreme volume concentration: the top three roles (polyexecutor, llm-dispatch, rollback-worker) account for 68.4% of all heartbeats. Specialization is measurable by role: rollback-worker achieves a 99.9% success rate with a 77ms mean duration (mechanical execution), while polyexecutor achieves a 7.8% success rate with no duration data (trading execution with a high natural failure rate). The overall system success rate is 62.88%, but this aggregate masks a range across roles from 0.6% (watchdog) to 100%. We argue that cross-role success rate aggregation is meaningless without role-aware normalization, and present a specialization taxonomy distinguishing deterministic-execution, trading-execution, monitoring, LLM-intelligence, and autonomous-coding roles. All data are reproducible from the committed measurement producer.

Multi-agent systems are typically evaluated by aggregate metrics: total throughput, overall success rate, mean latency. These aggregates are convenient but misleading when the agent population is heterogeneous — when a system includes both deterministic infrastructure agents that execute the same operation thousands of times per day and probabilistic intelligence agents that attempt complex judgment tasks with inherently uncertain outcomes.

This paper characterizes behavioral specialization in a production multi-agent administrative system across 94,828 heartbeat records covering 60 distinct roles and 310 unique behavioral loops. The goal is to show what role-level analysis reveals that aggregate analysis obscures.

1. System Overview

The Armalo administrative swarm is a persistent multi-agent system running on AWS ECS. Agents operate on scheduled intervals ranging from 10 seconds (copy-trading hot path) to 24 hours (daily intelligence loops). Every agent execution produces a jarvis_heartbeats record capturing role, loop name, outcome, duration, and action category.
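As a rough illustration of the record described above, the sketch below models one heartbeat with the five fields the text names. The class name, field names, types, and the sample loop and category labels are assumptions made for illustration; the actual jarvis_heartbeats schema is not given here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Heartbeat:
    """One agent execution, mirroring the fields listed in the text.

    Field names and types are illustrative assumptions, not the real schema.
    """
    role: str                    # e.g. "rollback-worker"
    loop_name: str               # one of the behavioral loops
    succeeded: bool              # recorded outcome
    duration_ms: Optional[int]   # None when no duration was recorded
    action_category: str         # hypothetical label


def success_rate(heartbeats: list[Heartbeat]) -> float:
    """Fraction of heartbeats whose outcome was a success (0.0 if empty)."""
    if not heartbeats:
        return 0.0
    return sum(h.succeeded for h in heartbeats) / len(heartbeats)


# Two made-up records; loop names and categories are placeholders.
sample = [
    Heartbeat("rollback-worker", "rollback-check", True, 77, "rollback"),
    Heartbeat("polyexecutor", "order-attempt", False, None, "trade"),
]
print(success_rate(sample))
```

Making duration optional reflects that some roles, such as polyexecutor, report no duration data.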

Aggregate statistics across all 94,828 heartbeats:

Unique roles: 60
Unique loops: 310
Overall success rate: 62.88%
Mean duration: 16,052ms
Observation window: platform lifetime (earliest heartbeats predate the 90-day analysis window)

The 62.88% aggregate success rate is the headline number that aggregate analysis would report. Role-level analysis shows why this number is almost meaningless.

2. Volume Concentration

The top 10 roles by heartbeat volume account for 85.5% of all heartbeats. The five largest are:

Role	Heartbeats	% of Total	Success Rate	Mean Duration
polyexecutor	25,985	27.4%	7.8%	—
llm-dispatch	23,315	24.6%	85.1%	76,875ms
rollback-worker	15,517	16.4%	99.9%	77ms
watchdog	3,092	3.3%	0.6%	1,130ms
governor	2,670	2.8%	90.3%	1,017ms

The top three roles alone (polyexecutor + llm-dispatch + rollback-worker) account for 68.4% of all heartbeats. Any aggregate metric computed over this dataset is dominated by the behavior of these three roles.
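The concentration figure can be rechecked from the table above. The sketch below recomputes the top-three volume share and, as a rough estimate, how much of the aggregate success rate those three roles supply. The success counts are reconstructed from the rounded per-role rates, so the second number is approximate.

```python
TOTAL_HEARTBEATS = 94_828

# role -> (heartbeats, success rate), copied from the table above
top_three = {
    "polyexecutor": (25_985, 0.078),
    "llm-dispatch": (23_315, 0.851),
    "rollback-worker": (15_517, 0.999),
}

# Share of all heartbeats produced by the three largest roles.
top_three_volume = sum(count for count, _ in top_three.values())
volume_share = top_three_volume / TOTAL_HEARTBEATS

# Approximate successes from rounded rates, expressed as a share of ALL
# heartbeats: the part of the aggregate success rate these roles supply.
approx_successes = sum(count * rate for count, rate in top_three.values())
contribution = approx_successes / TOTAL_HEARTBEATS

print(f"top-three volume share: {volume_share:.1%}")
print(f"top-three contribution to aggregate success rate: {contribution:.1%}")
```

Under these inputs the three roles supply roughly 39 of the 62.88 percentage points, which illustrates how strongly the aggregate tracks their behavior.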

3. The Specialization Taxonomy

Role-level analysis reveals five structurally distinct role categories, each with characteristic success rate signatures:

Category 1: Deterministic Execution

Representative role: rollback-worker (15,517 heartbeats, 99.9% success, 77ms mean)

Rollback-worker executes one mechanical operation: check whether a recently deployed agent branch should be rolled back based on predetermined criteria. Every execution either finds nothing to do (fast success) or executes a rollback (also fast, also deterministic). The 99.9% success rate reflects the absence of probabilistic elements: there are no ambiguous inputs, no LLM judgment required, no external state that can be in an unexpected condition.

Category 2: High-Frequency Trading Execution

Representative role: polyexecutor (25,985 heartbeats, 7.8% success)

The 7.8% success rate for polyexecutor does not indicate a broken system; it indicates a trading execution system operating against a live market. Most order attempts in thin prediction markets fail for one of three reasons: insufficient liquidity at the target price, market closure between signal and execution, or fill-and-kill (FAK) order expiry. A success rate in the 7–15% range is expected for limit-order execution in illiquid prediction markets. Evaluating polyexecutor against the aggregate 62.88% benchmark would incorrectly classify it as underperforming.

Category 3: Monitoring and Watchdog

Representative role: watchdog (3,092 heartbeats, 0.6% success)

The watchdog's 0.6% success rate reflects its operational semantics: a watchdog runs to check for problems, and "success" means problems were found and acted upon. Most watchdog cycles complete with an empty result set — no anomalies to report. An empty result is not a failure from the system's perspective; it is a healthy signal. The watchdog's "success" is measured differently from the trading executor's "success."
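One way to make these semantics explicit is to record three outcomes instead of a binary success flag. The sketch below is an assumption about how that could look, not a description of the production schema; the enum, the function names, and the sample counts are all hypothetical.

```python
from collections import Counter
from enum import Enum


class CycleOutcome(Enum):
    ACTED = "acted"          # anomalies found and acted upon
    ALL_CLEAR = "all_clear"  # empty result set, nothing to report
    ERROR = "error"          # the check itself failed to run


def action_rate(outcomes: list[CycleOutcome]) -> float:
    """Share of cycles that found and handled a problem."""
    if not outcomes:
        return 0.0
    return Counter(outcomes)[CycleOutcome.ACTED] / len(outcomes)


def health_rate(outcomes: list[CycleOutcome]) -> float:
    """Share of cycles that completed without error (acted or all clear)."""
    if not outcomes:
        return 0.0
    return 1 - Counter(outcomes)[CycleOutcome.ERROR] / len(outcomes)


# Made-up distribution for illustration only, not measured data.
cycles = (
    [CycleOutcome.ACTED] * 6
    + [CycleOutcome.ALL_CLEAR] * 990
    + [CycleOutcome.ERROR] * 4
)
print(f"action rate: {action_rate(cycles):.1%}")
print(f"health rate: {health_rate(cycles):.1%}")
```

With this split, a low action rate and a high health rate can coexist, which is the situation the watchdog figures describe.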

Category 4: LLM Intelligence Agents

Representative roles: llm-dispatch (85.1%), ceo (95.5%), cto (96.6%), operator (98.5%), cs (96.3%)

Intelligence agents (roles that invoke LLM calls to perform reasoning, communication, and decision-making) show success rates between 85% and 99%. Their success rates reflect a combination of LLM call reliability, tool availability, and the quality of their behavioral loops. The variation within this category (85.1% for llm-dispatch vs. 98.5% for operator) reflects differences in task complexity and execution environment.

Category 5: Autonomous Coding Agents

Representative roles: openai-codex (66.2%), claudecode (85.1%)

Autonomous coding agents show success rates at or below those of other intelligence agents: openai-codex (66.2%) is the lowest in either category, while claudecode (85.1%) matches llm-dispatch. This is consistent with the higher task complexity. Code generation, repository manipulation, and type-check validation have more failure modes than a structured LLM call with a well-defined output format.
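The taxonomy can be expressed as a simple role-to-category mapping. The sketch below groups the representative roles and success rates quoted in this section and reports the range within each category. The category labels are shortened for readability and the helper name is an assumption.

```python
# role -> (category, success rate), using the figures quoted in this section
ROLES = {
    "rollback-worker": ("deterministic execution", 0.999),
    "polyexecutor": ("trading execution", 0.078),
    "watchdog": ("monitoring", 0.006),
    "llm-dispatch": ("LLM intelligence", 0.851),
    "ceo": ("LLM intelligence", 0.955),
    "cto": ("LLM intelligence", 0.966),
    "operator": ("LLM intelligence", 0.985),
    "cs": ("LLM intelligence", 0.963),
    "openai-codex": ("autonomous coding", 0.662),
    "claudecode": ("autonomous coding", 0.851),
}


def category_ranges(roles: dict) -> dict:
    """Return {category: (lowest rate, highest rate)} across member roles."""
    ranges = {}
    for category, rate in roles.values():
        low, high = ranges.get(category, (rate, rate))
        ranges[category] = (min(low, rate), max(high, rate))
    return ranges


for category, (low, high) in category_ranges(ROLES).items():
    print(f"{category}: {low:.1%} to {high:.1%}")
```

Comparing a role against the range of its own category, instead of against the system-wide figure, is the kind of role-aware reading the taxonomy is meant to support.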

4. The 310 Behavioral Loops

310 distinct loop names have executed at least once in the swarm's history. This number reflects the evolutionary nature of the system: loops are added as new capabilities are needed, deprecated loops leave their names in the history, and specialized sub-loops proliferate from parent loops as operational complexity grows.

The ratio of 310 loops to 60 roles (an average of 5.17 loops per role) understates the concentration: the top roles have dozens of distinct loops while many specialized roles have only one or two. The 310 figure represents the cumulative behavioral vocabulary of the swarm, that is, the full set of distinct operational patterns that have been executed at least once.

5. Duration Stratification

The 77ms mean for rollback-worker vs. 76,875ms for llm-dispatch reflects the computational cost of different role categories:

Category	Typical Duration
Deterministic execution (rollback-worker)	77ms
Monitoring (governor, watchdog)	1,017–1,130ms
Intelligence loops (ceo, cto, operator, cs)	1,258–1,849ms
LLM dispatch chain	76,875ms

The llm-dispatch mean of 76,875ms reflects its function: it manages multi-step LLM inference chains that may involve multiple provider calls, retries, and aggregation steps. Its success rate of 85.1% is achieved over very long execution windows.

6. Implications for Multi-Agent System Design

The data suggests three design principles for production multi-agent systems:

Success rate is role-relative. A 7.8% success rate for a trading execution agent is acceptable; a 7.8% success rate for a customer service agent is not. Aggregate success rate metrics should be replaced with role-normalized scores.

Volume concentration requires explicit management. Three roles handling 68.4% of all heartbeats means that the behavior of those three roles dominates system-wide metrics. Infrastructure for the high-volume roles deserves proportional investment.

Loop proliferation is a signal of system health, not sprawl. 310 behavioral loops serving 60 roles reflects a system that has evolved to handle increasingly specific operational scenarios. Loops that are no longer needed should be retired, but the proliferation itself is expected and desirable in a mature autonomous system.
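The first principle can be sketched as a role-aware check that compares each observed rate with a band expected for that role, in contrast to a single aggregate threshold. Only the 7–15% band for trading execution comes from the text; the other bands, the names, and the three-way verdict are placeholder assumptions, and this is one possible form of normalization, not the paper's method.

```python
from typing import NamedTuple

AGGREGATE_SUCCESS_RATE = 0.6288  # system-wide figure from the text


class Band(NamedTuple):
    low: float
    high: float


EXPECTED_BANDS = {
    "polyexecutor": Band(0.07, 0.15),     # range the text calls expected
    "rollback-worker": Band(0.99, 1.00),  # assumed placeholder
    "cs": Band(0.90, 1.00),               # assumed placeholder
}


def assess(role: str, observed: float) -> str:
    """Judge an observed success rate against the role's own expected band."""
    band = EXPECTED_BANDS[role]
    if observed < band.low:
        return "below expected"
    if observed > band.high:
        return "above expected"
    return "within expected"


def naive_flag(observed: float) -> bool:
    """Aggregate-based check: flags anything under the system-wide rate."""
    return observed < AGGREGATE_SUCCESS_RATE


# The same 7.8% rate, applied to two different roles as in the text's example.
for role in ("polyexecutor", "cs"):
    print(role, "|", assess(role, 0.078), "| naive flag:", naive_flag(0.078))
```

The aggregate check flags both roles identically, while the role-aware check separates the acceptable trading rate from the unacceptable customer service rate.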

Replication

The published measurement artifact named in the claims registry is the reproducibility anchor; reviewers can recompute the aggregates from that artifact without public exposure of internal runner paths.

Raw data: the published measurement artifact.