Multi-agent systems are typically evaluated by aggregate metrics: total throughput, overall success rate, mean latency. These aggregates are convenient but misleading when the agent population is heterogeneous, for example when a system includes both deterministic infrastructure agents that execute the same operation thousands of times per day and probabilistic intelligence agents that attempt complex judgment tasks with inherently uncertain outcomes.
This paper characterizes behavioral specialization in a production multi-agent administrative system across 94,828 heartbeat records covering 60 distinct roles and 310 unique behavioral loops. The goal is to show what role-level analysis reveals that aggregate analysis obscures.
1. System Overview
The Armalo administrative swarm is a persistent multi-agent system running on AWS ECS. Agents operate on scheduled intervals ranging from 10 seconds (copy-trading hot path) to 24 hours (daily intelligence loops). Every agent execution produces a jarvis_heartbeats record capturing role, loop name, outcome, duration, and action category.
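A minimal sketch of what such a record and a per-role rollup might look like. The source states which attributes a jarvis_heartbeats record captures but not its schema, so the `HeartbeatRecord` class, its field names, the `"success"` outcome label, and the sample rows are all assumptions for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HeartbeatRecord:
    """Hypothetical shape of one heartbeat row (field names are assumed)."""
    role: str
    loop_name: str
    outcome: str                  # assumed labels, e.g. "success" or "failure"
    duration_ms: Optional[float]  # None when no duration was recorded
    action_category: str


def summarize_by_role(records):
    """Return {role: {"count", "success_rate", "mean_duration_ms"}}."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec.role].append(rec)

    summary = {}
    for role, recs in grouped.items():
        # Skip missing durations so they do not distort the mean.
        durations = [r.duration_ms for r in recs if r.duration_ms is not None]
        summary[role] = {
            "count": len(recs),
            "success_rate": sum(r.outcome == "success" for r in recs) / len(recs),
            "mean_duration_ms": sum(durations) / len(durations) if durations else None,
        }
    return summary


# Made-up sample rows, not taken from the dataset.
sample = [
    HeartbeatRecord("rollback-worker", "check-rollback", "success", 70.0, "infra"),
    HeartbeatRecord("rollback-worker", "check-rollback", "success", 84.0, "infra"),
    HeartbeatRecord("polyexecutor", "place-order", "failure", None, "trading"),
    HeartbeatRecord("polyexecutor", "place-order", "success", None, "trading"),
]
role_summary = summarize_by_role(sample)
```

Grouping by role before computing any rate is the step that separates role-level analysis from aggregate analysis; the same records pooled together would yield a single blended figure.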
Aggregate statistics across all 94,828 heartbeats:
- Unique roles: 60
- Unique loops: 310
- Overall success rate: 62.88%
- Mean duration: 16,052ms
- Observation window: platform lifetime (earliest heartbeats predate the 90-day analysis window)
The 62.88% aggregate success rate is the headline number that aggregate analysis would report. Role-level analysis shows why this number is almost meaningless.
2. Volume Concentration
The top 10 roles by heartbeat volume account for 85.5% of all heartbeats:
| Role | Heartbeats | % of Total | Success Rate | Mean Duration |
|---|---|---|---|---|
| polyexecutor | 25,985 | 27.4% | 7.8% | n/a |
| llm-dispatch | 23,315 | 24.6% | 85.1% | 76,875ms |
| rollback-worker | 15,517 | 16.4% | 99.9% | 77ms |
| watchdog | 3,092 | 3.3% | 0.6% | 1,130ms |
| governor | 2,670 | 2.8% | 90.3% | |
The top three roles alone (polyexecutor + llm-dispatch + rollback-worker) account for 68.4% of all heartbeats. Any aggregate metric computed over this dataset is dominated by the behavior of these three roles.
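The dominance of high-volume roles can be checked with a short calculation. The sketch below uses only the rows printed in the table above, treats a pooled success rate as a volume-weighted mean of the per-role rates, and shows how far one role can move it. The function names are illustrative, not part of the replication script.

```python
# Rows copied from the table above: (role, heartbeats, success rate).
ROWS = [
    ("polyexecutor", 25_985, 0.078),
    ("llm-dispatch", 23_315, 0.851),
    ("rollback-worker", 15_517, 0.999),
    ("watchdog", 3_092, 0.006),
    ("governor", 2_670, 0.903),
]
TOTAL_HEARTBEATS = 94_828  # all roles, including those not listed in ROWS


def volume_share(rows, total):
    """Fraction of all heartbeats contributed by each listed role."""
    return {role: count / total for role, count, _ in rows}


def weighted_success_rate(rows):
    """Pooled success rate over the given rows, weighted by heartbeat count."""
    total = sum(count for _, count, _ in rows)
    return sum(count * rate for _, count, rate in rows) / total


shares = volume_share(ROWS, TOTAL_HEARTBEATS)
top_three_share = sum(
    shares[role] for role in ("polyexecutor", "llm-dispatch", "rollback-worker")
)
pooled_rate = weighted_success_rate(ROWS)
pooled_without_polyexecutor = weighted_success_rate(
    [row for row in ROWS if row[0] != "polyexecutor"]
)
```

Because these rows cover only part of the dataset, `pooled_rate` is not expected to reproduce the 62.88% figure. The point is the gap between `pooled_rate` and `pooled_without_polyexecutor`: removing a single high-volume role shifts the pooled number by tens of percentage points.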
3. The Specialization Taxonomy
Role-level analysis reveals five structurally distinct role categories, each with characteristic success rate signatures:
Category 1: Deterministic Execution
Representative role: rollback-worker (15,517 heartbeats, 99.9% success, 77ms mean)
Rollback-worker executes one mechanical operation: check whether a recently deployed agent branch should be rolled back based on predetermined criteria. Every execution either finds nothing to do (fast success) or executes a rollback (also fast, also deterministic). The 99.9% success rate reflects the absence of probabilistic elements: there are no ambiguous inputs, no LLM judgment required, no external state that can be in an unexpected condition.
Category 2: High-Frequency Trading Execution
Representative role: polyexecutor (25,985 heartbeats, 7.8% success)
The 7.8% success rate for polyexecutor does not indicate a broken system; it indicates a trading execution system operating against a live market. Most order attempts in thin prediction markets fail for one of three reasons: insufficient liquidity at the target price, market closure between signal and execution, or fill-and-kill (FAK) order expiry. A success rate in the 7 to 15% range is expected for limit-order execution in illiquid prediction markets. Evaluating polyexecutor against the aggregate 62.88% benchmark would incorrectly classify it as underperforming.
Category 3: Monitoring and Watchdog
Representative role: watchdog (3,092 heartbeats, 0.6% success)
The watchdog's 0.6% success rate reflects its operational semantics: a watchdog runs to check for problems, and "success" means problems were found and acted upon. Most watchdog cycles complete with an empty result set: no anomalies to report. An empty result is not a failure from the system's perspective; it is a healthy signal. The watchdog's "success" is measured differently from the trading executor's "success."
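One way to make this distinction explicit is to record three outcomes rather than two. The sketch below is an assumption about how such a recoding could work, not a description of the system's actual schema: the labels `acted`, `idle`, and `error` are hypothetical, and the sample cycle counts are made up.

```python
from collections import Counter


def watchdog_rates(outcomes):
    """Split watchdog cycles into acted / idle / error and report each rate.

    `outcomes` is an iterable of hypothetical labels:
      "acted" - anomalies were found and acted upon
      "idle"  - the check ran cleanly and found nothing
      "error" - the check itself failed to run
    """
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no outcomes to summarize")
    return {
        "action_rate": counts["acted"] / total,
        "healthy_rate": (counts["acted"] + counts["idle"]) / total,
        "error_rate": counts["error"] / total,
    }


# Made-up sample: mostly clean checks, a few actions, one failed run.
cycles = ["idle"] * 97 + ["acted"] * 2 + ["error"]
rates = watchdog_rates(cycles)
```

Under this recoding, a low `action_rate` and a high `healthy_rate` can coexist, which matches the reading of the watchdog's figure given above.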
Category 4: LLM Intelligence Agents
Representative roles: llm-dispatch (85.1%), ceo (95.5%), cto (96.6%), operator (98.5%), cs (96.3%)
Intelligence agents (roles that invoke LLM calls to perform reasoning, communication, and decision-making) show a middle-tier success rate in the 80 to 98% range. Their success rates reflect a combination of LLM call reliability, tool availability, and the quality of their behavioral loops. The variation within this category (85.1% for llm-dispatch vs. 98.5% for operator) reflects differences in task complexity and execution environment.
Category 5: Autonomous Coding Agents
Representative roles: openai-codex (66.2%), claudecode (85.1%)
Autonomous coding agents show lower success rates than other intelligence agents, consistent with the higher task complexity. Code generation, repository manipulation, and type-check validation have more failure modes than a structured LLM call with a well-defined output format.
4. The 310 Behavioral Loops
310 distinct loop names have executed at least once in the swarm's history. This number reflects the evolutionary nature of the system: loops are added as new capabilities are needed, deprecated loops leave their names in the history, and specialized sub-loops proliferate from parent loops as operational complexity grows.
The ratio of 310 loops to 60 roles (an average of 5.17 loops per role) understates the concentration: the top roles have dozens of distinct loops while many specialized roles have only one or two. The 310 figure represents the cumulative behavioral vocabulary of the swarm: the full set of distinct operational patterns that have been executed at least once.
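The gap between the average and the typical role can be illustrated by counting distinct loop names per role and comparing the mean with the median. The (role, loop name) pairs below are made up for illustration and are not drawn from the dataset; only the counting technique is the point.

```python
from collections import defaultdict
from statistics import mean, median


def loops_per_role(pairs):
    """Count distinct loop names seen for each role."""
    loops = defaultdict(set)
    for role, loop_name in pairs:
        loops[role].add(loop_name)  # a set ignores repeated executions
    return {role: len(names) for role, names in loops.items()}


# Hypothetical (role, loop name) pairs, one per heartbeat.
observed = [
    ("ceo", "daily-brief"), ("ceo", "weekly-review"), ("ceo", "daily-brief"),
    ("ceo", "hiring-scan"), ("ceo", "budget-check"),
    ("watchdog", "scan"),
    ("rollback-worker", "check-rollback"),
]
loop_counts = loops_per_role(observed)
mean_loops = mean(list(loop_counts.values()))
median_loops = median(list(loop_counts.values()))
```

When a few roles own most of the loops, the mean sits above the median, so an average such as 5.17 loops per role says little about what a typical role looks like.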
5. Duration Stratification
The 77ms mean for rollback-worker vs. 76,875ms for llm-dispatch reflects the computational cost of different role categories:
| Category | Typical Duration |
|---|---|
| Deterministic execution (rollback-worker) | 77ms |
| Monitoring (governor, watchdog) | 1,017-1,130ms |
| Intelligence loops (ceo, cto, operator, cs) | 1,258-1,849ms |
| LLM dispatch chain | 76,875ms |
The llm-dispatch mean of 76,875ms reflects its function: it manages multi-step LLM inference chains that may involve multiple provider calls, retries, and aggregation steps. Its success rate of 85.1% is achieved over very long execution windows.
6. Implications for Multi-Agent System Design
The data suggests three design principles for production multi-agent systems:
Success rate is role-relative. A 7.8% success rate for a trading execution agent is acceptable; a 7.8% success rate for a customer service agent is not. Aggregate success rate metrics should be replaced with role-normalized scores.
Volume concentration requires explicit management. Three roles handling 68.4% of all heartbeats means that the behavior of those three roles dominates system-wide metrics. Infrastructure for the high-volume roles deserves proportional investment.
Loop proliferation is a signal of system health, not sprawl. 310 behavioral loops serving 60 roles reflects a system that has evolved to handle increasingly specific operational scenarios. Loops that are no longer needed should be retired, but the proliferation itself is expected and desirable in a mature autonomous system.
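The first principle, role-relative success rates, can be sketched as a check against a per-category expected band rather than a single global benchmark. The two bands below reuse ranges quoted in Section 3 (7 to 15% for trading execution, 80 to 98% for LLM intelligence agents); treating them as evaluation thresholds, and the category keys and status labels themselves, are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpectedBand:
    """Inclusive range of success rates considered normal for a category."""
    low: float
    high: float


# Ranges quoted in the text; using them as thresholds is an assumption.
BANDS = {
    "trading_execution": ExpectedBand(0.07, 0.15),
    "llm_intelligence": ExpectedBand(0.80, 0.98),
}


def assess(category, success_rate, bands=BANDS):
    """Classify a success rate relative to its own category's band."""
    band = bands[category]
    if success_rate < band.low:
        return "below_expected"
    if success_rate > band.high:
        return "above_expected"
    return "within_expected"


# The same 7.8% rate reads differently depending on the role category.
polyexecutor_status = assess("trading_execution", 0.078)
same_rate_for_cs_agent = assess("llm_intelligence", 0.078)
```

Here the identical rate is classified as within the expected band for a trading executor and below it for an intelligence agent, which is the behavior a role-normalized score is meant to capture.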
Replication
node scripts/research-experiments/agent-behavioral-consistency-2026.mjs
Raw data: apps/web/content/research/data/agent-behavioral-consistency-2026.json.