Q: Where is this research published?

Armalo Labs Technical Series — https://www.armalo.ai/labs/research/2026-05-18-agent-behavioral-consistency. The paper is publicly available and citable.

Behavioral Specialization in Production AI Agent Swarms: A 94,828-Heartbeat Analysis

Q: What is the paper "Behavioral Specialization in Production AI Agent Swarms: A 94,828-Heartbeat Analysis" about?

We analyzed 94,828 heartbeats emitted by 60 distinct agent roles across 310 unique loops in the Armalo admin swarm to characterize behavioral consistency and role specialization in a production multi-agent system. The headline success rate of 62.88% is systematically misleading: two of the highest-volume roles have inverted success semantics, where the expected outcome of a healthy cycle is recorded as a failure. Correcting for this reveals a system where infrastructure roles operate at 85–100% success, coding agents cluster near 66–85%, and the apparent outliers are artifacts of the measurement schema rather than operational problems. Concentration is extreme: three roles account for 69% of all heartbeats. The mean of 5.17 loops per role indicates a degree of functional specialization that has direct implications for how trust scores should be scoped and compared across agents.

Introduction

Autonomous agent swarms generate continuous behavioral telemetry as a byproduct of operation. Every time an agent loop completes — whether it dispatched a trade, reviewed a pull request, routed an LLM call, or scanned the system for anomalies — it emits a heartbeat record: a structured summary of what ran, how long it took, and whether the loop outcome was a success or failure by that role's own criteria.

This telemetry is typically used operationally — to alert on agent death, to debug loop regressions, to verify that scheduled jobs are firing. It is rarely treated as a primary data source for studying behavioral consistency and role specialization at scale. This paper does exactly that.

The Armalo admin swarm is a production multi-agent system running continuously on AWS ECS Fargate. It comprises agents with explicitly differentiated roles: trading execution, LLM routing, code review, rollback arbitration, policy governance, market intelligence, and copy-trade mirroring, among others. Each role operates its own loop structure and emits heartbeats on distinct cadences. The result is a high-resolution behavioral record across a functionally diverse population of agents.

We analyze 94,828 heartbeats spanning 60 unique roles and 310 unique loops to answer three questions: what does the aggregate behavioral signal look like, how much do roles differ from each other, and what does the distribution of activity across roles reveal about system design and specialization?

Section 1: Measurement Design

A heartbeat is a single structured record emitted at the completion of one agent loop iteration. It captures the role identifier, the loop name, start and end timestamps, and an outcome field — typically success, failure, or a role-specific variant. Duration is derived from the timestamp pair.
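As a minimal sketch of the record described above, the structure could be modeled as follows. The field names, the loop name, and the example timestamp are assumptions made for illustration; the source does not publish the actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Heartbeat:
    """One record emitted at the end of a single loop iteration (hypothetical schema)."""
    role: str             # role identifier, e.g. "rollback-worker"
    loop: str             # loop name within the role
    started_at: datetime
    ended_at: datetime
    outcome: str          # "success", "failure", or a role-specific variant

    @property
    def duration_ms(self) -> float:
        # Duration is not stored; it is derived from the timestamp pair.
        return (self.ended_at - self.started_at) / timedelta(milliseconds=1)


start = datetime(2026, 5, 1, 12, 0, 0, tzinfo=timezone.utc)  # illustrative timestamp
hb = Heartbeat(
    role="rollback-worker",
    loop="arbitrate",  # hypothetical loop name
    started_at=start,
    ended_at=start + timedelta(milliseconds=77),
    outcome="success",
)
print(hb.role, hb.outcome, hb.duration_ms)
```

Deriving the duration rather than storing it keeps the record free of a field that could disagree with the timestamps.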

The critical design question is what "success" and "failure" mean for a given role. The answer is not uniform across the swarm, so naively aggregating the success field across all roles produces a misleading number.

Consider two cases. The watchdog role runs continuously and scans for system anomalies. In a healthy system, most cycles find nothing to flag. The watchdog's loop is designed to record failure when it finds no active issues — the absence of a findable problem is the expected default state, not a loop malfunction. A watchdog success rate near zero is the signature of a quiet, well-functioning system. A watchdog success rate near 100% would indicate a system under continuous stress.

The polyexecutor role submits limit orders to a prediction market exchange. Exchange markets are thin; FAK (Fill-and-Kill) orders frequently do not fill. An unfilled order is recorded as a non-success outcome because no execution occurred, but this is the expected distribution in low-liquidity conditions — it is not a bug in the executor or evidence of agent malfunction. High volume with low fill rate is the normal operational profile for a market-making agent working against a sparse orderbook.

These two roles together represent a substantial fraction of the total heartbeat corpus. Their semantics must be treated as a correction factor before any aggregate success rate is interpreted as a system health indicator.
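The correction can be pictured as a role-aware mapping from the raw outcome field to a health reading. The sketch below is an illustration under assumptions, not the swarm's actual logic: the function name and the two labels "nominal" and "attention" are hypothetical, and only the two roles discussed above receive special handling.

```python
def health_signal(role: str, outcome: str) -> str:
    """Map a raw heartbeat outcome to a health reading under role-aware semantics.

    Returns "nominal" or "attention". Both labels are hypothetical.
    """
    succeeded = outcome == "success"
    if role == "watchdog":
        # The watchdog records failure when it finds nothing to flag,
        # so a raw success is read here as "an issue was found".
        return "attention" if succeeded else "nominal"
    if role == "polyexecutor":
        # An unfilled order is the expected result against a thin orderbook,
        # so neither outcome is treated as a malfunction in this sketch.
        return "nominal"
    # Assumption: every other role uses the plain reading of the field.
    return "nominal" if succeeded else "attention"


for role, outcome in [("watchdog", "failure"), ("watchdog", "success"),
                      ("polyexecutor", "failure"), ("governor", "failure")]:
    print(role, outcome, "->", health_signal(role, outcome))
```

A production version would need an audited semantics entry for every role rather than a default branch.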

Section 2: Aggregate Results

Across 94,828 heartbeats, the overall success rate is 62.88% and the mean loop duration is 16,052ms.

The 62.88% figure is not the right headline for system health. It includes 25,985 heartbeats from polyexecutor (7.8% success by trading-fill semantics) and 3,092 heartbeats from watchdog (0.6% success by anomaly-detection semantics). Together these two roles contribute 29,077 heartbeats — 30.7% of the total corpus — with success rates that are semantically inverted relative to operational health.

Excluding these two roles from the aggregate changes the picture substantially. The remaining 65,751 heartbeats span roles where a success outcome reflects genuine loop completion. Among these, the weighted success rate is materially higher than 62.88%.
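The size of that shift can be estimated from the figures already quoted. The sketch below back-computes the success rate of the remaining roles from the rounded percentages in this section, so the result is an approximation (about 87.6%) and is not a value read from the measurement artifact.

```python
TOTAL_HEARTBEATS = 94_828
OVERALL_SUCCESS_RATE = 0.6288  # rounded figure quoted in the text

# (heartbeats, success rate) for the two excluded roles, as quoted in the text.
EXCLUDED = {
    "polyexecutor": (25_985, 0.078),
    "watchdog": (3_092, 0.006),
}

total_successes = TOTAL_HEARTBEATS * OVERALL_SUCCESS_RATE
excluded_heartbeats = sum(n for n, _ in EXCLUDED.values())
excluded_successes = sum(n * rate for n, rate in EXCLUDED.values())

remaining_heartbeats = TOTAL_HEARTBEATS - excluded_heartbeats
remaining_rate = (total_successes - excluded_successes) / remaining_heartbeats

print(remaining_heartbeats)      # 65751
print(f"{remaining_rate:.1%}")   # approximate, limited by input rounding
```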

The appropriate interpretation of the 62.88% figure is: this is what a naive aggregation of a functionally diverse swarm looks like when roles with inverted success semantics are not separated out. It is a measurement schema artifact, not a reliability signal. Any trust scoring framework that applies a single success-rate threshold across all agent roles without accounting for role-specific outcome semantics will produce systematically distorted scores for watchdog-type and market-execution-type agents.

The mean duration of 16,052ms is similarly heterogeneous. The rollback-worker averages 77ms per cycle, while roles involving LLM inference, multi-step orchestration, or on-chain operations run for tens of seconds. The mean therefore smooths over a distribution spanning several orders of magnitude.
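One way to see the heterogeneity is to remove the rollback-worker from the overall mean. The sketch below uses the role's 15,517 heartbeats and treats its 77ms figure as a mean duration. On those assumptions, the implied mean for all other roles is roughly 19 seconds, which is a derived estimate and not a measured value.

```python
TOTAL_HEARTBEATS = 94_828
OVERALL_MEAN_MS = 16_052
ROLLBACK_HEARTBEATS = 15_517
ROLLBACK_MEAN_MS = 77  # assumed to be a mean, per the role-level figures

# Total time across all heartbeats, then subtract the rollback-worker's share.
total_ms = TOTAL_HEARTBEATS * OVERALL_MEAN_MS
rollback_ms = ROLLBACK_HEARTBEATS * ROLLBACK_MEAN_MS

other_heartbeats = TOTAL_HEARTBEATS - ROLLBACK_HEARTBEATS
other_mean_ms = (total_ms - rollback_ms) / other_heartbeats

print(other_heartbeats)
print(f"{other_mean_ms / 1000:.1f} s implied mean for all other roles")
```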

Section 3: Role-Level Analysis

The top 10 roles by heartbeat volume reveal four distinct behavioral archetypes.

Role	Heartbeats	Success Rate	Notes
polyexecutor	25,985	7.8%	Trading fill semantics — unfilled orders are expected
llm-dispatch	23,315	85.1%	Infrastructure router
rollback-worker	15,517	99.9%	Mean duration 77ms; deterministic arbitration
watchdog	3,092	0.6%	Inverted semantics — 0% = quiet system
governor	2,670	90.3%	Policy enforcement
operator	2,571	98.5%

High-volume infrastructure roles (llm-dispatch, rollback-worker): These run at very high frequency and exhibit success rates of 85.1% and 99.9% respectively. The rollback-worker's 77ms mean duration indicates highly deterministic execution — it performs structured decision logic with minimal external dependencies. The llm-dispatch router's 85.1% rate reflects the realistic failure modes of a system routing LLM calls across multiple providers with external API dependencies.

Low-rate operational roles (polyexecutor, watchdog): Both exhibit low success rates for structurally correct reasons. Polyexecutor's 7.8% fill rate reflects exchange liquidity conditions. Watchdog's 0.6% success rate reflects a low-incidence alert environment. Neither is malfunctioning.

Specialist roles (governor, operator, copy-trading, polyscout): These run at lower volume but exhibit high reliability. Governor at 90.3%, operator at 98.5%, copy-trading at 100.0%, and polyscout at 99.4% represent roles whose scope is tightly constrained and whose success criteria align naturally with the binary success field. These are the roles where the success metric is most directly interpretable.

Coding agents (openai-codex, claudecode): Both exhibit success rates in the 66–85% range. Coding tasks involve multi-step synthesis with real failure modes — type errors, merge conflicts, test failures — so non-trivial failure rates are expected and informative. The 18.9 percentage point gap between claudecode (85.1%) and openai-codex (66.2%) is a measured behavioral difference across two coding agents operating in the same codebase.

Section 4: Specialization Structure

Three roles — polyexecutor, llm-dispatch, and rollback-worker — account for 69% of all 94,828 heartbeats. This is a highly concentrated activity distribution. The remaining 57 roles collectively account for 31% of swarm activity.
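The concentration figure can be recomputed from the per-role heartbeat counts in the Section 3 table. With those counts the share comes out at about 68.4%, slightly under the 69% quoted above. The sketch is an arithmetic check on the published counts only.

```python
TOTAL_HEARTBEATS = 94_828
TOTAL_ROLES = 60

# Heartbeat counts for the three highest-volume roles (Section 3 table).
TOP_THREE = {
    "polyexecutor": 25_985,
    "llm-dispatch": 23_315,
    "rollback-worker": 15_517,
}

top_three_total = sum(TOP_THREE.values())
top_three_share = top_three_total / TOTAL_HEARTBEATS
remaining_roles = TOTAL_ROLES - len(TOP_THREE)

print(top_three_total)             # 64817
print(f"{top_three_share:.1%}")    # share of all heartbeats
print(remaining_roles)             # 57
```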

The mean loop specialization is 5.17 loops per role, which is the 310 unique loops divided across the 60 roles. The distribution around that mean is uneven: some roles are decomposed into a large number of distinct loop types, each capturing a specific sub-task of the broader role's mission, while most roles operate only a small number of loops.

The concentration of activity in three roles is not a design flaw — it reflects the operational reality that LLM routing, trading execution, and rollback arbitration are the highest-frequency operations in the swarm. A trading system submits orders continuously; an LLM router handles every inference request; a rollback arbiter runs on every code change. These are structurally high-cadence functions.

The implication is that the swarm's behavioral signal is dominated by these three roles. Aggregate metrics — mean success rate, mean duration, total throughput — primarily reflect the behavior of polyexecutor, llm-dispatch, and rollback-worker. Monitoring frameworks that do not disaggregate by role will have their aggregate indicators driven almost entirely by these three, masking the behavior of the other 57 roles.

Section 5: Implications for Trust Infrastructure

Behavioral consistency across time is a core input to agent trust scoring. An agent that performs its loops at a stable success rate and stable duration across measurement windows is demonstrating one of the properties that trust certification tries to capture.

But this analysis reveals that cross-role comparison of success rates is structurally invalid without role-specific semantic correction. Comparing polyexecutor's 7.8% to rollback-worker's 99.9% as if both numbers mean the same thing would produce a distorted trust ranking. Any trust scoring system that ingests raw success-rate heartbeat data without role-aware semantic normalization will misrank agents systematically.

The practical requirement is role-scoped trust scoring: each role's behavioral consistency is evaluated against its own historical baseline and its own semantic definition of success, not against a universal threshold. A watchdog that maintains a 0.5–0.7% success rate across 90 days is demonstrating consistent behavior. A watchdog that suddenly spikes to a 15% success rate is flagging a system anomaly, which is operationally significant information, just in the opposite direction from the usual reading.
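A role-scoped check of this kind could be sketched as follows. The function name, the z-score rule, the threshold, and the baseline windows are all assumptions made for illustration; only the 0.5–0.7% band and the 15% spike come from the text.

```python
from statistics import mean, pstdev


def deviates_from_baseline(
    current_rate: float,
    baseline_rates: list[float],
    z_threshold: float = 3.0,
    min_spread: float = 0.001,
) -> bool:
    """Return True if current_rate sits far from the role's OWN baseline.

    The test is two-sided: a jump up or down relative to the role's history
    is flagged, regardless of whether the absolute rate is high or low.
    min_spread avoids dividing by a near-zero deviation for very stable roles.
    """
    center = mean(baseline_rates)
    spread = max(pstdev(baseline_rates), min_spread)
    return abs(current_rate - center) / spread > z_threshold


# Hypothetical per-window watchdog success rates inside the 0.5-0.7% band.
watchdog_baseline = [0.005, 0.006, 0.007, 0.006]

print(deviates_from_baseline(0.0065, watchdog_baseline))  # inside the band
print(deviates_from_baseline(0.15, watchdog_baseline))    # the 15% spike
```

Because the comparison is against each role's own history, the same function applies unchanged to a role that normally runs near 99.9%.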

Roles that should receive independent trust scores based on this data include any role whose success semantics differ from the swarm's default reading of the success field, any role with high enough volume to produce a statistically meaningful independent signal, and any role where the activity concentration is sufficient to dominate aggregate metrics. By those criteria, polyexecutor, llm-dispatch, rollback-worker, watchdog, and the two coding agents all warrant distinct trust tracking.

The coding agents merit particular attention. With 1,887 and 1,507 heartbeats respectively, openai-codex and claudecode have enough volume to produce stable performance estimates. Their 18.9 percentage point success rate gap is large enough to be actionable — it is a measured behavioral difference between two agents operating in the same environment on the same class of task.
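The stability of that gap can be gauged with a rough confidence interval. The sketch below uses a normal approximation for the difference of two proportions. It assumes heartbeats are independent and that the two agents face comparable tasks, neither of which the source establishes. Under those assumptions, the 95% interval runs from roughly 16 to 22 percentage points.

```python
from math import sqrt


def diff_ci(p1: float, n1: int, p2: float, n2: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation confidence interval for p2 - p1 (unpooled standard error)."""
    diff = p2 - p1
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff - z * se, diff + z * se


# Success rates and heartbeat counts as quoted in the text.
low, high = diff_ci(p1=0.662, n1=1_887, p2=0.851, n2=1_507)
print(f"gap between {low:.1%} and {high:.1%}")
```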

Section 6: Limits and Future Work

Several limitations constrain interpretation of this analysis.

Loop attribution gaps. The 310 unique loops are not uniformly distributed, and for some roles the loop-level breakdown is not available in the aggregate data used here. Role-level analysis masks intra-role behavioral heterogeneity — a role that operates 50 distinct loops may have high success rates on 48 and systematic failure on 2, which the role-level success rate will not reveal.
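The masking effect is easy to reproduce with made-up numbers. In the sketch below every count is hypothetical: a role has 50 loops of 100 heartbeats each, 48 loops always succeed, and 2 always fail.

```python
# Hypothetical role: loop name -> (heartbeats, successes).
loops = {f"loop-{i:02d}": (100, 100) for i in range(48)}
loops["loop-48"] = (100, 0)
loops["loop-49"] = (100, 0)

total = sum(n for n, _ in loops.values())
successes = sum(s for _, s in loops.values())
role_rate = successes / total

# Loop-level view: which loops fall below a 50% success rate?
failing = sorted(name for name, (n, s) in loops.items() if s / n < 0.5)

print(f"role-level rate: {role_rate:.0%}")
print("failing loops:", failing)
```

The role-level rate of 96% looks healthy, while the loop-level view shows two loops that never succeed.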

Cross-role comparison caveats. The success field semantics are role-defined and were not systematically audited across all 60 roles before this analysis. The semantic inversion for watchdog and polyexecutor was identified from existing documentation and code notes; other roles may have partially inverted or context-dependent semantics that this analysis does not capture.

Duration heterogeneity. The mean duration of 16,052ms is driven by high-duration roles. The rollback-worker's 77ms and trading roles' short cycles anchor one end; LLM-intensive roles anchor the other. Duration distribution analysis (not just means) would better characterize the agent population.

Temporal scope. This analysis treats the full heartbeat corpus as a static snapshot. Behavioral consistency is a temporal property — it requires comparing a role's current behavioral distribution to its baseline across time windows. A single-snapshot analysis can characterize the current state but cannot distinguish stable behavior from recent drift.
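A time-windowed view could start from something like the sketch below, which buckets one role's heartbeats into fixed windows and reports a success rate per window. The function name, the input shape, and the sample records are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def windowed_success_rates(records, window=timedelta(days=1)):
    """Group (timestamp, succeeded) pairs for ONE role into fixed windows.

    Returns {window_start: success_rate}, ordered by window start.
    Timestamps must be timezone-aware.
    """
    counts = defaultdict(lambda: [0, 0])  # window index -> [total, successes]
    for ts, ok in records:
        index = (ts - EPOCH) // window    # integer window index
        counts[index][0] += 1
        counts[index][1] += int(ok)
    return {EPOCH + i * window: s / n for i, (n, s) in sorted(counts.items())}


# Hypothetical records: day one is fully successful, day two is half successful.
day1 = datetime(2026, 5, 1, 9, 0, tzinfo=timezone.utc)
day2 = day1 + timedelta(days=1)
records = [(day1 + timedelta(minutes=m), True) for m in range(4)]
records += [(day2 + timedelta(minutes=m), m % 2 == 0) for m in range(4)]

rates = windowed_success_rates(records)
for start, rate in rates.items():
    print(start.date(), f"{rate:.0%}")
```

Comparing the latest window against earlier ones is what separates stable behavior from recent drift.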

Future work should disaggregate to the loop level within each role, apply time-windowed consistency analysis, and develop semantic normalization schemas for the success field across all 60 roles. The heartbeat corpus is a rich source of behavioral signal; the analysis here represents the aggregate surface.

Replication

All measurements in this paper were produced by running:

The published measurement artifact named in the claims registry is the reproducibility anchor; reviewers can recompute the aggregates from that artifact without public exposure of internal runner paths.

Raw output is committed in the published measurement artifact. All numeric claims in this paper are read directly from that file. No values were estimated, projected, or derived independently of the measurement script output.