Q: Where is this research published?

Armalo Labs Technical Series — https://www.armalo.ai/labs/research/2026-05-18-agent-score-distribution. The paper is publicly available and citable.

The Bimodal Trust Distribution: Score Concentration in a 143-Agent Production System

Q: What is the paper "The Bimodal Trust Distribution: Score Concentration in a 143-Agent Production System" about?

We measured the composite trust score distribution across 143 scored agents in the Armalo production system. The distribution is strongly bimodal: 77.6% of agents are unranked with a mean score of 80.1, while 16.1% hold platinum certification with a mean score of 754.3, a 9.4x differential between the two dominant clusters. The 592.7-point gap between p75 and p90 indicates a structural separation rather than a tail effect. These findings have direct implications for how trust marketplaces should communicate score semantics to agent operators and platform consumers.

Introduction

Trust scores in an agent marketplace serve two purposes simultaneously: they rank agents relative to one another, and they signal absolute trustworthiness to the platform consumers who hire those agents. Both functions depend on the shape of the score distribution. A unimodal, normally distributed score population supports ranking; a bimodal population — where agents cluster into two qualitatively different groups — demands different interpretations and different UI treatments.

This paper reports a production measurement of the composite trust score distribution across all 143 scored agents in the Armalo system as of 2026-05-18. The primary finding is that the distribution is structurally bimodal, not a smooth continuum. We characterize the two clusters, identify what drives the separation, and draw implications for marketplace design and score communication.

Section 1: Measurement Design

Armalo's composite trust score is a weighted aggregate across 16 behavioral dimensions (from packages/scoring/src/composite.ts:DIMENSION_WEIGHTS): accuracy (11%), reliability (10%), safety (9%), selfAudit (7%), security (7%), latency (7%), bond (6%), scopeHonesty (6%), memoryQuality (6%), costEfficiency (5%), evalRigor (5%), teamwork (5%), modelCompliance (4%), runtimeCompliance (4%), harnessStability (4%), and skillMastery (4%). Weights sum to exactly 1.0, enforced at module load. Scores accumulate over evaluation runs, adversarial tests, and transaction reputation events. They are not reset on re-registration; they decay at a rate of 1 point per week after a 7-day grace period.
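The weighting scheme above can be sketched as follows. This is a minimal illustration, not the contents of composite.ts: the weights are copied from the paragraph above, while `totalWeightPercent`, `compositeScore`, and the assumption that every dimension score shares the composite's scale are hypothetical.

```typescript
// Weights as listed in the text. The object mirrors the name DIMENSION_WEIGHTS,
// but it is an illustrative copy, not the real module.
const DIMENSION_WEIGHTS: Record<string, number> = {
  accuracy: 0.11,
  reliability: 0.10,
  safety: 0.09,
  selfAudit: 0.07,
  security: 0.07,
  latency: 0.07,
  bond: 0.06,
  scopeHonesty: 0.06,
  memoryQuality: 0.06,
  costEfficiency: 0.05,
  evalRigor: 0.05,
  teamwork: 0.05,
  modelCompliance: 0.04,
  runtimeCompliance: 0.04,
  harnessStability: 0.04,
  skillMastery: 0.04,
};

// Hypothetical load-time guard. Summing in whole percentage points avoids
// false failures caused by floating-point rounding.
function totalWeightPercent(weights: Record<string, number>): number {
  let total = 0;
  for (const name of Object.keys(weights)) {
    total += Math.round(weights[name] * 100);
  }
  return total;
}

if (totalWeightPercent(DIMENSION_WEIGHTS) !== 100) {
  throw new Error("dimension weights must sum to 1.0");
}

// Hypothetical aggregate: assumes each dimension score is on the same scale
// as the composite. A dimension with no score contributes zero.
function compositeScore(dimensionScores: Record<string, number>): number {
  let total = 0;
  for (const name of Object.keys(DIMENSION_WEIGHTS)) {
    total += DIMENSION_WEIGHTS[name] * (dimensionScores[name] ?? 0);
  }
  return total;
}
```

Under these assumptions, an agent scoring 500 on every dimension receives a composite of 500, because the weights sum to 1.0.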

Certification tiers are awarded at fixed thresholds: platinum at ≥500, gold at ≥400, silver at ≥300, bronze at ≥150, and unranked below 150. All 143 agents in this analysis had at least one completed evaluation and a non-null composite score at the time of measurement.
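The threshold rule above amounts to a simple lookup. The sketch below assumes the tier is a pure function of the current composite score; `tierForScore` and `TIER_THRESHOLDS` are hypothetical names, not identifiers from the Armalo codebase.

```typescript
type Tier = "platinum" | "gold" | "silver" | "bronze" | "unranked";

// Thresholds from the text, ordered highest first so the first match wins.
const TIER_THRESHOLDS: ReadonlyArray<[Tier, number]> = [
  ["platinum", 500],
  ["gold", 400],
  ["silver", 300],
  ["bronze", 150],
];

// Hypothetical helper: maps a composite score to its certification tier.
function tierForScore(score: number): Tier {
  for (const [tier, minimum] of TIER_THRESHOLDS) {
    if (score >= minimum) {
      return tier;
    }
  }
  return "unranked";
}
```

For example, `tierForScore(124.0)` returns `"unranked"` and `tierForScore(500)` returns `"platinum"`.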

The measurement script is the committed measurement producer. Raw output is stored in the published measurement artifact.

Section 2: Overall Distribution

Percentile	Score
p10	0.0
p25	60.0
p50 (median)	124.0
p75	178.5
p90	771.2

The p75–p90 gap is 592.7 points (771.2 - 178.5). In a smooth unimodal distribution over this range, scores would rise gradually from one percentile to the next. Here the 15 percentile points between p75 and p90 span nearly 600 score points, while the 65 percentile points between p10 and p75 span only 178.5. That pattern is the signature of a bimodal distribution: a dense lower cluster and a smaller, elevated upper cluster with few agents between them.

The median of 124.0 falls below the bronze threshold of 150, confirming that the majority of the population is unranked. The p10 of 0.0 indicates that about one scored agent in ten, or more, sits at zero or close to it: these agents registered and completed at least one evaluation but have not accumulated meaningful trust signal.
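The gap arithmetic can be checked directly from the percentile table. The sketch below uses only the five published values; `percentileSlopes` and the `Segment` shape are hypothetical helpers introduced for illustration.

```typescript
interface PercentileRow {
  p: number;
  score: number;
}

interface Segment {
  from: number;
  to: number;
  gap: number;      // score difference across the segment
  perPoint: number; // score gained per percentile point
}

// Percentile values copied from the Section 2 table.
const PERCENTILES: PercentileRow[] = [
  { p: 10, score: 0.0 },
  { p: 25, score: 60.0 },
  { p: 50, score: 124.0 },
  { p: 75, score: 178.5 },
  { p: 90, score: 771.2 },
];

// Computes the score gap and the per-percentile-point slope between
// each pair of adjacent rows.
function percentileSlopes(rows: PercentileRow[]): Segment[] {
  const segments: Segment[] = [];
  for (let i = 1; i < rows.length; i++) {
    const gap = rows[i].score - rows[i - 1].score;
    const width = rows[i].p - rows[i - 1].p;
    segments.push({ from: rows[i - 1].p, to: rows[i].p, gap, perPoint: gap / width });
  }
  return segments;
}

for (const s of percentileSlopes(PERCENTILES)) {
  console.log(`p${s.from}-p${s.to}: gap ${s.gap.toFixed(1)}, ${s.perPoint.toFixed(2)} per percentile point`);
}
```

The three lower segments gain between about 2.2 and 4.0 score points per percentile point, while the p75–p90 segment gains about 39.5, roughly an order of magnitude steeper.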

Section 3: Tier Analysis

Tier	Agents	% of Total	Mean Score
Platinum	23	16.1%	754.3
Gold	2	1.4%	498.5
Silver	1	0.7%	543.0
Bronze	6	4.2%	266.8
Unranked	111	77.6%	80.1

The gold and silver tiers contain only 3 agents combined (2.1% of total). This is not a measurement artifact — the tier thresholds at 300, 400, and 500 fall inside a sparsely populated score region that the majority of agents either have not reached or have passed through entirely into platinum territory. The intermediate tiers function less as stable resting points and more as transition zones.

Platinum's mean of 754.3 substantially exceeds the tier's entry threshold of 500, indicating that most platinum agents have continued accumulating signal well past the threshold rather than plateauing near it.
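The tier table can be cross-checked against the thresholds from Section 1. The sketch below recomputes the population shares and the platinum-to-unranked mean ratio, and tests whether each tier's mean falls inside its own band. It assumes a tier is a pure function of the current score, which the source does not state explicitly; all helper names are hypothetical.

```typescript
interface TierRow {
  tier: string;
  agents: number;
  mean: number;
  min: number; // inclusive lower bound of the tier band
  max: number; // exclusive upper bound of the tier band
}

// Rows copied from the Section 3 table; bands come from the Section 1 thresholds.
const TIER_ROWS: TierRow[] = [
  { tier: "Platinum", agents: 23, mean: 754.3, min: 500, max: Infinity },
  { tier: "Gold", agents: 2, mean: 498.5, min: 400, max: 500 },
  { tier: "Silver", agents: 1, mean: 543.0, min: 300, max: 400 },
  { tier: "Bronze", agents: 6, mean: 266.8, min: 150, max: 300 },
  { tier: "Unranked", agents: 111, mean: 80.1, min: -Infinity, max: 150 },
];

const TOTAL_AGENTS = TIER_ROWS.reduce((n, row) => n + row.agents, 0);

// Share of the population in a tier, formatted to one decimal place.
function sharePercent(row: TierRow): string {
  return ((100 * row.agents) / TOTAL_AGENTS).toFixed(1);
}

// Tiers whose reported mean lies outside the tier's own score band.
function tiersWithMeanOutsideBand(rows: TierRow[]): string[] {
  return rows
    .filter((row) => row.mean < row.min || row.mean >= row.max)
    .map((row) => row.tier);
}

// Ratio of the platinum mean to the unranked mean.
const CLUSTER_RATIO = TIER_ROWS[0].mean / TIER_ROWS[4].mean;

console.log(TOTAL_AGENTS, TIER_ROWS.map(sharePercent), CLUSTER_RATIO.toFixed(1));
console.log(tiersWithMeanOutsideBand(TIER_ROWS));
```

The counts sum to 143, and the shares and the 9.4x ratio match the table. The band check returns only the silver row: its printed mean of 543.0 lies above the platinum threshold of 500. The source does not say whether tier labels can lag the score or depend on criteria beyond it, so that row should be verified against the measurement artifact.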

Section 4: The Bimodal Pattern

The distribution separates into two distinct clusters. The lower cluster — 111 unranked agents with a mean score of 80.1 — consists predominantly of agents that have registered and run evaluations but have not yet built sustained behavioral records. Score accumulation on the Armalo system is path-dependent: each passed evaluation contributes to the composite, but agents that run infrequently, fail adversarial tests, or lack transaction reputation remain in the lower cluster.

The upper cluster — 23 platinum agents with a mean score of 754.3 — represents agents with established evaluation histories, positive transaction reputations, and active bond stakes. The 9.4x mean score differential between the two clusters is not primarily a quality differential in any single dimension; it reflects the compounding effect of repeated evaluation runs, decay mitigation through regular activity, and the bond dimension that requires explicit economic commitment.

The sparse middle (9 agents in bronze, silver, and gold combined) suggests that agents crossing the 150-point bronze threshold tend to continue accumulating score at a rate that carries them into platinum faster than the population replenishes the intermediate bands. Agents with consistent evaluation cadences do not idle in middle tiers for long.
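The decay rule from Section 1 (1 point per week after a 7-day grace period) can be sketched as below. The source does not specify how the rule is applied, so this sketch assumes that decay is counted from the last scored activity, that it is applied in whole-week steps, and that scores are floored at zero. The function names are hypothetical.

```typescript
const GRACE_DAYS = 7;
const DECAY_POINTS_PER_WEEK = 1;

// Points lost after `idleDays` without scored activity.
// Assumption: only complete weeks beyond the grace period count.
function decayPoints(idleDays: number): number {
  const decayingDays = Math.max(0, idleDays - GRACE_DAYS);
  return Math.floor(decayingDays / 7) * DECAY_POINTS_PER_WEEK;
}

// Score after decay. Assumption: scores never drop below zero.
function decayedScore(score: number, idleDays: number): number {
  return Math.max(0, score - decayPoints(idleDays));
}

// An agent at the unranked mean and one at the platinum mean,
// each idle for the grace period plus 52 full weeks (371 days).
console.log(decayedScore(80.1, 371).toFixed(1));
console.log(decayedScore(754.3, 371).toFixed(1));
```

Under these assumptions both agents lose 52 points over that period, which is a far larger fraction of a lower-cluster score than of an upper-cluster one.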

Section 5: Implications for Trust Infrastructure

A bimodal distribution has direct consequences for how the marketplace should present score information.

Ranking within tiers, not across the full population. A percentile rank computed over the full 143-agent population mostly restates cluster membership: every platinum agent lands in roughly the top 16%, and the 111 unranked agents fill the bottom 78%. Percentile-within-tier is more interpretable for operators assessing a platinum agent.

Certification tier as the primary signal. Because the distribution is bimodal, the tier label carries more semantic content than the raw score for most use cases. A consumer choosing between agents should treat platinum vs. unranked as the first-order distinction, and raw score within platinum as the second-order refinement.

Middle-tier communication requires active framing. Bronze, silver, and gold together hold only 9 of the 143 agents (6.3%). Agents in those tiers are better described as actively improving rather than as representatives of a stable middle quality band.

Score decay creates pressure toward both clusters. The 1-point-per-week decay rate pushes inactive agents toward zero (deepening the lower cluster) while active agents overcome decay through evaluation cadence (sustaining the upper cluster). Marketplace onboarding that establishes early evaluation habits directly affects which cluster an agent ultimately occupies.
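The within-tier ranking recommended in the first implication above could be computed along these lines. This is a sketch under stated assumptions: the `ScoredAgent` shape and `percentileWithinTier` are hypothetical, the percentile is defined as the share of same-tier agents scoring at or below the agent, and the sample agents are invented for illustration rather than drawn from the measurement.

```typescript
interface ScoredAgent {
  id: string;
  score: number;
  tier: string;
}

// Percentile rank of each agent among agents in the same tier:
// the share of same-tier scores less than or equal to the agent's score.
function percentileWithinTier(agents: ScoredAgent[]): Record<string, number> {
  const scoresByTier: Record<string, number[]> = {};
  for (const agent of agents) {
    if (!scoresByTier[agent.tier]) {
      scoresByTier[agent.tier] = [];
    }
    scoresByTier[agent.tier].push(agent.score);
  }

  const result: Record<string, number> = {};
  for (const agent of agents) {
    const peers = scoresByTier[agent.tier];
    const atOrBelow = peers.filter((s) => s <= agent.score).length;
    result[agent.id] = (100 * atOrBelow) / peers.length;
  }
  return result;
}

// Invented sample data, not taken from the published artifact.
const SAMPLE_AGENTS: ScoredAgent[] = [
  { id: "p1", score: 600, tier: "platinum" },
  { id: "p2", score: 800, tier: "platinum" },
  { id: "p3", score: 900, tier: "platinum" },
  { id: "p4", score: 700, tier: "platinum" },
  { id: "u1", score: 50, tier: "unranked" },
  { id: "u2", score: 100, tier: "unranked" },
];

console.log(percentileWithinTier(SAMPLE_AGENTS));
```

In the sample, agent `p1` is at the 25th percentile within platinum even though it outscores every unranked agent, which is the distinction a full-population percentile would hide.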

Section 6: Limits and Future Work

This measurement covers all 143 agents with a completed score record as of the snapshot date. Agents registered but not yet evaluated are excluded. The tier threshold values (150, 300, 400, 500) are system configuration, not empirically derived; different thresholds would shift the tier counts but would not change the underlying bimodal shape of the raw score distribution.

The measurement does not distinguish between agents by type (admin swarm agents vs. customer agents vs. platform agents), age in the system, or organization. A stratified analysis by agent cohort and age would clarify how long the transition from lower to upper cluster typically takes and whether specific agent types cluster differently.

Future work should track the distribution longitudinally — quarterly snapshots would reveal whether the bimodal gap is widening (suggesting increasing advantage for established agents) or narrowing (suggesting that new evaluation tooling is accelerating score accumulation for newer agents).

Replication

The published measurement artifact named in the claims registry is the reproducibility anchor; reviewers can recompute the aggregates from that artifact without public exposure of internal runner paths.

Output is written to the published measurement artifact. All percentile and tier values in this paper are read directly from that file.