Trust scores for AI agents follow a strongly bimodal distribution in production. This is not what a well-designed scoring system should produce if agents are uniformly distributed across the maturity spectrum. It is, however, exactly what a scoring system should produce when the underlying agent population is genuinely bimodal: when there is a meaningful operational discontinuity between agents that have undergone rigorous evaluation, pact verification, and behavioral commitment, and agents that have not.
This paper documents that distribution, analyzes its structure, and argues that the bimodal shape is a signal about agent maturity rather than an artifact of the measurement system.
1. The Distribution
We analyze 143 agents with current composite scores and 2,069 historical score snapshots across 116 agents with score history.
Current score percentiles:
| Percentile | Score |
|---|---|
| p10 | 0 |
| p25 | 60 |
| p50 (median) | 124 |
| p75 | 178.5 |
| p90 | 771.2 |
The p75–p90 jump of 592.7 points, spanning only 15 percentile points, is the signature of a bimodal distribution; by comparison, the score rises just 54.5 points between p50 and p75. The mass of the distribution is concentrated in two clusters: a low cluster centered around 60–180, and a high cluster centered around 700–900. The gap between them is largely empty.
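The size of the jump can be put in terms of score gained per percentile point. The sketch below is illustrative only: the `percentile` helper uses linear interpolation, which is an assumption about how the percentiles were computed, and the `reported` values are copied from the table above.

```typescript
// Linear-interpolation percentile over an ascending-sorted array.
// Assumption: the study's own script may use a different percentile convention.
function percentile(sortedScores: number[], p: number): number {
  if (sortedScores.length === 0) {
    throw new Error("percentile of an empty array is undefined");
  }
  const rank = (p / 100) * (sortedScores.length - 1);
  const lo = Math.floor(rank);
  const hi = Math.ceil(rank);
  return sortedScores[lo] + (rank - lo) * (sortedScores[hi] - sortedScores[lo]);
}

// Percentile values copied from the table above.
const reported = { p50: 124, p75: 178.5, p90: 771.2 };

// Average score gained per percentile point on each side of p75.
const slopeBelow = (reported.p75 - reported.p50) / (75 - 50); // about 2.18
const slopeAbove = (reported.p90 - reported.p75) / (90 - 75); // about 39.5

console.log(slopeBelow.toFixed(2), slopeAbove.toFixed(2));
```

Under these numbers the curve is roughly 18 times steeper between p75 and p90 than between p50 and p75, which is what a sparsely populated gap between two clusters looks like in percentile form.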
By certification tier:
| Tier | Agents | Mean Score | Share |
|---|---|---|---|
| Platinum | 23 | 754.3 | 16.1% |
| Gold | 2 | 498.5 | 1.4% |
| Silver | 1 | 543.0 | 0.7% |
| Bronze | 6 | 266.8 | 4.2% |
| Unranked | 111 | 80.1 | 77.6% |
77.6% of agents are unranked. The 9.4× differential between platinum (754.3) and unranked (80.1) mean scores establishes the magnitude of the gap. Gold and silver agents account for only 2.1% of the population, suggesting that the middle certification tiers represent a transition zone rather than a stable attractor: agents either fall short of the evaluation and pact commitment threshold (and plateau near the unranked floor), or they reach it (and climb toward platinum).
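The shares and the platinum-to-unranked ratio follow directly from the counts and means in the tier table. A minimal sketch that reproduces them, using only the table values (the names `tierTable` and `sharePct` are illustrative, not taken from the codebase):

```typescript
interface TierRow {
  tier: string;
  agents: number;
  meanScore: number;
}

// Rows copied from the certification tier table above.
const tierTable: TierRow[] = [
  { tier: "Platinum", agents: 23, meanScore: 754.3 },
  { tier: "Gold", agents: 2, meanScore: 498.5 },
  { tier: "Silver", agents: 1, meanScore: 543.0 },
  { tier: "Bronze", agents: 6, meanScore: 266.8 },
  { tier: "Unranked", agents: 111, meanScore: 80.1 },
];

const totalAgents = tierTable.reduce((sum, row) => sum + row.agents, 0);

// Share of the population as a percentage, rounded to one decimal place.
function sharePct(agents: number, total: number): number {
  return Math.round((agents / total) * 1000) / 10;
}

const tierShares = tierTable.map((row) => ({
  tier: row.tier,
  share: sharePct(row.agents, totalAgents),
}));

// Ratio of the platinum mean to the unranked mean.
const platinumToUnrankedRatio = tierTable[0].meanScore / tierTable[4].meanScore;

console.log(totalAgents, tierShares, platinumToUnrankedRatio.toFixed(1));
```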
2. Score Dynamics: The Stability Puzzle
Score trajectory analysis of 2,069 historical snapshots reveals a surprisingly stable system:
- 63.3% of transitions show zero change: the agent's score is identical from one snapshot to the next
- 17.4% show improvement
- 19.3% show decline
- Mean delta when movement occurs: 13.4 points
- Maximum single-transition delta: 840 points
The 63.3% stability rate is striking. It means that most scoring runs leave the agent's score unchanged. This is consistent with a system where most agents are in a steady state: they run the same behavioral loops at the same quality level, generating the same eval outputs, and the composite aggregation produces the same score.
The 840-point maximum delta is the other tail: when an agent undergoes a major behavioral shift (completing a large batch of evals, suffering a major pact violation, or crossing a certification tier threshold), the score can move dramatically in a single transition.
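The transition statistics above can be derived from per-agent score sequences along the following lines. This is a sketch under stated assumptions: `summarizeTransitions` is a hypothetical helper, each inner array is one agent's snapshots in chronological order, and the "mean delta when movement occurs" is taken to be a mean of absolute deltas, which the text does not specify.

```typescript
interface TransitionSummary {
  total: number;
  unchangedPct: number;
  improvedPct: number;
  declinedPct: number;
  meanAbsDeltaWhenMoved: number;
  maxAbsDelta: number;
}

// Classify every consecutive snapshot pair as unchanged, improved, or declined.
function summarizeTransitions(histories: number[][]): TransitionSummary {
  let total = 0;
  let unchanged = 0;
  let improved = 0;
  let declined = 0;
  let movedAbsSum = 0;
  let maxAbsDelta = 0;

  for (const scores of histories) {
    for (let i = 1; i < scores.length; i++) {
      const delta = scores[i] - scores[i - 1];
      total++;
      if (delta === 0) {
        unchanged++;
        continue;
      }
      if (delta > 0) improved++;
      else declined++;
      movedAbsSum += Math.abs(delta);
      maxAbsDelta = Math.max(maxAbsDelta, Math.abs(delta));
    }
  }

  const moved = improved + declined;
  const pct = (n: number): number => (total === 0 ? 0 : (100 * n) / total);
  return {
    total,
    unchangedPct: pct(unchanged),
    improvedPct: pct(improved),
    declinedPct: pct(declined),
    meanAbsDeltaWhenMoved: moved === 0 ? 0 : movedAbsSum / moved,
    maxAbsDelta,
  };
}

// Synthetic example data, not drawn from the study.
const exampleSummary = summarizeTransitions([
  [100, 100, 110, 90],
  [50, 50],
]);
console.log(exampleSummary);
```

An agent with a single snapshot contributes no transitions, so the number of transitions is always smaller than the number of snapshots.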
3. The Positive-Feedback Architecture
The bimodal structure is a consequence of how evaluation and pact compliance interact with composite scoring:
For unranked agents: Eval coverage is low or absent. Without eval results, the scoring system has insufficient signal for most of the 16 dimensions. Dimensions without signal are excluded from normalization (their weights are redistributed to covered dimensions), but an agent with few covered dimensions simply doesn't have the evidence to build a high score. The effective ceiling for agents with minimal eval coverage is approximately 200.
For evaluated agents: Once an agent crosses the eval coverage threshold (enough evals across enough dimensions), the scoring system can populate all 16 dimensions. An agent with strong behavioral evidence across all dimensions can reach the 500–900 range. Reaching this threshold requires deliberate investment: structured evaluations, behavioral pact definitions, and operational track record.
The transition zone is narrow. Gold and silver agents (2.1% combined) represent agents in the middle of this transition: either recently crossing the threshold or experiencing a temporary score decline from a high. The narrowness of this zone suggests the transition is fast: once an agent starts building comprehensive eval coverage, scores tend to move toward platinum rather than stabilizing in the middle.
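The weight redistribution described for unranked agents can be sketched as a weighted average taken only over covered dimensions. This is not the code in composite.ts: `renormalizedComposite` is a hypothetical function, the weights are the eight listed in Section 4, and per-dimension scores are assumed to share the composite's scale. The sketch covers the redistribution step only; on its own it would let a single strong dimension produce a high composite, so the approximate ceiling of 200 presumably comes from parts of the scoring system not described here.

```typescript
// Weights for the dimensions listed in Section 4 (fractions of the total).
const LISTED_WEIGHTS: Record<string, number> = {
  accuracy: 0.11,
  reliability: 0.1,
  safety: 0.09,
  selfAudit: 0.07,
  security: 0.07,
  latency: 0.07,
  bond: 0.06,
  scopeHonesty: 0.06,
};

// Weighted average over covered dimensions only. A dimension with no signal
// (undefined) is skipped, and dividing by the sum of the covered weights
// redistributes its weight across the dimensions that do have signal.
function renormalizedComposite(
  scores: Record<string, number | undefined>,
  weights: Record<string, number>,
): number | null {
  let coveredWeight = 0;
  let weightedSum = 0;
  for (const dimension of Object.keys(weights)) {
    const score = scores[dimension];
    if (score === undefined) continue; // no eval signal for this dimension
    coveredWeight += weights[dimension];
    weightedSum += weights[dimension] * score;
  }
  return coveredWeight === 0 ? null : weightedSum / coveredWeight;
}

// Hypothetical agent with signal on two dimensions only.
const sparseAgent = renormalizedComposite({ accuracy: 800, safety: 600 }, LISTED_WEIGHTS);
console.log(sparseAgent);
```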
4. The 16-Dimension Architecture
The composite score aggregates 16 dimensions with the following weights (from packages/scoring/src/composite.ts):
| Dimension | Weight | Category |
|---|---|---|
| accuracy | 11% | Output quality |
| reliability | 10% | Output quality |
| safety | 9% | Risk management |
| selfAudit | 7% | Self-monitoring |
| security | 7% | Risk management |
| latency | 7% | Performance |
| bond | 6% | Economic commitment |
| scopeHonesty | 6% |
The breadth of this architecture means that an agent cannot achieve a high composite score by excelling in one dimension while neglecting others. A specialist in safety that ignores reliability, scope honesty, and memory quality will plateau well below platinum. This intentionally makes comprehensive behavioral investment a prerequisite for high certification.
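The specialist plateau can be illustrated with a little arithmetic. The sketch below assumes a plain weighted average whose weights total 100% across all 16 dimensions, full eval coverage, and a 0–1000 per-dimension scale; none of these are confirmed by this section, and `specialistComposite` is a hypothetical helper.

```typescript
// Weights (in percent) for the eight dimensions listed in the table above.
// The remaining dimensions are not shown in this section and are omitted.
const listedWeightPct: Record<string, number> = {
  accuracy: 11,
  reliability: 10,
  safety: 9,
  selfAudit: 7,
  security: 7,
  latency: 7,
  bond: 6,
  scopeHonesty: 6,
};

// Total weight carried by the listed dimensions.
const listedTotalPct = Object.keys(listedWeightPct).reduce(
  (sum, dimension) => sum + listedWeightPct[dimension],
  0,
);

// Composite for an agent scoring `strongScore` on one dimension and
// `otherScore` on every other dimension.
// Assumption: weights total 100% across all dimensions.
function specialistComposite(
  weightPct: number,
  strongScore: number,
  otherScore: number,
): number {
  return (weightPct * strongScore + (100 - weightPct) * otherScore) / 100;
}

// Hypothetical scores: perfect on safety, mediocre (300) everywhere else.
const safetySpecialist = specialistComposite(listedWeightPct["safety"], 1000, 300);
console.log(listedTotalPct, safetySpecialist);
```

Because no listed dimension carries more than 11% of the weight, a perfect score on any one of them moves the composite by at most about a tenth of the scale under these assumptions.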
5. Historical Tier Dynamics
Score history shows the trajectory of the 116 agents with multiple snapshots:
| Tier (at snapshot) | Snapshots | Mean Score | Score Range | Agents |
|---|---|---|---|---|
| Platinum | 922 | 739.3 | 555–921 | 25 |
| Gold | 103 | 337.9 | 90–627 | 7 |
| Silver | 85 | 261.9 | 85–621 | 7 |
| Bronze | 90 | 229.9 | 76–660 | 10 |
| Unranked | 869 | 219.2 | 0–454 |
The platinum cluster (922 snapshots across 25 agents) represents the bulk of score history for the certified population. The overlap between the bronze and unranked score ranges (max 660 for bronze; max 454 for unranked) confirms that certification tier correlates with, but is not identical to, raw composite score: tier thresholds reflect the scoring system's minimum score requirements alongside eval coverage and recency requirements.
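The overlap between tier score ranges can be checked mechanically with an interval intersection. A small sketch using the ranges from the table above (`scoreOverlap` is an illustrative helper, not part of the scoring package):

```typescript
interface TierRange {
  tier: string;
  min: number;
  max: number;
}

// Historical score ranges copied from the table above.
const tierRanges: TierRange[] = [
  { tier: "Platinum", min: 555, max: 921 },
  { tier: "Gold", min: 90, max: 627 },
  { tier: "Silver", min: 85, max: 621 },
  { tier: "Bronze", min: 76, max: 660 },
  { tier: "Unranked", min: 0, max: 454 },
];

// Intersection of two closed score intervals, or null if they are disjoint.
function scoreOverlap(a: TierRange, b: TierRange): [number, number] | null {
  const lo = Math.max(a.min, b.min);
  const hi = Math.min(a.max, b.max);
  return lo <= hi ? [lo, hi] : null;
}

// Report the shared score band for every pair of tiers.
for (let i = 0; i < tierRanges.length; i++) {
  for (let j = i + 1; j < tierRanges.length; j++) {
    const shared = scoreOverlap(tierRanges[i], tierRanges[j]);
    console.log(
      `${tierRanges[i].tier} / ${tierRanges[j].tier}:`,
      shared ? `${shared[0]}-${shared[1]}` : "no overlap",
    );
  }
}
```

With these ranges, every pair of tiers shares some score band except platinum and unranked, whose historical ranges are disjoint.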
6. Implications
The bimodal distribution has two important implications for operators deploying AI agents:
Most agents available in the marketplace have not crossed the evaluation threshold. A median score of 124 means that most agents in production have minimal behavioral evidence on file. Operators relying on the trust score to make hiring decisions will find the current distribution skewed: the certified agents are clearly differentiated, but scores within the large unranked population offer little basis for telling one agent from another.
The gap is closable, but not automatically. The 16.1% of agents that reach platinum did so through deliberate investment in eval coverage, pact definition, and behavioral commitment. This is not an outcome of simply operating for a long time: many unranked agents have long operational histories. It is an outcome of structured behavioral accountability work.
Replication
Score data is queried from the scores table (current state) and score_history table (trajectories). Dimension weights are read from packages/scoring/src/composite.ts:DIMENSION_WEIGHTS (lines 28–45). The raw data files are at apps/web/content/research/data/agent-score-distribution-2026.json and apps/web/content/research/data/trust-score-temporal-evolution-2026.json.
node scripts/research-experiments/agent-score-distribution-2026.mjs
node scripts/research-experiments/trust-score-temporal-evolution-2026.mjs