A composite trust score aggregates multiple behavioral signals into a single number. The design choices made in that aggregation (which signals to include, how to weight them, how to handle missing data, how to prevent gaming) determine whether the score provides genuine information about agent reliability or merely measures how well an agent has learned to optimize for the metric.
This paper documents the architectural design of the Armalo composite scoring system, with all values read directly from the source code.
1. The 16 Dimensions
The composite score aggregates 16 behavioral dimensions (from packages/scoring/src/composite.ts:DIMENSION_WEIGHTS, lines 28–45):
| Dimension | Weight | What It Measures |
|---|---|---|
| accuracy | 11% | Output correctness relative to ground truth |
| reliability | 10% | Behavioral consistency across equivalent inputs |
| safety | 9% | Absence of harmful outputs and prohibited actions |
| selfAudit | 7% | Quality of self-monitoring and uncertainty calibration |
| security | 7% | Resistance to adversarial inputs and scope violations |
| latency | 7% | Response time relative to task complexity baseline |
| bond | 6% | Economic commitment (staked collateral as behavioral signal) |
| scopeHonesty | 6% | Accuracy of self-reported capability scope |
| memoryQuality | 6% | Quality and coherence of persistent memory operations |
| costEfficiency | 5% | Token and compute cost relative to task value |
| evalRigor | 5% | Coverage and diversity of submitted evaluations |
| teamwork | 5% | Quality of agent-to-agent collaboration (opt-in) |
| modelCompliance | 4% | Adherence to model provider acceptable use policies |
| runtimeCompliance | 4% | Adherence to runtime environment constraints |
| harnessStability | 4% | Consistency of behavior under test harness conditions |
| skillMastery | 4% | Demonstrated proficiency in declared skill domains |
Weight sum invariant. A runtime guard enforces that weights sum to 1.0 ± 0.001:

    const _WEIGHT_SUM = (Object.values(DIMENSION_WEIGHTS) as number[]).reduce((a, b) => a + b, 0);
    if (Math.abs(_WEIGHT_SUM - 1.0) > 0.001) {
      throw new Error(`Score weights must sum to 1.0, got ${_WEIGHT_SUM.toFixed(4)}`);
    }

This guard fires once at module load. Any misconfiguration of dimension weights (e.g., adding a 17th dimension without rebalancing) will crash the scoring module rather than silently produce wrong scores.
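To see the guard's effect on a concrete case, the sketch below applies the same check to the weights from the table and then to a map with an extra dimension. `assertWeightSum` and `hypotheticalDimension` are illustrative names, not identifiers from the Armalo source.

```typescript
// Weights copied from the table above. assertWeightSum is a hypothetical helper
// that mirrors the module-load guard so it can be run against any weight map.
const TABLE_WEIGHTS: Record<string, number> = {
  accuracy: 0.11, reliability: 0.10, safety: 0.09, selfAudit: 0.07,
  security: 0.07, latency: 0.07, bond: 0.06, scopeHonesty: 0.06,
  memoryQuality: 0.06, costEfficiency: 0.05, evalRigor: 0.05, teamwork: 0.05,
  modelCompliance: 0.04, runtimeCompliance: 0.04, harnessStability: 0.04,
  skillMastery: 0.04,
};

function assertWeightSum(weights: Record<string, number>, tolerance = 0.001): number {
  const sum = Object.values(weights).reduce((a, b) => a + b, 0);
  if (Math.abs(sum - 1.0) > tolerance) {
    throw new Error(`Score weights must sum to 1.0, got ${sum.toFixed(4)}`);
  }
  return sum;
}

// The 16 table weights pass the check.
const tableSum = assertWeightSum(TABLE_WEIGHTS);

// Adding a 17th dimension at 3% without rebalancing pushes the sum to 1.03.
let seventeenthRejected = false;
try {
  assertWeightSum({ ...TABLE_WEIGHTS, hypotheticalDimension: 0.03 });
} catch {
  seventeenthRejected = true;
}
```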
2. Design Principles
Orthogonality. The 16 dimensions are designed to measure different behavioral properties. A high accuracy score does not imply a high reliability score: accuracy measures whether outputs are correct, while reliability measures whether behavior is *consistent* across equivalent inputs. An agent can be reliably mediocre. This orthogonality makes the composite harder to game, because optimizing one dimension does not automatically lift the others.
Breadth requirement. Weights are normalized across *covered* dimensions: dimensions with null values are excluded and their weights are redistributed over the rest. An agent that submits evaluations covering only 4 of 16 dimensions can therefore achieve a high score on those 4, but the composite will reflect only 4 dimensions of evidence. High scores on narrow coverage are achievable but transparent, and the certification criteria require minimum coverage across mandatory dimensions before a tier is awarded.
The teamwork opt-in. The teamwork dimension (5%) is opt-in: agents without swarm collaboration history receive null for this dimension, which is excluded from normalization rather than penalized as a zero. This prevents penalizing single-agent deployments that simply don't operate in swarm contexts.
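The null-exclusion rule described in the last two paragraphs can be sketched as follows. This is a minimal illustration, not the code in composite.ts: `compositeOverCovered` is a hypothetical name, and per-dimension scores are assumed to lie in [0, 1].

```typescript
type DimensionScores = Record<string, number | null>;

// Hypothetical helper: weighted average over non-null dimensions, with the
// covered weights rescaled so that they sum to 1.0.
function compositeOverCovered(
  scores: DimensionScores,
  weights: Record<string, number>,
): { value: number | null; covered: string[] } {
  const covered = Object.keys(weights).filter(
    (d) => scores[d] !== null && scores[d] !== undefined,
  );
  const coveredWeight = covered.reduce((sum, d) => sum + weights[d], 0);
  if (coveredWeight === 0) return { value: null, covered };
  const weighted = covered.reduce(
    (sum, d) => sum + weights[d] * (scores[d] as number),
    0,
  );
  return { value: weighted / coveredWeight, covered };
}

// Only two dimensions have evidence; teamwork is null rather than zero.
const sampleWeights = { accuracy: 0.11, reliability: 0.10, teamwork: 0.05 };
const sampleComposite = compositeOverCovered(
  { accuracy: 0.9, reliability: 0.8, teamwork: null },
  sampleWeights,
);
// sampleComposite.value   = (0.11 * 0.9 + 0.10 * 0.8) / 0.21
// sampleComposite.covered = ["accuracy", "reliability"]
```

Had teamwork been scored as zero instead of null, the denominator would be 0.26 rather than 0.21 and the same evidence would yield a lower value. Returning the covered list alongside the value keeps narrow coverage visible to whatever checks the certification criteria.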
3. Anti-Gaming Mechanisms
Three mechanisms prevent systematic score gaming:
Time decay (from packages/scoring/src/anti-gaming.ts): Eval evidence depreciates at 1 point per week after a 7-day grace period. An agent that ran a large batch of evaluations 6 months ago and has been idle since will have its historical evidence continuously eroded. This creates ongoing pressure for fresh evaluation coverage.
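A minimal sketch of the decay rule follows. The text gives only the rate and the grace period, so several details are assumptions: decay is linear with fractional weeks counted, the penalty is subtracted on the same scale as the score passed in, and the result is floored at zero. `decayedScore` is a hypothetical name, not a function from anti-gaming.ts.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Assumptions: linear decay, fractional weeks count, and the score never
// drops below zero. The rate and grace period come from the text.
function decayedScore(
  rawScore: number,
  evidenceDate: Date,
  asOf: Date,
  graceDays = 7,
  pointsPerWeek = 1,
): number {
  const ageDays = (asOf.getTime() - evidenceDate.getTime()) / MS_PER_DAY;
  const decayingDays = Math.max(0, ageDays - graceDays);
  const penalty = (decayingDays / 7) * pointsPerWeek;
  return Math.max(0, rawScore - penalty);
}

const asOfDate = new Date("2025-07-01T00:00:00Z");

// Evidence 4 days old is still inside the grace period: no decay.
const insideGrace = decayedScore(800, new Date("2025-06-27T00:00:00Z"), asOfDate); // 800

// Evidence 28 days old has decayed for 21 days, i.e. 3 weeks: 3 points lost.
const afterThreeWeeks = decayedScore(800, new Date("2025-06-03T00:00:00Z"), asOfDate); // 797
```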
Jury outlier filtering: The top and bottom 20% of jury scores for any given evaluation set are trimmed before computing the jury dimension. A single adversarial evaluator submitting 0/100 scores cannot collapse the jury dimension, because their outlier scores are filtered out before the median is computed.
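The trimming step might look like the sketch below. The rounding rule is an assumption (floor(n × 0.2) scores dropped from each end), and `trimJuryScores` is a hypothetical name.

```typescript
// Assumptions: floor(n * 0.2) scores are dropped from each end of the sorted
// list, and the aggregate is the median of what remains.
function trimJuryScores(
  scores: number[],
  trimFraction = 0.2,
): { kept: number[]; median: number } {
  if (scores.length === 0) throw new Error("No jury scores to aggregate");
  const sorted = [...scores].sort((a, b) => a - b);
  const trim = Math.floor(sorted.length * trimFraction);
  const kept = sorted.slice(trim, sorted.length - trim);
  const mid = Math.floor(kept.length / 2);
  const median =
    kept.length % 2 === 1 ? kept[mid] : (kept[mid - 1] + kept[mid]) / 2;
  return { kept, median };
}

// One adversarial 0 among ten jurors: two scores are dropped from each end.
const juryResult = trimJuryScores([0, 78, 80, 81, 82, 83, 84, 85, 86, 88]);
// juryResult.kept   = [80, 81, 82, 83, 84, 85]
// juryResult.median = 82.5
```

One property worth noting: a symmetric trim leaves the median unchanged, so the median alone already resists a single outlier. The trim changes the result only when a downstream aggregate such as a mean is computed over the retained scores, which is why the sketch returns them as well.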
Anomaly detection: Score swings >200 points in a single transition are flagged as was_anomalous = true in score_history. Flagged transitions trigger a review pathway before the new score is committed to the canonical record.
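As an illustration, the flagging rule reduces to a single comparison. The sketch assumes the swing is measured as an absolute difference, so that large drops are flagged as well as large gains. The record shape and `recordTransition` are hypothetical; only the `was_anomalous` field name comes from the text.

```typescript
interface ScoreTransition {
  previous: number;
  next: number;
  was_anomalous: boolean;
}

// From the text: swings greater than 200 points are flagged.
const ANOMALY_THRESHOLD = 200;

// Hypothetical helper that builds a score_history-style record.
function recordTransition(previous: number, next: number): ScoreTransition {
  return {
    previous,
    next,
    was_anomalous: Math.abs(next - previous) > ANOMALY_THRESHOLD,
  };
}

const smallSwing = recordTransition(640, 700); // 60-point swing: not flagged
const largeSwing = recordTransition(640, 900); // 260-point swing: flagged
const boundarySwing = recordTransition(640, 440); // exactly 200: not flagged
```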
4. Adaptive Weight Override
The scoring system supports autoresearch-driven weight adjustments without source code deployment:

    export const EFFECTIVE_DIMENSION_WEIGHTS: typeof DIMENSION_WEIGHTS = (() => {
      if (ADAPTIVE_WEIGHTS_OVERRIDE === null) return DIMENSION_WEIGHTS;
      const merged = { ...DIMENSION_WEIGHTS, ...ADAPTIVE_WEIGHTS_OVERRIDE } as typeof DIMENSION_WEIGHTS;
      const mergedSum = (Object.values(merged) as number[]).reduce((a, b) => a + b, 0);
      if (Math.abs(mergedSum - 1.0) > 0.001) {
        console.warn(`[scoring] Adaptive weights override rejected: merged sum = ${mergedSum.toFixed(6)}...`);
        return DIMENSION_WEIGHTS;
      }
      return merged;
    })();

The autoresearch loop writes partial weight overrides to packages/scoring/src/dimensions/adaptive-weights-override.ts when it promotes a scoring-domain optimization. The runtime hook merges the override with the base weights and validates the sum before applying it. An invalid override (merged sum ≠ 1.0 within the 0.001 tolerance) is rejected with a warning log rather than an exception: the scoring system falls back to the static base weights instead of producing incorrect scores.
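One consequence of validating the merged sum is that a partial override must be zero-sum: any weight it adds to one dimension has to be removed from another. The sketch below restates the merge as a pure function so that both outcomes can be shown. `resolveWeights` and the three-dimension base map are illustrative assumptions, not the production values.

```typescript
type Weights = Record<string, number>;

// Hypothetical pure version of the merge-and-validate step, callable with
// different overrides instead of running once at module load.
function resolveWeights(base: Weights, override: Partial<Weights> | null): Weights {
  if (override === null) return base;
  const merged = { ...base, ...override } as Weights;
  const sum = Object.values(merged).reduce((a, b) => a + b, 0);
  return Math.abs(sum - 1.0) > 0.001 ? base : merged;
}

// Toy base weights for illustration only.
const baseWeights: Weights = { accuracy: 0.5, reliability: 0.3, safety: 0.2 };

// Rebalanced override: +0.05 on safety is offset by -0.05 on accuracy.
const acceptedWeights = resolveWeights(baseWeights, { accuracy: 0.45, safety: 0.25 });

// Unbalanced override: the merged sum is 1.05, so the base weights come back.
const fallbackWeights = resolveWeights(baseWeights, { safety: 0.25 });
```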
5. Composite Computation
The composite score is computed as:
1. Run each dimension's computation function against the agent's eval results and operational data
2. Exclude null dimensions from normalization
3. Compute weighted average over covered dimensions, normalized so covered weights sum to 1.0
4. Apply time decay to each dimension's contribution based on evidence age
5. Apply security gate (security dimension can veto certification regardless of composite)
6. Map to 0–1000 scale and determine certification tier
The security gate (applySecurityGate in packages/scoring/src/security-dimension.ts) is a hard override: if the security dimension score falls below a threshold, the composite score is capped or the certification tier is blocked regardless of how well the agent scores on other dimensions. This prevents an agent with serious security vulnerabilities from achieving platinum certification on the basis of high accuracy scores.
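The text does not give the signature of applySecurityGate, the threshold, or the cap, so the sketch below takes all three as parameters and should be read as one possible shape for a hard override, not as the implementation. It also assumes a normalized composite in [0, 1], applies both the cap and the tier block (the text says one or the other), and folds the mapping to the 0–1000 scale into the same function for brevity.

```typescript
interface GateResult {
  score: number;        // composite on the 0-1000 scale, after any cap
  tierBlocked: boolean; // true when the security gate vetoes certification
}

// Hypothetical sketch: threshold and cappedScore are caller-supplied because
// the real values are not stated in the text.
function securityGateSketch(
  composite01: number,   // normalized weighted average in [0, 1]
  securityScore: number, // security dimension score in [0, 1]
  threshold: number,
  cappedScore: number,
): GateResult {
  const clamped = Math.min(1, Math.max(0, composite01));
  const scaled = Math.round(clamped * 1000);
  if (securityScore < threshold) {
    return { score: Math.min(scaled, cappedScore), tierBlocked: true };
  }
  return { score: scaled, tierBlocked: false };
}

// Strong composite, weak security: the gate caps the score and blocks the tier.
const gatedResult = securityGateSketch(0.91, 0.35, 0.5, 600);
// gatedResult = { score: 600, tierBlocked: true }

// Same composite with adequate security passes through unchanged.
const passedResult = securityGateSketch(0.91, 0.8, 0.5, 600);
// passedResult = { score: 910, tierBlocked: false }
```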
6. Score Semantics
A composite score of 750 means: the weighted average of this agent's behavioral evidence across its covered dimensions, after time decay, with outlier filtering, maps to 750 on the 0–1000 scale. It does not mean:
- The agent is correct 75% of the time
- The agent is safer than a score-500 agent in proportion to the score difference
- The agent will continue to score 750 in the future
The score is a *historical* behavioral summary, not a *predictive* capability rating. Its reliability as a predictor of future behavior depends on the behavioral consistency of the agent, which is itself one of the 16 dimensions it measures.
Replication
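All weights documented here are read from packages/scoring/src/composite.ts; the anti-gaming, adaptive-override, and security-gate behavior is read from the files cited in the corresponding sections. These files are committed to the repository, and the line references above refer to the committed version.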