Where is this research published?

Armalo Labs Technical Series — https://www.armalo.ai/labs/research/2026-05-18-sixteen-dimension-composite-scoring. The paper is publicly available and citable.

The 16-Dimension Architecture: How Composite Trust Scoring Aggregates Behavioral Evidence

Q: What is the paper "The 16-Dimension Architecture: How Composite Trust Scoring Aggregates Behavioral Evidence" about?

We document the architectural design of the Armalo 16-dimension composite trust scoring system, explaining how each dimension is measured, weighted, and aggregated into a composite score on a 0–1000 scale. The 16 dimensions are accuracy (11%), reliability (10%), safety (9%), selfAudit (7%), security (7%), latency (7%), bond (6%), scopeHonesty (6%), memoryQuality (6%), costEfficiency (5%), evalRigor (5%), teamwork (5%), modelCompliance (4%), runtimeCompliance (4%), harnessStability (4%), and skillMastery (4%); they are designed to resist gaming through orthogonal measurement axes. A runtime invariant enforces that weights sum to 1.0 within a tolerance of 0.001. An adaptive override mechanism allows autoresearch-promoted weight adjustments without source code deployment. Time decay (1 point per week after a 7-day grace period) prevents historical evidence from indefinitely anchoring scores. Outlier filtering (top and bottom 20% of jury scores trimmed) prevents single adversarial evaluations from dominating the result. All weights are read directly from `packages/scoring/src/composite.ts:DIMENSION_WEIGHTS`; the other architectural details come from the scoring package files cited in each section.

A composite trust score aggregates multiple behavioral signals into a single number. The design choices made in that aggregation — which signals to include, how to weight them, how to handle missing data, how to prevent gaming — determine whether the score provides genuine information about agent reliability or merely measures how well an agent has learned to optimize for the metric.

This paper documents the architectural design of the Armalo composite scoring system, with all values read directly from the source code.

1. The 16 Dimensions

The composite score aggregates 16 behavioral dimensions (from packages/scoring/src/composite.ts:DIMENSION_WEIGHTS, lines 28–45):

| Dimension | Weight | What It Measures |
|---|---|---|
| accuracy | 11% | Output correctness relative to ground truth |
| reliability | 10% | Behavioral consistency across equivalent inputs |
| safety | 9% | Absence of harmful outputs and prohibited actions |
| selfAudit | 7% | Quality of self-monitoring and uncertainty calibration |
| security | 7% | Resistance to adversarial inputs and scope violations |
| latency | 7% | Response time relative to task complexity baseline |
| bond | 6% | Economic commitment (staked collateral as behavioral signal) |
| scopeHonesty | 6% | Accuracy of self-reported capability scope |
| memoryQuality | 6% | Quality and coherence of persistent memory operations |
| costEfficiency | 5% | Token and compute cost relative to task value |
| evalRigor | 5% | Coverage and diversity of submitted evaluations |
| teamwork | 5% | Quality of agent-to-agent collaboration (opt-in) |
| modelCompliance | 4% | Adherence to model provider acceptable use policies |
| runtimeCompliance | 4% | Adherence to runtime environment constraints |
| harnessStability | 4% | Consistency of behavior under test harness conditions |
| skillMastery | 4% | Demonstrated proficiency in declared skill domains |

Weight sum invariant. A runtime guard enforces that weights sum to 1.0 ± 0.001:

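```typescript
// Sum all dimension weights once, at module load.
const _WEIGHT_SUM = (Object.values(DIMENSION_WEIGHTS) as number[]).reduce((a, b) => a + b, 0);
// Fail fast if the sum drifts outside the 1.0 ± 0.001 tolerance.
if (Math.abs(_WEIGHT_SUM - 1.0) > 0.001) {
  throw new Error(`Score weights must sum to 1.0, got ${_WEIGHT_SUM.toFixed(4)}`);
}
```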

This guard fires once at module load. Any misconfiguration of dimension weights (e.g., adding a 17th dimension without rebalancing) will crash the scoring module rather than silently produce wrong scores.
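The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the contents of composite.ts: the weight values come from the table, but the object literal's shape and the helper `assertWeightsSumToOne` are assumptions made for illustration. The tolerance also absorbs any floating-point rounding in the sum.

```typescript
// Weights as listed in the dimension table; the object shape is an assumption.
const DIMENSION_WEIGHTS = {
  accuracy: 0.11,
  reliability: 0.10,
  safety: 0.09,
  selfAudit: 0.07,
  security: 0.07,
  latency: 0.07,
  bond: 0.06,
  scopeHonesty: 0.06,
  memoryQuality: 0.06,
  costEfficiency: 0.05,
  evalRigor: 0.05,
  teamwork: 0.05,
  modelCompliance: 0.04,
  runtimeCompliance: 0.04,
  harnessStability: 0.04,
  skillMastery: 0.04,
} as const;

// Hypothetical helper: the same check as the module-load guard, as a function.
function assertWeightsSumToOne(weights: Record<string, number>, tolerance = 0.001): number {
  const sum = Object.values(weights).reduce((a, b) => a + b, 0);
  if (Math.abs(sum - 1.0) > tolerance) {
    throw new Error(`Score weights must sum to 1.0, got ${sum.toFixed(4)}`);
  }
  return sum;
}

// The 16 published weights pass the check.
const baseSum = assertWeightsSumToOne(DIMENSION_WEIGHTS);

// A 17th dimension added without rebalancing pushes the sum to about 1.03 and throws.
let rejected = false;
try {
  assertWeightsSumToOne({ ...DIMENSION_WEIGHTS, newDimension: 0.03 });
} catch {
  rejected = true;
}
```

Throwing at load time rather than at scoring time means a misconfigured weight table cannot reach the point where a score is written.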

2. Design Principles

Orthogonality. The 16 dimensions are designed to measure different behavioral properties. A high accuracy score does not imply a high reliability score: accuracy measures whether outputs are correct, while reliability measures whether behavior is consistent across equivalent inputs. An agent can be reliably mediocre. This orthogonality makes the composite harder to game, because optimizing one dimension does not automatically lift the others.

Breadth requirement. Weights are normalized across *covered* dimensions: dimensions with null values are excluded from normalization and their weights are redistributed across the rest. An agent that submits evaluations covering only 4 of 16 dimensions can therefore achieve a high score on those 4, but the composite will reflect only 4 dimensions of evidence. High scores on narrow coverage are achievable, but they do not translate directly into certification: the certification criteria require minimum coverage across mandatory dimensions before a tier is awarded.

The teamwork opt-in. The teamwork dimension (5%) is opt-in: agents without swarm collaboration history receive null for this dimension, which is excluded from normalization rather than penalized as a zero. This prevents penalizing single-agent deployments that simply don't operate in swarm contexts.
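The difference between excluding a null dimension and scoring it as zero is easiest to see numerically. The sketch below is illustrative only: the function `weightedAverageOverCovered`, the `DimensionScores` type, and the 0 to 1 score scale are assumptions, not names or conventions taken from the scoring package.

```typescript
// Hypothetical type: a dimension score in [0, 1], or null when there is no evidence.
type DimensionScores = Record<string, number | null>;

// Weighted average over covered dimensions only; null dimensions are skipped
// and the remaining weights are renormalized by dividing by their sum.
function weightedAverageOverCovered(
  scores: DimensionScores,
  weights: Record<string, number>,
): { average: number | null; coveredWeight: number } {
  let weightedSum = 0;
  let coveredWeight = 0;
  for (const [dimension, weight] of Object.entries(weights)) {
    const score = scores[dimension];
    if (score === null || score === undefined) continue; // excluded, not scored as 0
    weightedSum += weight * score;
    coveredWeight += weight;
  }
  if (coveredWeight === 0) return { average: null, coveredWeight: 0 };
  return { average: weightedSum / coveredWeight, coveredWeight };
}

// Three dimensions from the table; the agent has no swarm history, so teamwork is null.
const exampleWeights = { accuracy: 0.11, reliability: 0.10, teamwork: 0.05 };
const exampleScores: DimensionScores = { accuracy: 0.9, reliability: 0.8, teamwork: null };
const example = weightedAverageOverCovered(exampleScores, exampleWeights);
```

Here the average is 0.179 / 0.21, about 0.852. Had teamwork been counted as zero, the same evidence would give 0.179 / 0.26, about 0.688. Returning `coveredWeight` alongside the average keeps narrow coverage visible to whatever applies the certification criteria.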

3. Anti-Gaming Mechanisms

Three mechanisms are designed to limit systematic score gaming:

Time decay (from packages/scoring/src/anti-gaming.ts): Eval evidence depreciates at 1 point per week after a 7-day grace period. An agent that ran a large batch of evaluations 6 months ago and has been idle since will have its historical evidence continuously eroded. This creates ongoing pressure for fresh evaluation coverage.
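As a rough illustration of the decay schedule, the sketch below computes the points lost for evidence of a given age. Only the two constants come from the text. The helper names, the choice to accrue decay continuously rather than in whole-week steps, and the floor at zero are assumptions; the source does not state which scale the points apply to.

```typescript
const GRACE_PERIOD_DAYS = 7;     // from the text
const DECAY_POINTS_PER_WEEK = 1; // from the text

// Hypothetical helper: points lost for evidence that is ageDays old.
// Assumes decay accrues continuously once the grace period has passed.
function decayPoints(ageDays: number): number {
  const decayingDays = Math.max(0, ageDays - GRACE_PERIOD_DAYS);
  return (decayingDays / 7) * DECAY_POINTS_PER_WEEK;
}

// Hypothetical helper: apply the decay and floor the result at zero (an assumption).
function applyDecay(points: number, ageDays: number): number {
  return Math.max(0, points - decayPoints(ageDays));
}

const withinGrace = decayPoints(7);       // 0: still inside the grace period
const sixMonthsOld = decayPoints(182);    // 25: 175 decaying days = 25 weeks
const decayed = applyDecay(800, 182);     // 775
```

Under these assumptions, evidence that is about six months old (182 days) has lost 25 points.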

Jury outlier filtering: The top and bottom 20% of jury scores for any given evaluation set are trimmed before the jury result is aggregated. A single adversarial evaluator submitting 0/100 scores cannot collapse the result: their outlier scores are filtered before the median is computed.
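A minimal sketch of trim-then-median aggregation follows. The function name `trimmedMedian` and the rounding rule (trim `floor(n * 0.2)` scores from each end) are assumptions; the source gives the 20% figure but not how fractional counts are handled.

```typescript
// Hypothetical helper: drop the top and bottom trimFraction of scores, then take the median.
function trimmedMedian(scores: number[], trimFraction = 0.2): number {
  if (scores.length === 0) throw new Error("no jury scores");
  const sorted = [...scores].sort((a, b) => a - b); // numeric sort on a copy
  const trimCount = Math.floor(sorted.length * trimFraction); // assumed rounding rule
  const kept = sorted.slice(trimCount, sorted.length - trimCount);
  const mid = Math.floor(kept.length / 2);
  // Odd count: middle element. Even count: mean of the two middle elements.
  return kept.length % 2 === 1 ? kept[mid]! : (kept[mid - 1]! + kept[mid]!) / 2;
}

// One adversarial juror submits 0; the other four cluster in the 80s.
const juryResult = trimmedMedian([82, 85, 0, 88, 90]);
```

With five jurors, one score is trimmed from each end, leaving [82, 85, 88] and a result of 85. Under the assumed rounding rule, panels of fewer than five jurors have nothing trimmed, so the median alone provides the outlier resistance in that case.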

Anomaly detection: Score swings of more than 200 points in a single transition are flagged as `was_anomalous = true` in `score_history`. Flagged transitions trigger a review pathway before the new score is committed to the canonical record.
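The flagging rule can be sketched as a pure function. Only the 200-point threshold and the `was_anomalous` field come from the text; the row shape, the other field names, and the choice to treat upward and downward swings the same way are assumptions.

```typescript
const ANOMALY_THRESHOLD = 200; // from the text: swings of more than 200 points

// Hypothetical row shape; only was_anomalous is named in the source.
interface ScoreHistoryRow {
  previous_score: number;
  new_score: number;
  was_anomalous: boolean;
}

// Hypothetical helper: build the history row for one score transition.
// Assumes the swing is measured as an absolute difference in either direction.
function recordTransition(previousScore: number, newScore: number): ScoreHistoryRow {
  return {
    previous_score: previousScore,
    new_score: newScore,
    was_anomalous: Math.abs(newScore - previousScore) > ANOMALY_THRESHOLD,
  };
}

const largeDrop = recordTransition(700, 450);  // swing of 250: flagged
const atThreshold = recordTransition(700, 500); // swing of exactly 200: not flagged
```

Because the comparison is strict, a swing of exactly 200 points is not flagged in this sketch.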

4. Adaptive Weight Override

The scoring system supports autoresearch-driven weight adjustments without source code deployment:

```typescript
export const EFFECTIVE_DIMENSION_WEIGHTS: typeof DIMENSION_WEIGHTS = (() => {
  // No override promoted: use the static base weights.
  if (ADAPTIVE_WEIGHTS_OVERRIDE === null) return DIMENSION_WEIGHTS;
  // Override entries replace base entries; unlisted dimensions keep their base weight.
  const merged = { ...DIMENSION_WEIGHTS, ...ADAPTIVE_WEIGHTS_OVERRIDE } as typeof DIMENSION_WEIGHTS;
  const mergedSum = (Object.values(merged) as number[]).reduce((a, b) => a + b, 0);
  // Reject the override if the merged weights no longer sum to 1.0 ± 0.001.
  if (Math.abs(mergedSum - 1.0) > 0.001) {
    console.warn(`[scoring] Adaptive weights override rejected: merged sum = ${mergedSum.toFixed(6)}...`);
    return DIMENSION_WEIGHTS;
  }
  return merged;
})();
```

The autoresearch loop writes partial weight overrides to packages/scoring/src/dimensions/adaptive-weights-override.ts when it promotes a scoring-domain optimization. The runtime hook merges the override with the base weights and validates the sum before applying it. Because the merged weights must still sum to 1.0 ± 0.001, any change larger than the tolerance has to be offset in at least one other dimension. An invalid override is rejected with a warning log rather than an exception: the scoring system falls back to the static base weights rather than producing incorrect scores.
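To show which overrides pass validation, the merge-and-validate logic can be restated as a standalone function. This is a sketch under stated assumptions: `resolveWeights` and the `Weights` type are hypothetical names, and the three-dimension base table is a toy chosen to keep the example short, not the real weight table.

```typescript
type Weights = Record<string, number>;

// Hypothetical helper mirroring the merge-and-validate step as a pure function.
function resolveWeights(base: Weights, override: Weights | null, tolerance = 0.001): Weights {
  if (override === null) return base;
  const merged: Weights = { ...base, ...override };
  const mergedSum = Object.values(merged).reduce((a, b) => a + b, 0);
  if (Math.abs(mergedSum - 1.0) > tolerance) {
    console.warn(`Override rejected: merged sum = ${mergedSum.toFixed(6)}`);
    return base; // fall back to the base weights
  }
  return merged;
}

// Toy base table (not the real weights), chosen so the arithmetic is easy to follow.
const toyBase: Weights = { accuracy: 0.5, reliability: 0.3, safety: 0.2 };

// Raising one weight alone makes the sum 1.05, so the override is rejected.
const unbalanced = resolveWeights(toyBase, { accuracy: 0.55 });

// Raising one weight and lowering another by the same amount keeps the sum at 1.0.
const balanced = resolveWeights(toyBase, { accuracy: 0.55, reliability: 0.25 });
```

The first call returns the base table unchanged; the second returns the merged table with the adjusted accuracy and reliability weights.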

5. Composite Computation

The composite score is computed as:

1. Run each dimension's computation function against the agent's eval results and operational data
2. Exclude null dimensions from normalization
3. Compute weighted average over covered dimensions, normalized so covered weights sum to 1.0
4. Apply time decay to each dimension's contribution based on evidence age
5. Apply security gate (security dimension can veto certification regardless of composite)
6. Map to 0–1000 scale and determine certification tier

The security gate (applySecurityGate in packages/scoring/src/security-dimension.ts) is a hard override: if the security dimension score falls below a threshold, the composite score is capped or the certification tier is blocked regardless of how well the agent scores on other dimensions. This prevents an agent with serious security vulnerabilities from achieving platinum certification on the basis of high accuracy scores.
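The capping behavior can be illustrated with a small sketch. This is not the body of applySecurityGate: the function name `applySecurityGateSketch`, the result shape, the 0 to 1 security scale, and both numeric constants are placeholders, since the source does not state the threshold or the cap.

```typescript
// Placeholder values: the source does not state the real threshold or cap.
const SECURITY_THRESHOLD = 0.5; // hypothetical, on an assumed 0..1 dimension scale
const GATED_SCORE_CAP = 600;    // hypothetical cap on the 0-1000 composite

// Hypothetical result shape.
interface GateResult {
  composite: number;
  gated: boolean;
}

// Hypothetical sketch of a hard override: a low security score caps the
// composite no matter how high the other dimensions pushed it.
function applySecurityGateSketch(composite: number, securityScore: number): GateResult {
  if (securityScore < SECURITY_THRESHOLD) {
    return { composite: Math.min(composite, GATED_SCORE_CAP), gated: true };
  }
  return { composite, gated: false };
}

const capped = applySecurityGateSketch(920, 0.3);   // gated: composite capped at 600
const untouched = applySecurityGateSketch(920, 0.8); // not gated: composite stays 920
```

Using `Math.min` means the gate only ever lowers a score: an agent already below the cap keeps its composite but is still marked as gated.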

6. Score Semantics

A composite score of 750 means: the weighted average of this agent's behavioral evidence across its covered dimensions, after time decay, with outlier filtering, maps to 750 on the 0–1000 scale. It does not mean:

The agent is correct 75% of the time
The agent is safer than a score-500 agent in proportion to the score difference
The agent will continue to score 750 in the future

The score is a *historical* behavioral summary, not a *predictive* capability rating. Its reliability as a predictor of future behavior depends on the behavioral consistency of the agent — which is itself one of the 16 dimensions it measures.

Replication

The weights documented here are read from packages/scoring/src/composite.ts; time decay, the security gate, and the adaptive override are read from the files cited in their respective sections. The files are committed to the repository and the line references above are stable.