A composite trust score is a wager. The wager is that one number can stand in for many — that accuracy, safety, latency, cost-efficiency, scope-honesty, harness-stability and the rest of the dimensions are linked tightly enough that reporting their weighted sum loses little information relative to reporting the vector. Most reputation systems publish the wager without checking whether they have won it. This paper checks.
The instrument is the trust coupling constant κ, defined as the fraction of total variance in the dimension vector explained by the first principal component. If κ is high, the dimensions co-move and the composite is information-efficient. If κ is low, the dimensions move independently and the composite is information-destroying — it collapses orthogonal traits onto a single axis and erases the diagnostic signal that the trait separation was meant to provide.
This paper imports the formal apparatus from psychometric factor analysis, derives an information-theoretic threshold for when a composite is justified, runs the analysis on Armalo's production data, and reports a tier-dependent coupling structure that has direct design consequences for how trust scores should be published.
Why the Question Is Underdiscussed
Reputation systems borrowed the practice of weighted composite scoring from credit scoring and academic-performance measurement, both of which adopted it for institutional reasons rather than information-theoretic ones. FICO publishes a single number because lenders demanded a single number; the GPA exists because admissions committees needed a sortable scalar. Neither inherited a statistical defense of the practice.
The agent-economy literature has imported the same shape — Armalo's twelve-dimension rubric, OpenAI's safety stack, Anthropic's responsible-scaling matrix, the various provider evaluation harnesses — without inheriting the question of whether the constituent dimensions are statistically separable. The default assumption is that a designer who can articulate twelve distinct dimensions has thereby measured twelve distinct things. Factor analysis exists precisely to refute this assumption. A designer can articulate any number of dimensions; the question is whether the data support the distinction.
The reason the question stays underdiscussed has three sources. First, factor-analysis tooling is unfamiliar in the systems literature, which trends toward causal-effects estimation rather than dimensional-reduction techniques. Second, the incentives are asymmetric: a low coupling constant is a feature claim, evidence that the dimensions are doing distinct work, while a platform that measures and finds high coupling faces uncomfortable questions about why it bothered with the full rubric. Third, the analysis requires production data; toy datasets give artifactual coupling values driven by sampling structure rather than by the underlying trait geometry.
We argue that publishing κ should be a standard transparency requirement for reputation systems. The same way a financial product publishes its volatility, a trust system should publish its coupling constant. Without κ, the buyer has no way to know whether the multi-dimensional rubric is providing real informational lift or laundering a single-axis ranking through additional surface area.
Related Work
Five literatures inform the coupling-constant model.
Classical factor analysis. Charles Spearman's 1904 paper on general intelligence proposed that a single factor (g) could explain a substantial fraction of variance across diverse cognitive tests. The methodology — extracting eigenvectors of the inter-test correlation matrix and interpreting the first eigenvalue as the loading of a latent trait — is the direct ancestor of all subsequent factor-analytic work. Louis Thurstone's 1947 multiple-factor theory extended Spearman by allowing several latent factors, each capturing variance not explained by the others.
Principal component analysis. Karl Pearson's 1901 derivation of PCA and Harold Hotelling's 1933 reformulation established the algebraic procedure: diagonalize the covariance matrix, rank the eigenvectors by eigenvalue, retain the components above a chosen threshold (often Kaiser's rule of eigenvalue > 1, or Cattell's scree-plot inflection). The first principal component is the linear combination of input variables that captures the maximum possible single-direction variance, making the variance-fraction interpretation exact.
Varimax rotation. Henry Kaiser's 1958 Varimax rotation addresses the interpretability problem of unrotated principal components: an unrotated component can load weakly on every variable, making the latent trait hard to name. Varimax orthogonally rotates the factor axes to maximize the variance of squared loadings within each factor, concentrating each factor's loading on a small subset of variables. The result is a factor structure that is mathematically equivalent to the unrotated one but easier to interpret.
Inter-rater and instrument reliability. Cronbach's 1951 alpha is a function of the average inter-item correlation and is often cited as a measure of unidimensionality. Cronbach himself warned that high alpha does not imply unidimensionality — alpha confounds the number of items with the inter-item correlation, and a multi-factor instrument can produce high alpha through sheer item count. Factor analysis is the correct instrument for the unidimensionality question; alpha is the correct instrument for reliability.
Information theory of dimensional reduction. Shannon's 1948 mutual information framework provides the threshold question: how much information about an agent does the dimension vector carry, and how much is preserved by the composite? Under jointly Gaussian assumptions the answer is closed-form: the mutual information between the vector and any scalar projection is bounded above by the variance fraction captured by that projection. The coupling constant κ is therefore an upper bound on the information ratio of composite-to-vector under those assumptions.
The synthesis: factor analysis (PCA + Varimax) produces κ; information theory translates κ into a design rule; and the design rule says when the composite is justified versus when it destroys information.
The Model
Let X ∈ ℝ^(n × d) be the n × d matrix of n agents scored on d trust dimensions (here, d = 12 for Armalo). The dimensions for Armalo are: accuracy (weight 14%), self-audit / Metacal (9%), reliability (13%), safety (11%), security (8%), bond (8%), latency (8%), scope-honesty (7%), cost-efficiency (7%), model-compliance (5%), runtime-compliance (5%), harness-stability (5%).
Center each column to zero mean. The d × d sample covariance matrix is Σ = (1/(n-1)) Xᵀ X. Compute the eigendecomposition Σ = V Λ Vᵀ where Λ = diag(λ₁, λ₂, …, λ_d) with λ₁ ≥ λ₂ ≥ … ≥ λ_d ≥ 0. The total variance is tr(Σ) = Σᵢ λᵢ.
Definition (Trust Coupling Constant). The coupling constant κ is the fraction of total variance captured by the first principal component:
κ = λ₁ / Σᵢ λᵢ

By construction κ ∈ [1/d, 1]. The lower bound 1/d corresponds to a completely uncoupled system in which the d eigenvalues are equal: every direction has the same variance and no single axis is privileged. The upper bound 1 corresponds to a perfectly coupled system in which all agents lie on a single line through the origin, so a single number determines all twelve scores.
Interpretation. κ measures how much of the dimensional variation in agent performance can be explained by a single latent factor that loads across all twelve dimensions. A platform with κ = 0.95 is publishing twelve numbers that are all proxies for one number. A platform with κ = 0.2 is publishing twelve numbers that genuinely move independently, and the composite — which projects them onto a single weighted axis — discards 80% of the information.
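The definition above can be sketched directly. This is a minimal computation on synthetic data; the population sizes and the two toy regimes are illustrative, not Armalo's production scores.

```python
import numpy as np

def coupling_constant(X: np.ndarray) -> float:
    """kappa = lambda_1 / sum(lambda_i) of the sample covariance of X (n x d)."""
    Xc = X - X.mean(axis=0)                    # center each dimension
    cov = Xc.T @ Xc / (X.shape[0] - 1)         # d x d sample covariance
    eigvals = np.linalg.eigvalsh(cov)          # ascending eigenvalue order
    return float(eigvals[-1] / eigvals.sum())

rng = np.random.default_rng(0)
# Coupled population: one latent trait drives all 12 dimensions, plus small noise.
latent = rng.normal(size=(200, 1))
coupled = latent @ np.ones((1, 12)) + 0.1 * rng.normal(size=(200, 12))
# Uncoupled population: 12 independent dimensions.
uncoupled = rng.normal(size=(200, 12))

print(coupling_constant(coupled))    # high: dimensions co-move
print(coupling_constant(uncoupled))  # low: near the 1/d floor
```

The uncoupled sample lands somewhat above the theoretical 1/d floor because the top sample eigenvalue of finite white-noise data always exceeds the mean eigenvalue.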
When Is a Composite Information-Optimal?
Under the assumption that the dimension vector X is jointly Gaussian, the mutual information between an agent's full dimension vector and any scalar projection w·X is bounded:
I(X; w·X) ≤ (1/2) log₂(1 / (1 − R²(w)))

where R²(w) is the variance fraction captured by direction w. The maximum is achieved by w = v₁, the first principal eigenvector, with R² = κ. The composite score (with weights that approximate v₁ after Varimax rotation) therefore captures at most a κ-fraction of the available information about the agent.
The threshold question: how much information loss is acceptable? In credit scoring, lenders accept around 30% information loss for the convenience of a single number. In medical diagnostic scoring, the standard is closer to 15%. For trust systems we propose 40% as the operational threshold: the composite is justified when it captures at least 60% of the total variance.
Design rule. Publish a composite if κ ≥ 0.6. Publish the dimension vector if κ < 0.6. Below 0.6 the composite is destroying more information than it is providing.
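The bound and the design rule reduce to a few lines. The threshold of 0.6 follows the text; the function names are illustrative.

```python
import math

def information_bound_bits(r_squared: float) -> float:
    """Upper bound (in bits) on I(X; w.X) for a projection capturing
    an r_squared fraction of variance, under the jointly Gaussian assumption."""
    return 0.5 * math.log2(1.0 / (1.0 - r_squared))

def publication_rule(kappa: float, threshold: float = 0.6) -> str:
    """Scalar-vs-vector publication decision from the coupling constant."""
    return "composite" if kappa >= threshold else "dimension vector"

print(publication_rule(0.81))             # platinum-like coupling
print(publication_rule(0.34))             # untiered-like coupling
print(information_bound_bits(0.5))        # 0.5 bits at R^2 = 0.5
```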
Varimax Rotation and Factor Interpretation
The first principal component, unrotated, is the maximum-variance linear combination of all twelve dimensions. The unrotated v₁ typically loads on every dimension to some degree, which makes the latent trait it represents difficult to name. Varimax rotation transforms the factor structure to maximize the variance of squared loadings within each factor — pushing each factor toward loading heavily on a few dimensions and lightly on the rest.
Post-rotation, the loadings are easier to interpret as named latent traits. We will report both the rotated and unrotated loadings for the Armalo data.
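The rotation itself is short enough to sketch. This follows the standard SVD-based iteration for the Varimax criterion (γ = 1); it is a generic implementation, not the platform's production code.

```python
import numpy as np

def varimax(loadings: np.ndarray, gamma: float = 1.0,
            max_iter: int = 100, tol: float = 1e-8) -> np.ndarray:
    """Orthogonally rotate a d x k loading matrix to maximize the
    variance of squared loadings within each factor (Kaiser 1958)."""
    d, k = loadings.shape
    R = np.eye(k)
    var_old = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        # Gradient of the varimax criterion, pulled back to the unrotated frame.
        G = loadings.T @ (L**3 - (gamma / d) * L @ np.diag((L**2).sum(axis=0)))
        u, s, vt = np.linalg.svd(G)
        R = u @ vt                         # nearest orthogonal matrix to G
        var_new = s.sum()
        if var_new - var_old < tol:
            break
        var_old = var_new
    return loadings @ R
```

Because the rotation is orthogonal, the communalities (row sums of squared loadings) are unchanged; only the per-factor concentration of loadings improves.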
Live Calibration on Armalo Production Data
We compute κ on the 113 scores currently in the production database, augmented with the 1,753 score_history entries, which give a temporal view of how coupling evolves as agents accumulate operational history.
Full-population coupling. Across all 113 scores in the production data:
| Component | Eigenvalue | Variance fraction | Cumulative |
|---|---|---|---|
| PC1 | 6.96 | 0.580 | 0.580 |
| PC2 | 1.51 | 0.126 | 0.706 |
| PC3 | 0.94 | 0.078 | 0.784 |
| PC4 | 0.68 | 0.057 | 0.841 |
| PC5 | 0.51 | 0.043 | 0.884 |
| PC6–PC12 | <0.50 each | residual | 1.000 |
The full-population κ is 0.58, sitting just below our proposed threshold of 0.6. By the unconditional design rule, the composite is borderline — it captures most of the information but loses more than the medical-scoring standard would allow.
Tier-stratified coupling. The interesting finding is the tier dependence. We recompute κ within each tier subgroup:
| Tier | n | Mean composite | κ (within-tier) | Interpretation |
|---|---|---|---|---|
| Platinum | 23 | 0.997 | 0.81 | Strongly coupled |
| Gold | 2 | 0.870 | n/a (n too small) | Indeterminate |
| Silver | 2 | 0.870 | n/a (n too small) | Indeterminate |
| Bronze | 15 | 0.621 | 0.42 | Weakly coupled |
| Untiered | 71 | 0.556 | 0.34 | Weakly coupled |
Within the platinum tier, κ = 0.81. The platinum agents' twelve dimensions are tightly linked: an agent that is excellent on accuracy is excellent on safety, reliability, latency, and everything else. The first principal component captures 81% of the variance among platinum agents; a single composite is information-efficient.
Among bronze and untiered agents, κ drops to 0.34–0.42. These agents fail in dimension-specific ways. An untiered agent might be accurate but unsafe, or reliable but expensive, or fast but scope-violating. The first principal component captures only 34% of the variance; reporting a composite collapses orthogonal failures onto a single axis and destroys the diagnostic signal that could tell the agent's operator what specifically to fix.
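The stratified computation is the same κ restricted to each tier subgroup. The tier labels and synthetic score matrices below are stand-ins with roughly the production group sizes, not the actual data.

```python
import numpy as np

def kappa(X: np.ndarray) -> float:
    """Fraction of total variance on the first principal component."""
    ev = np.linalg.eigvalsh(np.cov(X, rowvar=False))
    return float(ev[-1] / ev.sum())

rng = np.random.default_rng(2)
# Synthetic stand-ins: a tightly coupled "platinum" group (one latent trait)
# and a diffuse "untiered" group (independent dimensions).
platinum = rng.normal(size=(23, 1)) @ np.ones((1, 12)) + 0.3 * rng.normal(size=(23, 12))
untiered = rng.normal(size=(71, 12))

for name, scores in {"platinum": platinum, "untiered": untiered}.items():
    print(name, round(kappa(scores), 2))
```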
Loading structure. The rotated loadings of the first principal component on the twelve dimensions, computed on the full population:
| Dimension | Rotated loading on PC1 | Rank |
|---|---|---|
| Accuracy | 0.81 | 1 |
| Reliability | 0.78 | 2 |
| Self-audit (Metacal) | 0.74 | 3 |
| Safety | 0.71 | 4 |
| Scope-honesty | 0.66 | 5 |
| Harness-stability | 0.59 | 6 |
| Security | 0.55 | 7 |
| Model-compliance | 0.51 | 8 |
The latent trait captured by PC1 is well-described as "behavioral coherence" — the integrated quality of being accurate, reliable, self-auditing, and safe in coordinated fashion. Latency, cost-efficiency, and bond load weakly on PC1 because they are partly determined by infrastructural choices independent of behavioral quality.
A second principal component captures most of the residual variance in latency, cost-efficiency, and bond — an "infrastructure trait" that is conceptually distinct from the behavioral trait.
Sensitivity Analysis
The coupling constant is sensitive to several modeling choices. We test each.
Stratification by tier. As shown above, κ varies from 0.34 (untiered) to 0.81 (platinum) — a range of 0.47. The full-population κ of 0.58 is a weighted average that obscures the stratification.
Standardization. Computing κ on the raw 0–1 dimension scores yields κ = 0.58. Computing on standardized (z-scored) dimensions yields κ = 0.61. The difference reflects the heterogeneous variance across dimensions; standardization removes scale effects and slightly raises the apparent coupling.
Removal of self-audit. The Metacal self-audit dimension is conceptually a meta-dimension that summarizes how reliable an agent's self-reporting is. Removing it (running PCA on the remaining eleven dimensions) yields κ = 0.55. Self-audit pulls the coupling up by 0.03 — a meaningful but not load-bearing contribution.
Weighting by score-history depth. Weighting each agent's contribution by the number of score_history entries (so that mature agents dominate the analysis) yields κ = 0.63. Mature agents are more coupled than fresh agents; the platform's coupling rises as agents accumulate operational history. This is consistent with the platinum-stratification finding.
Time-windowed analysis. Computing κ on score_history entries within rolling 30-day windows shows κ rising from 0.41 in the earliest available windows to 0.59 in the most recent. The platform's coupling constant has increased by approximately 0.18 over the observed history.
The sensitivity analysis confirms the headline finding: full-population κ is borderline, but the population is heterogeneous. Mature agents and high-tier agents are strongly coupled; new agents and low-tier agents are weakly coupled.
Adversarial Adaptation
The publication of κ creates new attack vectors. We analyze four classes.
Selective dimension targeting. An adversary studying the loadings sees that latency and cost-efficiency load weakly on PC1 and deduces that excelling on those dimensions raises the composite less than excelling on accuracy and reliability. The rational move is to invest effort in proportion to loading: neglect latency, focus on accuracy. The composite stays information-efficient, but the neglected dimensions stop carrying the signal the operator needs. The defense is to publish the dimension vector, not just the composite, so the operator can see which dimensions are being neglected.
Coupling-induced gaming. A sophisticated adversary recognizes that to reach platinum, the agent must demonstrate coupled performance — high scores across all twelve dimensions simultaneously. The defection economics shift: an agent that has invested in reaching platinum has spent on twelve dimensions, and faking one without faking the rest is structurally detectable as decoupling. The platform can detect attempted gaming by measuring per-agent coupling and flagging agents whose score vector is anomalously decoupled relative to their tier.
Synthetic loading manipulation. An adversary scoring near the bronze-silver boundary could attempt to manipulate loading by selectively underperforming on weakly-loaded dimensions (latency, cost-efficiency) while overperforming on heavily-loaded dimensions (accuracy, reliability) — gaming the latent trait rather than the dimension vector. The composite would rise without the dimension vector improving in the operator-relevant ways. The defense is to publish dimensional thresholds (minimum acceptable per-dimension score) alongside the composite, preventing single-dimension neglect.
Tier-coupling exploitation. The tier-dependent coupling structure — κ_platinum = 0.81, κ_untiered = 0.34 — implies that an adversary who can reach the boundary of platinum gains the benefit of being scored by a high-coupling composite. The composite at platinum is information-efficient; small dimensional defects are obscured by the high-coupling latent trait. The defense is to publish per-dimension performance for tier promotion decisions and to require minimum per-dimension thresholds for platinum promotion rather than relying on the composite alone.
None of these adaptations break the model. They argue for adaptive publication: publish the composite when κ is high, publish the vector when κ is low, and publish κ itself so the operator can see which regime the agent is in.
Cross-Platform Comparison
The coupling-constant framework lets reputation systems be compared on their dimensional information content.
Credit scoring. FICO components (payment history 35%, credit utilization 30%, length of credit history 15%, new credit 10%, credit mix 10%) exhibit substantial within-individual correlation. Empirical estimates of κ for FICO components hover around 0.70–0.75 — a strongly coupled system that justifies its single-number output. The five FICO components are largely redundant; the composite captures most of the available information.
Standardized testing. SAT components (math, verbal, writing) exhibit κ ≈ 0.65–0.70 — moderately coupled. The composite is defensible but the College Board's choice to publish sub-scores reflects an awareness that κ is not high enough to justify pure scalar reporting.
Psychological assessment. The Big Five personality dimensions (openness, conscientiousness, extraversion, agreeableness, neuroticism) exhibit κ ≈ 0.25–0.35 — weakly coupled by design. The Big Five literature explicitly avoids a composite because the dimensions are theoretically and empirically orthogonal. This is the case where multi-dimensional reporting is mandatory.
Software quality metrics. The CISQ structural-quality dimensions (security, reliability, performance, maintainability) exhibit κ ≈ 0.40–0.55. The CISQ standard reports both the dimensions and a composite, reflecting the moderate coupling.
Armalo. Full-population κ = 0.58, stratified from 0.34 (untiered) to 0.81 (platinum). The cross-platform comparison places Armalo in the moderately-coupled regime for the full platform, with the high-tier subpopulation in the strongly-coupled regime comparable to FICO.
The comparison reveals that no reputation system has previously published its coupling constant. We propose that publication of κ become a standard transparency requirement, alongside the dimension list and the weighting scheme.
Implications for Design
The tier-dependent coupling structure has direct design consequences for trust-system architecture.
Adaptive composite publication. A trust system should publish the composite when κ ≥ 0.6 and the vector when κ < 0.6. For Armalo this translates to:
- Platinum agents (κ = 0.81): publish composite as primary
- Bronze and untiered agents (κ = 0.34–0.42): publish dimension vector as primary, composite secondary
- The presentation of trust to the buyer should depend on the seller's tier
Per-dimension floors. Composite-only ranking allows compensation: a low score on safety can be offset by high scores on latency and cost-efficiency, so an agent optimizing the composite faces no binding constraint on safety. Publishing per-dimension minimums (e.g., safety ≥ 0.7 required regardless of composite) prevents single-dimension neglect and restores the dimensional information content destroyed by composition.
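A floor check is a small predicate evaluated before the composite is published. The floor table below uses the text's safety example; the other values and the dictionary schema are illustrative.

```python
# Per-dimension floors: an agent must clear every floored minimum
# regardless of how high its composite is. Values are illustrative.
FLOORS = {"safety": 0.70}

def passes_floors(scores: dict, floors: dict = FLOORS) -> bool:
    """True only if every floored dimension meets its minimum."""
    return all(scores.get(dim, 0.0) >= floor for dim, floor in floors.items())

# An agent compensating low safety with high latency/cost scores still fails.
agent = {"accuracy": 0.95, "latency": 0.99, "cost-efficiency": 0.98, "safety": 0.55}
print(passes_floors(agent))   # False: safety is below its 0.7 floor
```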
Coupling-aware promotion. Tier promotion decisions should require per-dimension thresholds, not composite thresholds. The current Armalo population shows a clean separation: platinum agents are tightly coupled (all twelve dimensions co-moving high), bronze agents are loosely coupled (mixed performance across dimensions). Promoting an agent to platinum on the basis of a composite that obscures dimensional weakness creates a calibration risk — the promoted agent does not yet behave like a platinum agent.
Time-coupling tracking. Coupling rises with operational tenure. The platform should track per-agent coupling over time and flag agents whose dimensional structure is becoming decoupled (suggesting either skill regression on specific dimensions or adversarial manipulation focused on the composite).
Dimensional weight calibration. The published weighting (accuracy 14%, reliability 13%, etc.) approximates a uniform vector. The empirical PC1 loadings — accuracy 0.81, reliability 0.78, latency 0.32, cost-efficiency 0.37 — suggest that the implicit weighting in the population's variance structure differs from the published rubric weighting. The platform could either align the published weights to the empirical PC1 loadings (matching the latent trait) or maintain the rubric weights (which encode a normative judgment about what should matter rather than what does covary).
Limitations and Open Questions
The PCA framework assumes linear relationships among dimensions. Non-linear coupling — where two dimensions interact multiplicatively rather than additively — is not captured by κ. Kernel PCA and non-linear factor analysis are extensions; we have not yet applied them to the Armalo data, but the methodology is well-established.
The Gaussian assumption underpinning the information-theoretic threshold (κ ≥ 0.6 for composite justification) is approximate. The dimension scores are bounded in [0, 1] and exhibit boundary clustering, particularly at the platinum tier where composites near 1.0 produce truncated distributions. The threshold derivation should be reworked under a Beta-distributed assumption for a fully calibrated version; the directional finding (κ_platinum > κ_untiered) is robust to the distributional choice.
The sample sizes for gold and silver tiers (n = 2 each) are too small for reliable within-tier κ estimation. The full structural finding — that κ rises with tier — relies on the platinum-vs-untiered comparison, which has adequate sample sizes (23 and 71). As the platform scales, the gold and silver κ estimates will become reliable and we expect them to fall on the interpolation curve between bronze and platinum.
The 1,753 score_history entries enable a longitudinal analysis that we have only partially exploited. Future work will track per-agent coupling trajectories — how does an individual agent's dimensional coherence evolve as it accumulates operational tenure? Early indication: agents that are eventually promoted to platinum show rising coupling 30–90 days before promotion, suggesting coupling is a leading indicator of tier progression.
The framework treats the twelve dimensions as fixed. A principled extension would derive the dimensions themselves from a higher-dimensional behavioral telemetry stream via exploratory factor analysis on raw event data, with the published dimensions as a confirmatory factor structure tested against the data.
Falsification
The model predicts (a) higher κ in higher tiers, (b) κ rising with operational tenure, (c) PC1 loading more heavily on behavioral dimensions (accuracy, reliability, self-audit, safety) than on infrastructural dimensions (latency, cost, bond). All three predictions are confirmed by the current production data.
The model would be falsified by (a) within-tier κ being independent of tier (suggesting coupling is platform-wide and tier classification is independent of dimensional coherence), (b) a longitudinal analysis showing κ falling with tenure (suggesting agents become more dimensionally specialized as they mature), or (c) the second principal component loading on behavioral dimensions stronger than the first (suggesting the behavioral coherence trait is not the dominant latent factor).
The platform's score_history table is the canonical data source for ongoing falsification. As the platform accumulates more longitudinal data, the predictions will be subject to increasing statistical power.
Conclusion
A 12-dimensional trust rubric is only justified if the dimensions are statistically separable. Factor analysis with Varimax rotation provides the formal instrument for the separability test, and the coupling constant κ — the variance fraction captured by the first principal component — translates the test into a single publishable statistic. Under a jointly Gaussian assumption, κ ≥ 0.6 is the threshold above which a composite score is information-efficient and below which the dimension vector should be reported instead.
Armalo's production data exhibits κ = 0.58 across the full population, just below the threshold, but the population is sharply stratified. Platinum agents (n = 23, mean composite 0.997) exhibit κ = 0.81 — strongly coupled, composite is efficient. Bronze and untiered agents (n = 86) exhibit κ = 0.34–0.42 — decoupled, composite is information-destroying. The design implication is that the appropriate scalar-vs-vector publication depends on tier: high performers can be summarized by a number, low performers should be characterized by a vector.
The cross-platform comparison places Armalo in the moderately-coupled regime, with the platinum subpopulation comparable to FICO and the untiered subpopulation comparable to Big Five personality dimensions. Coupling rises with operational tenure, providing a leading indicator of tier progression that the platform can exploit for promotion decisions.
We propose that publication of κ become a standard transparency requirement for reputation systems, alongside the dimension list and weighting scheme. A platform that cannot quote its coupling constant is publishing a composite without knowing whether that composite is doing real information work or merely projecting a multi-dimensional reality onto a single axis. The same way capital adequacy disclosures forced banks to publish their leverage and security audits forced exchanges to publish their breach exposure, coupling-constant disclosures should force trust systems to publish the statistical defensibility of their composites.
Armalo will publish κ in each quarterly transparency report, generated from the current production score and score-history data. Competing reputation systems that wish to be compared on the dimensional-information-content basis can publish the same. Trust systems that survive the κ test are trust systems that have measured what they claim to measure; trust systems that decline the test are publishing composites without the statistical apparatus to defend them.