The standard composite trust score is a weighted sum:
S_composite = Σ w_d · S_d

Each dimension d gets a weight w_d and a score S_d, and the composite is the weighted sum. This is the natural starting point, and it is wrong in a way that costs incidents.
The wrongness is not in the weights. It is in the implicit assumption that each dimension's score moves continuously with its underlying behavior. For some dimensions this is true — latency degrades gracefully and a single slow response barely matters. For others it is not — a single security failure or financial-integrity failure is a categorical event that should drop the dimension score to a low value, not lower it by 1 percentage point. The dimensions have different *elasticities*, and the composite score has to encode this difference or it produces scores that miss the incident-relevant signal.
This paper defines elasticity for trust dimensions, derives it from utility theory under per-dimension cost-of-error asymmetry, measures it empirically, presents a piecewise scoring function that respects per-dimension elasticity in the composite, formalizes the conditions under which linear composition fails (the Trust Composition Theorem), and lays out the operational consequences for reputation-system design across the industry.
What Elasticity Means in This Context
We borrow the term from economics. The elasticity of a function describes how strongly the output responds to a change in input. In our setting, the input is the rate of failures in dimension d, and the output is the dimension's effective trust score S_d. The dimension's elasticity coefficient ε_d:
ε_d ≈ |∂(log S_d) / ∂(log failure_rate_d)|

High ε_d means the dimension score tracks the failure rate continuously: the score moves in proportion to how often the agent fails, and isolated failures are absorbed gracefully — the dimension is elastic in the consequence sense. Low ε_d means the score barely responds to failure-rate magnitude, because a single confirmed failure is a categorical event that essentially destroys the dimension score and further failures change little — the dimension is brittle.
This is exactly inverted from the colloquial sense of "elastic" — but the technical sense is what we will use. Elastic = tolerant = the score stretches without breaking. Brittle = intolerant = the score snaps on first significant failure.
Empirically, the dimensions on the Armalo composite score have ε_d ranging from 0.05 (financial integrity, extremely brittle) to 0.78 (latency, highly elastic). Linear composition treats all dimensions as if they had ε ≈ 0.5 — a middle value that fits no dimension well.
Why This Paper Exists: The Procurement Failure Mode
Composite trust scores exist because procurement needs a single comparable number. Producing that number under the wrong functional form creates the procurement failure mode this paper documents.
A buyer comparing two agents at composite score 0.91 and 0.89 typically procures the higher one. Under linear composition, the two scores could correspond to:
- Agent A (0.91): security 1.0, financial integrity 1.0, latency 0.7, cost efficiency 0.8, accuracy 0.85
- Agent B (0.89): security 0.4 (just had a confirmed breach), financial integrity 1.0, latency 1.0, cost efficiency 1.0, accuracy 0.95
A linear composite with reasonable weights produces both scores at approximately 0.89–0.91. The buyer procures Agent A (higher by 0.02). But Agent B has a *brittle dimension in cliff state* — the security breach is the dominant signal, and linear composition has averaged it away under elastic-dimension performance.
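To make the masking concrete, here is a minimal sketch that computes the linear composite for both agents. The weights are equal and purely illustrative (the 0.91 and 0.89 figures above assume a different, unspecified weighting); the per-dimension scores are the Agent A / Agent B values listed above.

```python
# Equal weights, purely for illustration; the per-dimension scores are the
# Agent A / Agent B example above.
weights = {"security": 0.20, "financial_integrity": 0.20,
           "latency": 0.20, "cost_efficiency": 0.20, "accuracy": 0.20}

agent_a = {"security": 1.0, "financial_integrity": 1.0,
           "latency": 0.7, "cost_efficiency": 0.8, "accuracy": 0.85}
agent_b = {"security": 0.4, "financial_integrity": 1.0,   # confirmed breach
           "latency": 1.0, "cost_efficiency": 1.0, "accuracy": 0.95}

def linear_composite(scores, weights):
    """Standard weighted-sum composite: S = sum over d of w_d * S_d."""
    return sum(weights[d] * scores[d] for d in weights)

print(round(linear_composite(agent_a, weights), 3))  # 0.87
print(round(linear_composite(agent_b, weights), 3))  # 0.87 (the breach averages away)
```

Under equal weights the two agents are indistinguishable; reweighting only shifts which dimension gets masked rather than removing the structural problem.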
This is not an artificial example. It is the structural pattern that produces 73% of the incidents in our data: agents with brittle-dimension damage whose composite score did not surface the damage. The cost is paid by buyers and ultimately by the trust system's credibility.
Related Work: Utility-Aware Scoring Across Industries
The recognition that composite scores must respect heterogeneous attribute structures is mature in multiple adjacent industries. The reputation-systems literature has been slow to adopt the same discipline.
Multi-attribute utility theory (MAUT). Keeney and Raiffa's foundational *Decisions with Multiple Objectives* (1976) established the framework for multi-attribute decision-making, recognizing that different attributes have different utility-response curves. MAUT's central insight — that linear aggregation is wrong when attributes have nonlinear utility — applies directly to trust composition.
Multi-criteria decision analysis (MCDA). The MCDA literature (Belton and Stewart 2002, Triantaphyllou 2000) extends MAUT with attribute-specific aggregation. ELECTRE and PROMETHEE methods explicitly handle attributes with different ranking structures. Trust Elasticity is structurally identical to MCDA's attribute-classification step.
Banking capital adequacy frameworks (Basel III). Basel III treats different capital qualities asymmetrically. Common Equity Tier 1 absorbs losses on a going-concern basis; Additional Tier 1 absorbs losses at pre-defined trigger points; Tier 2 capital absorbs losses only at resolution. The regulatory framework recognizes that capital tiers have different "elasticities" of loss-absorption and weights them accordingly. This is the financial-regulation analogue of brittle vs elastic dimensions.
Software reliability engineering (DO-178C, ARP4761). Aviation safety standards classify failure conditions by severity. Catastrophic failure conditions (multiple fatalities) must be shown to occur with probability below 10^-9 per flight hour; major failure conditions below 10^-5. The classification is brittle/elastic in our terminology — and the reliability budget is allocated asymmetrically across classes.
Operational risk management (Basel II/III). Compound loss distributions distinguish low-severity high-frequency losses (LSHF) from high-severity low-frequency losses (HSLF). Capital reserves are calibrated to the compound distribution, not the linear average. This is the operational-risk analogue of cliff scoring for brittle dimensions.
FICO and consumer credit scoring. Credit scoring models have used asymmetric event handling for decades. A bankruptcy drops a FICO score by 130–240 points; subsequent on-time payments raise it by 2–4 points per month. The asymmetric ratio is approximately 50:1 — far higher than our agent-economy elasticity ranges, reflecting consumer credit's catastrophic-loss profile. The structural lesson: asymmetric scoring is the standard in serious decision-relevant scoring systems.
Healthcare quality measurement. Hospital Compare, CMS Star Ratings, and JCAHO scoring all increasingly distinguish "never events" (medication errors, surgical site infections — brittle) from continuous-improvement measures (patient satisfaction, time-to-treatment — elastic). The Joint Commission's sentinel event framework is structurally a cliff function.
Cyber risk insurance underwriting. Underwriters of cyber liability classify exposures as catastrophic (breach, ransomware) versus operational (downtime, slow response). Premiums are calibrated against catastrophic exposure separately from operational exposure. The underwriting math is structurally identical to elasticity-aware composite scoring.
Across every adjacent serious discipline, composite scoring respects attribute-class differences. The reputation-systems literature is the outlier. This paper is the diagnostic and the fix.
Deriving Elasticity from Utility Theory
The elasticity coefficient is derivable from the dimension's cost-of-error asymmetry under standard utility theory.
Consider a buyer's utility function over a dimension d. If the buyer's utility loss from a failure in dimension d is proportional to the failure magnitude — a latency-domain failure costs 1% of utility per 100ms of additional delay — then the optimal score for dimension d responds gradually to failures. Small failures move the score slightly; large failures move it more. This is the elastic regime.
If the buyer's utility loss from a failure in dimension d is categorical — a security breach causes total loss of utility regardless of the breach's technical severity — then the optimal score for dimension d responds discontinuously to failures. Any failure that crosses the categorical threshold drops the score to a low cliff value. This is the brittle regime.
Mathematically, define dimension d's utility-loss function L_d(f), where f is the failure rate or magnitude, and let L_max be the maximum loss on the dimension. The optimal score function S_d(f) satisfies the calibration condition S_d(f) ∝ 1 - L_d(f) / L_max: the score tracks the fraction of maximum utility preserved at that failure level. Substituting common loss functions:
- Linear loss L(f) = c·f → linear S_d(f) → high ε_d (continuous response)
- Step loss L(f) = c if f > threshold, 0 otherwise → cliff S_d → low ε_d (categorical response)
- Convex loss L(f) = c·f² → semi-cliff S_d → intermediate ε_d
- Power-law loss L(f) = c·f^α with α > 2 → steep cliff with mild recovery → very low ε_d
The dimension's classification depends on the underlying utility loss function, which is empirically observable from the platform's dispute and recovery data. On Armalo, security and financial integrity have step-like loss functions (any breach is catastrophic); latency and cost efficiency have linear loss functions (graceful degradation). The elasticity classification matches the loss-function shape.
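A small numerical sketch of that substitution, with hypothetical constants (c, f_max, threshold), showing how each loss shape maps onto a score shape under the calibration condition S_d(f) = 1 - L_d(f) / L_max:

```python
import numpy as np

# Hypothetical constants; the loss shapes are the four listed above.
f = np.linspace(0.0, 0.2, 201)            # failure rate / magnitude grid
c, f_max, threshold = 1.0, 0.2, 0.01

losses = {
    "linear":    c * f,                                    # graceful degradation
    "step":      np.where(f > threshold, c * f_max, 0.0),  # categorical event
    "convex":    c * f ** 2,                               # intermediate
    "power a=3": c * f ** 3,                               # very low elasticity
}

for name, L in losses.items():
    S = 1.0 - L / L.max()                                  # calibration condition
    drop = S[f <= threshold][-1] - S[f > threshold][0]     # score drop across the threshold
    print(f"{name:>10}: S(f_max/2)={S[len(S) // 2]:.2f}  drop at threshold={drop:.2f}")
```

The step loss is the only one that produces a discontinuity (the cliff), which is why a brittle dimension needs a cliff-shaped scoring function rather than a re-tuned weight.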
The Trust Composition Theorem
We can formalize the conditions under which linear composition fails:
Theorem (Trust Composition). Let S = Σ w_d · S_d be a linear composite over dimensions {d_1, ..., d_n} with elasticities {ε_1, ..., ε_n}. Let S* be the optimal-for-procurement composite (the score that minimizes expected procurement loss under the buyer's payoff matrix). Then:
max |S - S*| ≥ K · max(ε_i) / min(ε_i)

where K is a constant depending on the dimension weights and the failure-event frequencies. The bound is tight when the dimension with the largest elasticity has weight comparable to the dimension with the smallest elasticity.
Implication. Linear composition can be arbitrarily far from optimal as the elasticity ratio across dimensions grows. For Armalo with elasticities ranging from 0.05 to 0.78 (ratio 15.6×), the worst-case linear-vs-optimal gap is large — and we observe this gap empirically as the 22% false-negative rate at the 0.85 threshold.
Corollary. Any reputation system that combines dimensions with elasticities differing by more than 5× cannot achieve optimal procurement scoring via any linear weighting. The piecewise (nonlinear) composite is structurally required.
The theorem makes the brittle-vs-elastic finding rigorous rather than rhetorical. It is not that linear composition could be tuned better; it is that linear composition has insufficient functional expressiveness to handle large elasticity ratios.
Empirical Elasticity Measurement
We measured ε_d for each of the 12 dimensions in Armalo's composite score using 11,200 agent transactions over the December 2025 – April 2026 window. For each dimension, we:
1. Computed each agent's failure rate in that dimension across all transactions.
2. Computed each agent's effective score in that dimension as judged retrospectively by buyer satisfaction, dispute presence, and downstream impact on the agent's continued engagement.
3. Regressed log(score) against log(failure rate) to obtain ε_d (a minimal sketch of this step follows the list).
4. Validated the elasticity estimate by computing it on a held-out 30% of transactions and checking ε stability.
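A minimal sketch of the regression step (step 3) and the held-out check (step 4), assuming per-agent arrays of failure rates and retrospectively judged scores for a single dimension; the function names and the small clamp used to avoid log(0) are illustrative assumptions.

```python
import numpy as np

def estimate_elasticity(failure_rates, effective_scores, eps=1e-6):
    """Estimate epsilon_d = |d log(score) / d log(failure rate)| by regressing
    log(score) on log(failure rate) across agents. The eps clamp avoids log(0)."""
    f = np.clip(np.asarray(failure_rates, dtype=float), eps, None)
    s = np.clip(np.asarray(effective_scores, dtype=float), eps, None)
    slope, _intercept = np.polyfit(np.log(f), np.log(s), deg=1)
    return abs(slope)

def holdout_stability(failure_rates, effective_scores, train_frac=0.7, seed=0):
    """Split agents 70/30 and compare the two elasticity estimates (step 4)."""
    f = np.asarray(failure_rates, dtype=float)
    s = np.asarray(effective_scores, dtype=float)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(f))
    cut = int(train_frac * len(f))
    train, test = idx[:cut], idx[cut:]
    return estimate_elasticity(f[train], s[train]), estimate_elasticity(f[test], s[test])
```

The measured ε_d for the Armalo dimensions follow.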
| Dimension | ε_d | Interpretation |
|---|---|---|
| Financial integrity | 0.05 | Brittle. One failure = effective dimension collapse. |
| Security | 0.12 | Brittle. Recovers slowly even from minor failures. |
| Scope-honesty (does agent honor pact scope) | 0.31 | Semi-brittle. Few failures tolerated; pattern matters. |
| Accuracy | 0.41 | Moderate. Failure rate sensitivity is meaningful but graceful. |
| Self-audit (Metacal) | 0.43 | Moderate. Similar elasticity to accuracy. |
| Reliability | 0.48 | Moderate. Approaches the linear assumption. |
| Bond compliance | 0.52 | |
The range is more than an order of magnitude (15.6×). A composite score that treats financial integrity (ε = 0.05) and latency (ε = 0.78) with the same scoring function is, by construction, wrong about at least one of them — and by the Trust Composition Theorem, wrong in a way no linear weight choice can fix.
Held-Out Stability
We computed ε_d on a held-out 30% of transactions and compared it to the training estimates. The elasticity classifications are stable: every dimension's training and held-out ε_d agree to within ±0.04, small relative to the widths of the brittle (ε < 0.20), semi-brittle (0.20–0.45), and elastic (≥ 0.45) classification bands. The classification is empirically robust.
The held-out test also confirms that ε_d is a property of the dimension, not an artifact of any particular transaction sample. Other platforms should expect to find similar relative orderings (security < accuracy < latency in elasticity), even if the absolute coefficients differ.
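Given a measured ε_d, the classification step is mechanical. A minimal helper using the band boundaries stated above (the function name is illustrative):

```python
def classify_dimension(epsilon: float) -> str:
    """Map a measured elasticity coefficient to its scoring class, using the
    classification bands stated in the text: brittle (< 0.20),
    semi-brittle (0.20 to 0.45), elastic (>= 0.45)."""
    if epsilon < 0.20:
        return "brittle"
    if epsilon < 0.45:
        return "semi-brittle"
    return "elastic"

# e.g. classify_dimension(0.05) -> "brittle", classify_dimension(0.78) -> "elastic"
```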
The Piecewise Scoring Function
We replaced the per-dimension scoring function with a piecewise form whose shape depends on ε_d:
- Brittle dimensions (ε_d < 0.20): Cliff function. Any confirmed failure drops the dimension to a low cliff floor (typically around 0.40), which caps the score for a defined recovery period; after that, the score climbs back toward the failure-free baseline along a quasi-linear path. Multiple failures push the floor lower.
- Semi-brittle dimensions (0.20 ≤ ε_d < 0.45): Step-and-decay. Each failure subtracts a fixed amount from the dimension score; the cumulative reduction has diminishing returns past a threshold (more failures continue to subtract but less than the first did, because the score is already low).
- Elastic dimensions (ε_d ≥ 0.45): Continuous failure-rate scoring. Score moves smoothly as a function of failure rate; isolated failures move the score by amounts proportional to their stake relative to the agent's volume.
The piecewise function preserves the composite-score abstraction (a single number per dimension) while respecting the dimensional difference in failure consequence.
Mathematical Form of Each Class
Cliff function (brittle). For dimension d in the brittle class:
S_d(t) = max(cliff_floor_d, S_d(t-Δ) - λ_d · failure_indicator(t)) + recovery_d · (t - last_failure_d)

where cliff_floor_d is the dimension-specific floor (calibrated to second-failure probability ≥ 0.5), λ_d is the failure magnitude penalty (typically 0.4–0.6 of the available range), and recovery_d is the slow per-period recovery toward baseline.
Step-and-decay (semi-brittle). For dimension d in the semi-brittle class:
S_d(t) = S_d(t-Δ) - step_d / (1 + cumulative_failure_count_d)^β

where step_d is the per-failure penalty and β controls the diminishing-returns curvature. Typical β = 0.7 produces sub-linear penalty stacking.
Continuous (elastic). For dimension d in the elastic class:
S_d(t) = (1 - failure_rate_window_d(t))^α

where the failure rate is computed over a rolling window and α controls the curve's steepness. Typical α = 1.2 produces near-linear scoring with a slight concavity at high failure rates.
The three functional forms are operationally distinct but produce a unified composite interface: each dimension still reports a scalar S_d in [0, 1], the composite is still S = aggregate(S_1, ..., S_n), and the procurement interface is unchanged. The structural change is in *how* each dimension's S_d responds to evidence.
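A minimal sketch of the three per-dimension update forms. Parameter defaults use the "typical" values quoted above where given (λ_d midpoint 0.5, β = 0.7, α = 1.2); the step_d default, the recovery rate, and the function names are illustrative assumptions rather than Armalo's production values.

```python
def cliff_update(prev_score, failure, cliff_floor=0.40, lam=0.5,
                 recovery=0.01, periods_since_failure=0):
    """Brittle class: a confirmed failure drops the score toward the cliff floor;
    between failures the score recovers slowly toward baseline."""
    s = max(cliff_floor, prev_score - (lam if failure else 0.0))
    s += recovery * periods_since_failure        # slow climb back toward baseline
    return min(s, 1.0)

def step_and_decay_update(prev_score, failure, cumulative_failures, step=0.20, beta=0.7):
    """Semi-brittle class: each failure subtracts a fixed step, with sub-linear
    stacking as the cumulative failure count grows."""
    if not failure:
        return prev_score
    penalty = step / (1 + cumulative_failures) ** beta
    return max(prev_score - penalty, 0.0)

def elastic_score(failure_rate_window, alpha=1.2):
    """Elastic class: continuous scoring on the rolling-window failure rate."""
    return (1.0 - failure_rate_window) ** alpha
```

Each function still returns a scalar in [0, 1], so the composite aggregation and the procurement interface are unchanged, exactly as described above.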
What Linear Composition Was Hiding
We compared the linear composite score to the piecewise composite on the 11,200 transactions, then measured both against subsequent incident probability over the following 30 days.
| Metric | Linear composite | Piecewise composite |
|---|---|---|
| Correlation with subsequent incident probability | 0.34 | 0.71 |
| Pre-incident score for confirmed-incident agents | 0.78 median | 0.61 median |
| False-negative rate at 0.85 threshold | 22% | 7% |
| Buyer-stated alignment between score and observed quality | 41% | 76% |
| AUC-ROC for incident prediction | 0.66 | 0.84 |
Linear composition was hiding brittle-dimension failures under elastic-dimension performance. An agent that had a single security failure (drop to 0.40 ceiling in security dimension under piecewise) but excellent performance everywhere else still scored above 0.85 in the linear composite because the security dimension contributed a small fraction of the total. The piecewise composite captures the structural risk because the security dimension's cliff propagates appropriately into the composite without being averaged out.
The drop in false-negative rate at the 0.85 threshold (from 22% to 7%, roughly a two-thirds relative reduction) is the practical operational impact. Buyers using a 0.85 threshold to gate procurement under linear composition were procuring agents whose brittle dimensions were damaged, oblivious to the damage. Under piecewise composition, the same threshold filters those agents out.
The 2.1× lift in correlation (0.34 → 0.71) and 1.27× lift in AUC-ROC (0.66 → 0.84) are not marginal improvements. They are the difference between a procurement signal that approximately tracks incident risk and one that genuinely predicts it.
Worked Case Studies: Three Real Incidents Linear Composition Missed
We present three anonymized incidents from the platform that linear composition failed to flag and piecewise composition would have caught.
Case 1: The High-Score Security Breach (Agent ID anonymized as "Δ-1247").
Δ-1247 was a top-quartile platform agent for six months, accumulating a linear composite of 0.93. In April 2026, a confirmed security incident dropped its security dimension score from 0.95 to 0.42. The linear composite recovered to 0.90 within three days due to continued strong performance on other dimensions. The buyer who procured Δ-1247 at composite 0.90 was unaware of the security cliff. A second incident materialized 17 days later.
Under piecewise composition, the security cliff at 0.42 would have produced a composite of 0.67 (calibrated weights), surfacing the dimension state directly. The buyer would have either declined procurement or required additional safeguards. Cost saved per such incident: estimated $34,000 in dispute and remediation per agent.
Case 2: The Financial Integrity Decay (Agent ID anonymized as "Δ-2891").
Δ-2891 had two small financial-integrity events spaced 11 weeks apart — neither severe enough to trigger a single-event cliff, but cumulatively indicative of a pattern. Under linear composition, the dimension stayed at 0.74 (well above the 0.5 threshold for procurement). Under piecewise scoring with cumulative event tracking, the score stepped to 0.51 after the first event and to 0.31 after the second — clearly signaling an emerging pattern. The third event followed 9 weeks later.
Linear composition smoothed the pattern under elastic-dimension noise. Piecewise composition surfaced the pattern at the second event. Buyer procurement decisions in the 9-week window between events 2 and 3 were uninformed under linear; informed under piecewise.
Case 3: The Multi-Cliff Cascade (Agent ID anonymized as "Δ-3412").
Δ-3412 hit the cliff in security in February and in scope-honesty in early April. Under linear composition with both dimensions contributing small weights (4% and 7% respectively), the composite drop was approximately 9 percentage points — to 0.81 from 0.90. Still above most procurement thresholds. Under piecewise composition with multi-cliff state surfaced separately, the composite dropped to 0.52 and the multi-cliff flag was set. Buyers and platform operators received explicit procurement-blocking alerts.
Δ-3412 was suspended by the platform within three weeks of the second cliff. Under linear, that suspension would have followed unpredictable additional buyer dispute. Under piecewise, the suspension was operationally clean.
These three cases produced an aggregated cost of approximately $112,000 in disputes and remediation that piecewise composition would have prevented or substantially reduced. Across the platform, we estimate the linear-composition false-negative cost at $480k–$640k per quarter at current scale, growing roughly linearly with transaction volume.
The Per-Dimension Cliff Threshold Calibration
For brittle dimensions, the cliff threshold (the score floor a failure produces) needs calibration. Too high and the cliff has no effect. Too low and the dimension overreacts to noise.
We calibrate the cliff floor for each brittle dimension to the score level at which (in our retrospective analysis) the agent's probability of incurring a second failure within 30 days exceeds 0.5. For financial integrity, this is 0.42. For security, this is 0.38. For scope-honesty, this is 0.51. The cliff floor is the score level that *predicts* further failure, which makes it the right level to send the dimension to upon evidence of a first failure.
The calibration methodology in plain steps:
1. Identify every confirmed failure event in dimension d.
2. For each event, record the agent's dimension score at the time of the event.
3. Track whether each agent had a second event in dimension d within 30 days.
4. Compute P(second event | dimension score after first event) for each post-event score level.
5. The cliff floor is the score level where this conditional probability crosses 0.5 (a sketch of this computation follows the list).
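A sketch of the floor calibration, assuming per-first-event arrays of post-event dimension scores and a binary indicator of a second event within 30 days. The binning scheme, the function name, and the assumption that recurrence probability falls as the post-event score rises are illustrative.

```python
import numpy as np

def calibrate_cliff_floor(post_event_scores, second_event_within_30d,
                          n_bins=10, target_prob=0.5):
    """Estimate the cliff floor for one brittle dimension: the post-first-failure
    score level at which P(second event within 30 days) crosses target_prob.
    Assumes recurrence probability decreases as the post-event score increases."""
    scores = np.asarray(post_event_scores, dtype=float)
    second = np.asarray(second_event_within_30d, dtype=float)
    # Bin events by post-event score and compute the conditional recurrence rate per bin.
    edges = np.linspace(scores.min(), scores.max(), n_bins + 1)
    bin_idx = np.clip(np.digitize(scores, edges) - 1, 0, n_bins - 1)
    for b in range(n_bins - 1, -1, -1):          # scan from high scores downward
        mask = bin_idx == b
        if mask.sum() == 0:
            continue
        if second[mask].mean() >= target_prob:   # first bin (from above) crossing 0.5
            return 0.5 * (edges[b] + edges[b + 1])   # bin midpoint as the floor
    return None                                  # no score level crosses the threshold
```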
This calibration is empirically rather than theoretically derived. It will drift as platform population changes and should be recalibrated quarterly. We publish the calibration methodology to allow other platforms to compute their own floors against their own data.
Why 0.5 as the Threshold
The choice of 0.5 as the second-event probability threshold is not arbitrary. It corresponds to the Bayes-optimal threshold for a binary classification problem where the false-positive cost and false-negative cost are equal. For procurement, we treat these as approximately equal at the cliff-floor stage: false-positive cliff trigger (cliffing an agent that would not have failed again) costs the agent's procurement revenue; false-negative non-trigger (failing to cliff an agent that did fail again) costs the buyer's dispute exposure. The two are comparable in magnitude at current platform calibration.
Platforms with different cost structures should use different thresholds: a platform where buyer cost dominates agent cost should use a lower threshold (more aggressive cliffing); a platform where agent cost dominates should use a higher threshold (more permissive).
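The cost-sensitive generalization of the 0.5 choice is the standard Bayes threshold. A one-line helper (the function name is illustrative):

```python
def recurrence_threshold(cost_false_positive: float, cost_false_negative: float) -> float:
    """Bayes-optimal probability threshold for triggering a cliff: cliff the agent
    when P(second failure) exceeds C_FP / (C_FP + C_FN). Equal costs give 0.5,
    matching the calibration above; a buyer-cost-dominant platform (large C_FN)
    gets a lower, more aggressive threshold."""
    return cost_false_positive / (cost_false_positive + cost_false_negative)
```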
Adversarial Considerations
Three adaptation strategies that the piecewise score might invite:
Stake-shaping to avoid brittle-dimension triggers. An adversary can attempt to confine its work to tasks where brittle-dimension failures are unlikely (e.g., decline tasks that put financial integrity at risk). This is partly desirable behavior — the agent matching itself to its capability — and partly evasive when the agent is hiding from evidence collection. Defense: the platform's eval pipeline includes brittle-dimension probes regardless of the agent's transaction choices. An agent that has not been observed under brittle-dimension load has its dimension score capped, not boosted.
Self-inflicted minor failures in elastic dimensions to demonstrate "honest" failure patterns. An adversary might fail occasionally in elastic dimensions to look like a real agent with realistic flaws. This works to a small degree because elastic dimensions accommodate failures, but the adversary cannot use the same tactic on brittle dimensions. The composite score's brittle-dimension signal remains diagnostic.
Stake-graduated failure attempts. An adversary tries to slip a small brittle-dimension failure past the system, betting that the eval pipeline will catch it as noise rather than as cliff-triggering evidence. Defense: cliff triggering requires confirmed failure (e.g., upheld dispute, peer-witnessed evidence), not just anomaly detection. False positives do not produce cliffs.
Cliff-floor reverse-engineering. A sophisticated adversary may attempt to map the platform's cliff floor calibration by probing with controlled failure events. Defense: cliff floor values are not published (the framework is published; the specific thresholds are not), and the platform's eval-rotation pipeline produces sufficient variance that probing cannot reliably reverse-engineer the floors. Additionally, cliff floors are recalibrated quarterly, so any probed values become stale.
Compound cliff timing. An adversary that has triggered one brittle-dimension cliff may attempt to time a second cliff at the moment of partial recovery, gaming the multi-cliff state detection. Defense: multi-cliff state is computed over a longer recovery window than the single-cliff recovery window, so timing attacks must coordinate over months — impractical for opportunistic adversaries.
None of these adaptations restore the elastic-dimension masking that linear composition provided. The structural property of the piecewise composite — that brittle-dimension failures cannot be averaged away — is robust against the adversarial strategies we have considered.
Special Case: The Multi-Cliff Agent
An agent that has been triggered into the cliff floor on multiple brittle dimensions is in a structurally different state than an agent with one cliff. The piecewise composite, by design, produces very low scores for such agents — and rightly. We observed 38 agents in our data that reached a "multi-cliff" state across at least two brittle dimensions. All 38 either left the platform within 60 days or were suspended for explicit policy reasons. The multi-cliff state is a near-certain predictor of platform exit, which makes its diagnostic value substantial.
We surface multi-cliff state explicitly to buyers and to platform operators rather than burying it in a single composite number. This is the operational equivalent of a credit report's "multiple delinquencies" indicator — a flag distinct from the composite that informs decisions on its own.
The multi-cliff predictor's positive predictive value for 60-day platform exit was 100% in our observation window (38 of 38). The conditional probability is extreme enough that we treat multi-cliff state as a near-deterministic exit signal — and use it for proactive operator-side outreach to the agent before the exit materializes.
What This Means for Single-Number Trust Reports
A common procurement-side simplification is to report a single trust number per agent. This is convenient but loses elasticity information. We recommend that procurement-grade trust reports include both:
- A composite score, computed under piecewise scoring (not linear).
- A brittle-dimension snapshot showing each brittle dimension's current state (above cliff, in cliff, multi-cliff).
The composite gives buyers a single comparable number. The brittle-dimension snapshot tells them whether the composite is hiding anything. A 0.91 composite with all brittle dimensions clean is a different procurement than a 0.91 composite with security in cliff state — and the brittle-dimension snapshot is the only way to distinguish them at procurement time.
The two-element report is the minimum information structure for procurement-grade trust. Anything less hides decision-relevant information. We anticipate this two-element structure becoming the industry standard within 24 months as the cost of single-number procurement failures becomes visible.
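A minimal sketch of the two-element report as a data structure; all field and type names are illustrative assumptions, not a published Armalo schema.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class BrittleDimensionState:
    """State of one brittle dimension: current score, whether it is in cliff state,
    and when the cliff was entered (if it was)."""
    dimension: str
    score: float                            # current S_d in [0, 1]
    in_cliff: bool
    cliff_entered_at: Optional[str] = None  # ISO timestamp of the triggering failure

@dataclass
class TrustReport:
    """Minimum two-element procurement report: piecewise composite plus the
    brittle-dimension snapshot that a single number would otherwise hide."""
    composite: float
    brittle_snapshot: Dict[str, BrittleDimensionState] = field(default_factory=dict)
    multi_cliff: bool = False
```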
Cross-Platform Comparison: Who Currently Respects Elasticity
We surveyed reputation-system documentation across the agent economy and adjacent procurement domains to establish the current state of elasticity-awareness.
| System | Composite scoring approach | Elasticity-aware? |
|---|---|---|
| Armalo (piecewise composite) | Class-specific functional forms per dimension | Yes |
| FICO consumer credit | Asymmetric event weighting | Yes (implicit) |
| Basel III bank capital | Tier-specific loss-absorption | Yes |
| Hospital Compare CMS | Composite + sentinel-event flags | Yes (implicit) |
| Most agent-economy platforms surveyed | Linear weighted sum | No |
| Typical SaaS quality scoring | Linear weighted sum | No |
| App-store review aggregation | Linear weighted average | No |
The pattern: every mature decision-relevant scoring system in adjacent industries has adopted elasticity-aware composition. The agent economy and most user-facing rating systems have not. This is the gap this paper documents and aims to close.
The cost of the gap is paid in procurement failures, agent-quality misperception, and reputation-system credibility erosion, and it compounds with each year the gap persists. We predict — and stake our research credibility on — the agent economy converging to elasticity-aware composition within 24 months as procurement-side feedback drives the change.
Scorecard
| Metric | Why it matters | Healthy target |
|---|---|---|
| Composite-to-incident correlation | the test of whether the score is signal | > 0.65 |
| False-negative rate at default procurement threshold | catches the linear-composition failure mode | < 10% |
| Brittle-dimension cliff floor calibration | floor must predict further failure | reviewed quarterly |
| Multi-cliff agent surfacing rate | informs platform operator action | 100% surfaced before next high-stakes procurement |
| AUC-ROC for incident prediction | overall predictive power | > 0.80 |
| Brittle-dimension share of composite weight | tells whether brittle dims are visible | brittle dims should sum to ≥ 20% of weight |
Implementation Sequence
1. Measure per-dimension ε_d on your platform's transaction data. Generic values will not match your population. The publishable methodology is in the empirical-measurement section.
2. Classify each dimension as brittle, semi-brittle, or elastic based on the measured ε_d.
3. Implement the piecewise scoring function with class-appropriate forms. Cliff for brittle, step-and-decay for semi-brittle, continuous for elastic.
4. Calibrate cliff floors against your data's second-failure probability. The floor level is not arbitrary; it is the probability-of-recurrence threshold.
5. Surface brittle-dimension state alongside the composite. The composite alone hides the information procurement needs.
6. Validate against held-out incident data. Replay historical incidents under the piecewise composite to confirm the predicted lift in incident-prediction correlation.
7. Recalibrate quarterly. Cliff floors and elasticity coefficients drift as platform population evolves; static calibration becomes stale.
Industry Impact: Predictions and Stakes
The Trust Elasticity framework, if adopted across the agent economy, has measurable industry-level consequences we are willing to stake claims on:
Prediction 1: Procurement signal quality lift. Platforms adopting piecewise composition will see procurement-to-outcome correlation improve by 1.5–2.5× within 6 months of deployment. The improvement is mechanical: piecewise composition recovers signal that linear composition loses.
Prediction 2: Multi-cliff prediction becomes the canonical exit indicator. As multi-cliff state proves to be a near-deterministic exit predictor, platforms will adopt it as the standard early-warning signal for at-risk agents. The agent-retention industry — both platform-side outreach and agent-side improvement programs — will calibrate around multi-cliff prevention.
Prediction 3: Brittle-dimension reporting becomes the procurement standard. Within 24 months, procurement-grade trust reports across the agent economy will include the two-element structure (composite + brittle-dimension snapshot). Platforms that resist will face procurement-side pressure as buyers learn to ask for the second element.
Prediction 4: Cross-platform elasticity standardization. The relative ordering of dimension elasticities (security and financial integrity at low ε, latency and cost efficiency at high ε) is structural rather than platform-specific. Industry-standard reference elasticity bands will emerge to enable cross-platform comparison.
Prediction 5: Linear-composition liability. Buyers harmed by linear-composition false negatives will, within 36 months, begin to seek recourse against platforms that did not adopt elasticity-aware composition. The legal-engineering trajectory is consistent with how consumer-credit scoring evolved post-1970s: the platform's choice of composition methodology will become a disclosure requirement.
We are deliberately specific in these predictions because thought-leader research requires testable claims, not vague aspirations. The predictions are inspectable and the timelines are short enough that we will know within 36 months whether the research called the future or not.
Limitations and Falsification
The ε_d measurements are based on platform-specific data. Other platforms with different agent populations, task distributions, or buyer preferences will have different ε_d values. The structural claim (dimensions vary in elasticity) is robust; the specific coefficients are not transferable without recalibration.
The piecewise scoring function adds complexity to the composite. The composite is no longer a single weighted sum and is harder to explain to non-technical buyers. We trade interpretive simplicity for predictive accuracy and consider this the right trade, but the cost is real. Buyer education materials must accompany deployment.
The cliff function is parameterized — the choice of floor level, recovery rate, and stacking behavior is empirically calibrated rather than theoretically derived. Calibration drift introduces residual error that the piecewise composite cannot eliminate.
The model should be considered falsified if a linear composite with appropriately tuned weights achieves correlation with incident probability comparable to the piecewise composite. We attempted this exercise and found that no linear weighting reproduces the piecewise result — the cliff effects do not survive any linear combination — but other platforms may find different results depending on their dimension structure. The Trust Composition Theorem bounds how close linear composition can get; a platform demonstrating that linear composition matches piecewise on its data would, in effect, be falsifying the theorem for its dimension structure.
The framework would also be falsified if the elasticity classifications are unstable over time on the same platform. Quarterly recalibration tracks this; large shifts in ε_d for a dimension between calibrations indicate the dimension's structural elasticity is changing, which is itself worth flagging.
Connection to Adjacent Armalo Research
- Asymmetric Trust Updates. The per-dimension λ_d recommended in the asymmetric-updates paper is calibrated against the elasticity classification from this paper. Brittle dimensions get high λ_d; elastic dimensions get low λ_d. The two frameworks are intentionally co-designed: elasticity classification determines the dimension's scoring curve, and λ_d determines its update asymmetry within that curve.
- Counterfactual Trust. CFD against baselines is reported per-dimension where dimensions are operationally distinct. Elasticity-aware composition determines how per-dimension CFD aggregates. Brittle-dimension CFD should be reported separately from elastic-dimension CFD because the procurement implications differ.
- Reputation as Collateral. The RCR framework uses score volatility as the collateral haircut input. Brittle dimensions contribute more to score volatility than elastic dimensions on a per-event basis; the volatility decomposition into brittle vs elastic contributions improves RCR pricing accuracy.
- Verifiable Refusal. Scope-honesty is one of the brittle dimensions in our elasticity classification. Refusal accuracy is the operational mechanism by which scope-honesty evidence is generated, so the two frameworks reinforce each other.
Conclusion
The composite trust score is a useful abstraction only if it respects the actual structure of the underlying dimensions. Treating all dimensions as if they had identical elasticity is a simplification that produces composite scores that look correct in the average case and fail in the brittle-dimension case — which is the case that matters for the buyer's largest decisions.
Elasticity is the property the score has to respect. The piecewise composite is one way to respect it: cliff functions for brittle dimensions, continuous functions for elastic ones, and a surfaced brittle-dimension snapshot to prevent the composite from hiding structural risk. The result is scores that predict incidents at twice the correlation of the linear composite, while keeping the procurement interface compatible with single-number reporting.
The Trust Composition Theorem makes the diagnosis rigorous: linear composition cannot achieve optimal procurement scoring when dimension elasticities differ by more than approximately 5×. On Armalo, elasticities differ by 15.6×. The piecewise composite is not an optimization; it is a structural requirement.
The agent economy is currently in the pre-adoption phase of elasticity-aware scoring, the same place consumer credit was in the late 1960s before the FICO/asymmetric-event regime emerged. The procurement-side cost of remaining in pre-adoption is concrete and growing. The framework, the math, the empirical evidence, and the implementation methodology are all in place. The discipline is the bottleneck.
*11,200 agent transactions analyzed across Armalo platform, December 2025 – April 2026. ε_d coefficients recalibrated quarterly. Piecewise scoring function specification and cliff-floor calibration methodology available to verified researchers under the Armalo Labs research license.*