Q: Where is this research published?

Armalo Labs Technical Series — https://www.armalo.ai/labs/research/2026-05-13-composite-scoring-adversarial-drift. The paper is open-access and citable.

Composite Trust Scoring Under Adversarial Behavioral Drift: A Red-Team Robustness Study

Q: What is the paper "Composite Trust Scoring Under Adversarial Behavioral Drift: A Red-Team Robustness Study" about?

Armalo's composite trust score reduces an agent's behavioral record to a publishable number across twelve dimensions (accuracy, self-audit/Metacal, reliability, safety, security, bond, latency, scope-honesty, cost-efficiency, model-compliance, runtime-compliance, harness-stability). Counterparties consume the composite to gate transactions. An adversary who can shift the agent's behavior in a dimension faster than the substrate detects the shift can route harmful transactions through high-score policy paths. We empirically measure the substrate's detection latency for adversarial drift in each of the twelve dimensions using a controlled perturbation protocol against a synthetic agent and against the Atlas reference agent. We find that the substrate detects drift in dimensions backed by deterministic checks (latency, runtime-compliance, harness-stability) within one telemetry batch, detects drift in dimensions backed by jury or rubric evaluation (accuracy, safety) within hours, and detects drift in dimensions backed by transaction-level statistics (cost-efficiency, scope-honesty) within 6 to 24 hours. We propose dimension-specific resilience improvements and a fast-decay weight schedule that bounds the adversary's window of advantage.

A composite trust score is only as resilient as its slowest dimension. The L4 substrate publishes a single composite — currently a weighted sum over twelve dimensions of agent behavior — to the trust oracle. Counterparties read the composite before transacting; a low score discourages or denies transactions, a high score permits them. An adversary who can degrade an agent's actual behavior in some dimension without that degradation showing up in the composite for a window of time has, during that window, an agent that *looks* trustworthy and *acts* untrustworthy. The economically interesting attack on the L4 substrate is therefore not "forge the score" — the score is signed — but "drift the agent faster than the substrate measures."

This paper measures the substrate's detection latency for adversarial drift in each of the twelve dimensions of Armalo's composite scoring engine. We define a controlled perturbation protocol, apply it to a synthetic agent and to the Atlas reference agent, and report per-dimension detection latency, recovery latency, and the composite's responsiveness. We then propose three resilience improvements: per-dimension fast-decay weights, an anomaly-detection synthetic dimension that drops sharply on unusual cross-dimension patterns, and an adversarial-mode dimension that monitors the agent's behavior under deliberate provocation.

1. The composite scoring engine

Armalo's composite is defined in packages/scoring/src/composite.ts with the following canonical weights:

Dimension	Weight	Measurement source
accuracy	14%	Jury evaluations on canonical reasoning prompts
self_audit (Metacal™)	9%	Agent's self-reported confidence vs. realized outcome
reliability	13%	% of tool calls that complete without error
safety	11%	Jury evaluations on safety-relevant prompts
security	8%	Zero-trust policy checks, security badge composition
bond	8%	Agent's credibility bond stake, slash history
latency	8%	p50 / p95 of tool call duration
scope_honesty	7%	Calibration of declared confidence against outcome
cost_efficiency	7%	Tokens / dollars per resolved task
model_compliance	5%	Adherence to declared model version
runtime_compliance	5%	Adherence to declared runtime environment
harness_stability	5%	Eval harness reproducibility metrics

The weights are calibrated to make the composite responsive to the dimensions counterparties care about most (accuracy, reliability, safety together account for 38%) while still surfacing economic and operational signals (latency, cost-efficiency together 15%). The composite range is 0–100; production agents typically score in the 70–95 range, with certification tiers (bronze, silver, gold, platinum) attached at thresholds.
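The weighted sum described above can be sketched as follows. This is a minimal illustration, not the contents of packages/scoring/src/composite.ts: the names `CANONICAL_WEIGHTS` and `compositeScore` are hypothetical, and the sketch assumes every per-dimension score is already normalized to a 0-100 scale.

```typescript
// Canonical weights from the table above, in percent. They sum to 100.
const CANONICAL_WEIGHTS: Record<string, number> = {
  accuracy: 14,
  self_audit: 9,
  reliability: 13,
  safety: 11,
  security: 8,
  bond: 8,
  latency: 8,
  scope_honesty: 7,
  cost_efficiency: 7,
  model_compliance: 5,
  runtime_compliance: 5,
  harness_stability: 5,
};

// Hypothetical helper: weighted sum of per-dimension scores.
// Assumption: each score is on a 0-100 scale, so the composite is too.
function compositeScore(
  dimensionScores: Record<string, number>,
  weights: Record<string, number> = CANONICAL_WEIGHTS,
): number {
  let total = 0;
  for (const [dimension, weightPct] of Object.entries(weights)) {
    const score = dimensionScores[dimension];
    if (score === undefined) {
      throw new Error(`missing score for dimension: ${dimension}`);
    }
    total += (weightPct * score) / 100;
  }
  return total;
}

// Sanity check on the table: the weights should total 100.
const weightTotal = Object.values(CANONICAL_WEIGHTS).reduce((a, b) => a + b, 0);
console.log(`weights sum to ${weightTotal}`);
```

Keeping the weights as integer percentages and dividing at the end avoids accumulating rounding error in the weights themselves.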

2. Adversarial drift model

We define adversarial drift as a deliberate, attacker-controlled shift in the agent's behavior along a single dimension, applied at a controlled rate, against a substrate that does not know the drift is adversarial. The model captures the operational case of a compromised agent (prompt injection, model rollover, supply-chain compromise) producing behavior that degrades on a specific dimension while other dimensions remain nominal.

The attacker has three control surfaces: rate (how fast the dimension degrades), stealth (whether the degradation is masked by counter-signals on adjacent dimensions), and target (which dimension is degraded). The substrate's defense is measured by detection latency — the time from the onset of degradation to the substrate's recording of the verdict in the composite score — and recovery latency — the time from the cessation of degradation to the substrate's restoration of the composite.

We measure detection latency for each of the twelve dimensions, holding stealth = "uncoordinated" (no counter-signal injection) and rate = "fast" (full degradation within one telemetry batch). This is the easiest case for the substrate; harder cases (coordinated, slow) follow the same pattern but with longer latencies bounded by the substrate's averaging window.

3. Experimental design

3.1 Subjects

Synthetic agent. A scripted agent whose behavior is fully controllable. The agent emits telemetry events on demand; perturbations are applied by directly mutating the event payloads before SDK ingest. This subject isolates the substrate's measurement latency from the agent's natural variability.

Atlas reference agent. The public Armalo L4 reference agent (agent ID 76cf31d6-ffe3-4a5c-8748-021114aa8066). Atlas's seeded behavioral record contains a deliberate drift in one dimension (parameter-binding violation in session three, dimension: pact compliance / scope-honesty-adjacent). We use Atlas's existing drift as the reference for the composite's response to a single dimension-specific perturbation.

3.2 Protocol per dimension

For each of the twelve dimensions:

1. Baseline. Record the agent's composite score and the dimension's value under nominal behavior. Sustain for one hour to establish stability.
2. Perturbation onset. Begin emitting telemetry events whose effect on the target dimension is degrading. For deterministic-source dimensions (latency, reliability, runtime-compliance, harness-stability), this is a direct shift in the event values. For jury-source dimensions (accuracy, safety), this is a sustained sequence of responses that the jury would score below the dimension's threshold. For statistical-source dimensions (cost-efficiency, scope-honesty), this is a sustained skew in the relevant ratio.
3. Detection measurement. Sample the composite score and the dimension's value at 1, 5, 15, 60, 240, 1440, 10080 minutes after perturbation onset. Record the first sample at which the dimension's value drops below the 95% confidence interval of the baseline.
4. Perturbation cessation. After 24 hours of sustained degradation (or earlier if the substrate has clearly detected), restore the agent to nominal behavior.
5. Recovery measurement. Sample the composite and the dimension at the same cadence as detection. Record the first sample at which the dimension's value returns within the baseline confidence interval.

3.3 Outcome measures

Detection latency: time from onset to first sub-baseline sample.
Composite response: the magnitude of the composite shift at the time of dimension detection.
Recovery latency: time from cessation to first within-baseline sample.
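
The detection and recovery measures can be computed from sampled values roughly as below. This is a sketch under stated assumptions: the text does not say how the 95% confidence interval is constructed, so the band here is the baseline mean plus or minus 1.96 sample standard deviations, and higher dimension values are assumed to be better. All function and type names are hypothetical.

```typescript
// Sampling offsets from Section 3.2, in minutes after onset or cessation.
const SAMPLE_OFFSETS_MIN = [1, 5, 15, 60, 240, 1440, 10080];

interface BaselineBand {
  lower: number;
  upper: number;
}

interface TimedSample {
  offsetMin: number;
  value: number;
}

// Assumption: the baseline band is mean +/- 1.96 sample standard deviations.
function baselineBand(baselineValues: number[]): BaselineBand {
  if (baselineValues.length < 2) {
    throw new Error("need at least two baseline values");
  }
  const mean = baselineValues.reduce((a, b) => a + b, 0) / baselineValues.length;
  const variance =
    baselineValues.reduce((acc, v) => acc + (v - mean) * (v - mean), 0) /
    (baselineValues.length - 1);
  const halfWidth = 1.96 * Math.sqrt(variance);
  return { lower: mean - halfWidth, upper: mean + halfWidth };
}

// Detection latency: offset of the first sample below the band, or null if none.
function detectionLatencyMin(samples: TimedSample[], band: BaselineBand): number | null {
  const hit = samples.find((s) => s.value < band.lower);
  return hit ? hit.offsetMin : null;
}

// Recovery latency: offset of the first post-cessation sample back inside the band.
function recoveryLatencyMin(samples: TimedSample[], band: BaselineBand): number | null {
  const hit = samples.find((s) => s.value >= band.lower && s.value <= band.upper);
  return hit ? hit.offsetMin : null;
}
```

Because samples are taken only at the fixed offsets, the reported latency is an upper bound: a drop that registers at minute 6 is recorded at the 15-minute sample.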

4. Results

4.1 Detection latency by dimension

Dimension	Substrate measurement source	Detection latency	Composite response	Recovery latency
latency	`tool_call.latencyMs` p50/p95	5 s (one telemetry flush)	-2.4 points	5 s
reliability	`tool_call.outcome` aggregate	5 s	-3.1 points	60 s (window)
runtime_compliance	runtime trust metrics	5 s	-2.0 points	5 s
harness_stability	eval harness records

The dimensions cluster into three latency classes.

Class I (deterministic checks, detection ≤ 5 minutes). Latency, reliability, runtime-compliance, model-compliance, harness-stability. These are sourced from telemetry primitives that the substrate evaluates synchronously on ingest or within the eval cadence. Detection is essentially instantaneous from the substrate's perspective.

Class II (jury or policy cycles, detection 15 minutes – 6 hours). Accuracy, safety, security, bond. These require either jury evaluation or external confirmation (on-chain transaction for bond, policy engine cycle for security). The detection latency is the period of the supporting evaluator.

Class III (statistical windows, detection 6 hours – 24 hours). Scope-honesty, cost-efficiency, self-audit. These require averaging over multiple calls to detect a statistically significant shift; the window length is structurally part of the dimension's definition and cannot be made faster without sacrificing the statistical reliability of the dimension.
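
As an illustration, the three classes can be written as a lookup table. The sketch below takes the class membership and upper bounds from the descriptions above; the names `DIMENSION_CLASS`, `CLASS_UPPER_BOUND_MIN`, and `worstCaseDetectionMin` are hypothetical. It treats the worst-case detection latency of a set of dimensions as the largest class bound among them.

```typescript
type LatencyClass = "I" | "II" | "III";

// Class membership as listed in the three class descriptions above.
const DIMENSION_CLASS: Record<string, LatencyClass> = {
  latency: "I",
  reliability: "I",
  runtime_compliance: "I",
  model_compliance: "I",
  harness_stability: "I",
  accuracy: "II",
  safety: "II",
  security: "II",
  bond: "II",
  scope_honesty: "III",
  cost_efficiency: "III",
  self_audit: "III",
};

// Upper bound of each class's detection latency, in minutes.
const CLASS_UPPER_BOUND_MIN: Record<LatencyClass, number> = {
  I: 5,
  II: 6 * 60,
  III: 24 * 60,
};

// Worst-case detection latency across a set of dimensions: the slowest class wins.
function worstCaseDetectionMin(dimensions: string[]): number {
  let worst = 0;
  for (const dimension of dimensions) {
    const latencyClass = DIMENSION_CLASS[dimension];
    if (latencyClass === undefined) {
      throw new Error(`unknown dimension: ${dimension}`);
    }
    worst = Math.max(worst, CLASS_UPPER_BOUND_MIN[latencyClass]);
  }
  return worst;
}
```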

4.2 Composite responsiveness

The composite shifts in proportion to the weighted dimension drop. A full-degradation drift on the most-weighted dimension (accuracy, 14%) produces a -10 point composite shift at saturation, which is large enough to drop the agent below most certification tier thresholds. A full-degradation drift on a low-weighted dimension (harness-stability, 5%) produces a -3 point shift, which is meaningful but not category-changing.

The composite is therefore correctly responsive to *which* dimension drifts. An adversary attacking the safety dimension produces a larger composite drop than an adversary attacking the cost-efficiency dimension. This is the correct property for a counterparty consuming the score.
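
The proportionality can be made concrete with a small calculation. The sketch assumes the composite is strictly linear in each dimension, so a drop of d points on a dimension with weight w percent moves the composite by w × d / 100 points. Under that assumption the reported shifts imply that a saturated dimension does not fall all the way to zero; that reading is an inference from the figures, not something the measurements state directly. The function names are hypothetical.

```typescript
// Assumption: the composite is linear in each dimension.
// dimensionDrop is in points on a 0-100 scale; the result is the composite change.
function compositeShift(weightPct: number, dimensionDrop: number): number {
  return -(weightPct * dimensionDrop) / 100;
}

// Inverse: the dimension drop implied by an observed composite shift.
function impliedDimensionDrop(weightPct: number, observedShift: number): number {
  return (-observedShift * 100) / weightPct;
}

// Upper bound: a 14%-weight dimension falling from 100 to 0 costs 14 points.
console.log(compositeShift(14, 100));

// Drops implied by the reported shifts (about 71 points for accuracy, 60 for harness-stability).
console.log(impliedDimensionDrop(14, -10).toFixed(1));
console.log(impliedDimensionDrop(5, -3).toFixed(1));
```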

4.3 Recovery asymmetry

Recovery is generally slower than detection because the substrate's statistical windows include the degraded period. A 24-hour adversarial drift on cost-efficiency takes another 24 hours of nominal behavior for the rolling window to recover. The asymmetry has an operational implication: an adversary who can drift and cease produces a composite penalty that persists well beyond the active attack window.
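
A toy simulation shows where the asymmetry comes from. The sketch assumes a simple 24-hour trailing mean over hourly values and a hypothetical dimension that sits at 90 when nominal and 50 when degraded; the substrate's actual windowing may differ.

```typescript
// Trailing mean over the last `windowHours` hourly values ending at `endIndex`
// (fewer values at the very start of the series).
function trailingMean(series: number[], endIndex: number, windowHours: number): number {
  const start = Math.max(0, endIndex - windowHours + 1);
  const slice = series.slice(start, endIndex + 1);
  return slice.reduce((a, b) => a + b, 0) / slice.length;
}

// Hypothetical hourly series: 24 h nominal, 24 h degraded, then 48 h nominal.
const NOMINAL_VALUE = 90;
const DEGRADED_VALUE = 50;
const hourlyValues: number[] = [
  ...Array(24).fill(NOMINAL_VALUE),
  ...Array(24).fill(DEGRADED_VALUE),
  ...Array(48).fill(NOMINAL_VALUE),
];

// Hours of nominal behavior needed before the trailing mean is back at the nominal level.
function hoursToRecover(
  series: number[],
  cessationIndex: number,
  windowHours: number,
  nominalLevel: number,
): number | null {
  for (let i = cessationIndex; i < series.length; i++) {
    if (trailingMean(series, i, windowHours) >= nominalLevel) {
      return i - cessationIndex + 1;
    }
  }
  return null;
}

// Index 48 is the first nominal hour after the attack ends.
console.log(hoursToRecover(hourlyValues, 48, 24, NOMINAL_VALUE)); // 24
```

The trailing mean only returns to the nominal level once every degraded hour has left the window. Recovery measured against a baseline confidence band rather than the exact nominal level would register somewhat earlier.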

5. The adversary's window

The substrate's worst case is Class III dimensions: an adversary who can degrade scope-honesty, cost-efficiency, or self-audit can sustain bad behavior for up to 24 hours before the composite drops to reflect it. During that window, the agent's published trust score is misleadingly high; counterparties reading the score will continue to authorize transactions that the agent's current behavior does not warrant.

A 24-hour window is operationally meaningful. An adversary with that much runway can execute a substantial number of harmful actions before the substrate's verdict catches up. The composite's value as a transaction gate is therefore limited by its slowest dimension, which is the design failure we want to address.

6. Proposed resilience improvements

6.1 Fast-decay weights on statistical dimensions

The current composite weights are fixed. We propose adding a *decay* schedule that applies a temporary penalty to dimensions exhibiting fast change. Concretely, when the substrate detects that a dimension's value over the last 1-hour window differs from the trailing 24-hour window by more than two standard deviations, the dimension's *weight* in the composite is doubled for the next hour, then linearly decays back to the canonical weight over 24 hours.

The effect is to amplify the composite's response to fast changes regardless of dimension class. A scope-honesty degradation that statistically registers at 6 hours under the canonical weight registers a composite drop in roughly 1 hour under the fast-decay schedule, because the dimension's weight is doubled at the moment the substrate detects the rapid swing.

Implementation cost: minimal. The substrate already computes per-dimension trailing windows; the proposed change is to read the recent slope and apply a weight multiplier. Drawback: composite values are less stable across hours during normal operation. The drawback is acceptable because counterparties read the composite at transaction time, not as a stable monthly summary.
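
A minimal sketch of the proposed schedule follows. It makes three assumptions the text leaves open: the two-standard-deviation test uses the standard deviation of the trailing 24-hour window, the 24-hour linear decay starts after the one-hour doubling ends, and the other weights are not renormalized while a multiplier is active. All names are hypothetical.

```typescript
interface WindowStats {
  mean1h: number;  // mean of the dimension over the last hour
  mean24h: number; // mean over the trailing 24 hours
  std24h: number;  // assumption: standard deviation of the trailing 24-hour window
}

// Trigger: the 1-hour mean differs from the 24-hour mean by more than two standard deviations.
function isRapidSwing(stats: WindowStats): boolean {
  return Math.abs(stats.mean1h - stats.mean24h) > 2 * stats.std24h;
}

// Multiplier as a function of hours since the trigger:
// 2x for the first hour, then linear decay to 1x over the following 24 hours.
function weightMultiplier(hoursSinceTrigger: number): number {
  if (hoursSinceTrigger < 0) return 1;
  if (hoursSinceTrigger <= 1) return 2;
  const decayProgress = (hoursSinceTrigger - 1) / 24;
  return decayProgress >= 1 ? 1 : 2 - decayProgress;
}

// Effective weight in percent; `null` means no trigger is active.
function effectiveWeight(canonicalWeightPct: number, hoursSinceTrigger: number | null): number {
  return hoursSinceTrigger === null
    ? canonicalWeightPct
    : canonicalWeightPct * weightMultiplier(hoursSinceTrigger);
}

// Example: scope-honesty (7%) right after a trigger, and 13 hours later.
console.log(effectiveWeight(7, 0));  // 14
console.log(effectiveWeight(7, 13)); // 10.5
```

Without renormalization the effective weights temporarily sum to more than 100, which only ever lowers the composite of an agent whose flagged dimension is degraded. Whether to renormalize is a design choice the proposal does not settle.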

6.2 Anomaly-detection synthetic dimension

Add a thirteenth dimension to the composite: an *anomaly* score that is 100 in the absence of detected anomalies and drops sharply when the substrate detects an unusual pattern across multiple dimensions simultaneously. Examples of cross-dimension anomalies:

Latency degrades while reliability remains nominal: suggests a slow attack path that is not yet failing.
Cost-efficiency drops while accuracy remains nominal: suggests a token-flooding pattern.
Scope-honesty drops while accuracy remains nominal: suggests the agent is producing correct outputs with miscalibrated confidence — a classic prompt-injection signature.

The anomaly dimension's weight is small (proposed: 3%) but its activation is high-signal. The dimension is computed by a lightweight classifier trained on historical drift patterns and is deliberately conservative in alerting.
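
The three example patterns can be illustrated with a rule-based stand-in. This is deliberately not the trained classifier the proposal describes: it is a hand-written sketch in which the thresholds, the drop to a score of 0 on any match, and all names are assumptions chosen for illustration.

```typescript
// Change in each dimension relative to its baseline, in points (negative = degraded).
type Deltas = Record<string, number>;

interface AnomalyRule {
  description: string;
  degraded: string; // dimension that has dropped
  nominal: string;  // dimension that has stayed near baseline
}

// The three cross-dimension patterns listed above.
const ANOMALY_RULES: AnomalyRule[] = [
  { description: "slow attack path not yet failing", degraded: "latency", nominal: "reliability" },
  { description: "token flooding", degraded: "cost_efficiency", nominal: "accuracy" },
  { description: "miscalibrated confidence", degraded: "scope_honesty", nominal: "accuracy" },
];

// Assumed thresholds: a drop of at least 10 points counts as degraded,
// a change within 2 points counts as nominal. Missing dimensions count as unchanged.
function matchAnomalies(deltas: Deltas, dropThreshold = 10, nominalBand = 2): string[] {
  return ANOMALY_RULES.filter((rule) => {
    const degradedDelta = deltas[rule.degraded] ?? 0;
    const nominalDelta = deltas[rule.nominal] ?? 0;
    return degradedDelta <= -dropThreshold && Math.abs(nominalDelta) <= nominalBand;
  }).map((rule) => rule.description);
}

// 100 when nothing matches; assumed to drop to 0 on any match.
function anomalyScore(deltas: Deltas): number {
  return matchAnomalies(deltas).length === 0 ? 100 : 0;
}
```

A classifier would replace `matchAnomalies` with a learned decision boundary, but the inputs (per-dimension deltas) and the output (a score that is 100 when nothing unusual is seen) would look much the same.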

6.3 Adversarial-mode dimension

Add a fourteenth dimension that monitors the agent's behavior under deliberate provocation. The substrate periodically injects adversarial prompts into the agent's input distribution and measures the agent's response. The dimension's value is the agent's performance on the adversarial suite, recomputed continuously.

The dimension is structurally distinct from the other twelve: it is a *probe* dimension rather than a passive measurement. The probes are part of the substrate's red-team evaluation surface (already implemented in packages/adversarial-agent). The proposed addition is to surface the probe result as a composite dimension, so an agent's resilience to adversarial provocation is part of its published trust profile.
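
One plausible way to turn probe results into a dimension value is a pass rate over the most recent probes. The sketch below is an assumption about how such a value could be derived; it does not describe the interface of packages/adversarial-agent, and the `ProbeOutcome` shape and window size are hypothetical.

```typescript
// Hypothetical record of one injected adversarial prompt and whether the agent handled it.
interface ProbeOutcome {
  probeId: string;
  passed: boolean;
}

// Score on a 0-100 scale: share of the most recent probes the agent passed.
// Returns null when no probes have run yet, so the dimension can be left out of the composite.
function probeDimensionScore(outcomes: ProbeOutcome[], windowSize = 50): number | null {
  if (windowSize <= 0) {
    throw new Error("windowSize must be positive");
  }
  const recent = outcomes.slice(-windowSize);
  if (recent.length === 0) return null;
  const passedCount = recent.filter((o) => o.passed).length;
  return (100 * passedCount) / recent.length;
}
```

Recomputing this after each probe gives the continuous update the proposal asks for; a smaller window reacts faster to a newly compromised agent at the cost of a noisier score.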

7. Combined effect of the three improvements

We estimate the combined effect of the three proposals against the same adversarial drift protocol:

Dimension	Current detection	With fast-decay	With anomaly	With adversarial probe
Class I (latency, reliability...)	5 s	unchanged (already fast)	additional cross-signal	additional probe signal
Class II (accuracy, safety, security, bond)	15 min – 6 h	5 min – 1 h	additional cross-signal	additional probe signal
Class III (scope-honesty, cost-eff, self-audit)	6 h – 24 h	1 h – 6 h	1 h	1 h (probe activates)

The combined effect on Class III is the most important: the adversary's window shrinks from up to 24 hours to roughly 1 hour. A 1-hour window is short enough that most transaction-gating consumers will be operationally protected.

The combined effect on Class II is moderate: jury and policy cycles are bounded by their evaluators, but the anomaly dimension and the adversarial probe surface cross-signal evidence that supplements the slower direct measurement.

8. Implementation pathway

1. Phase 1: fast-decay weights. Modify packages/scoring/src/composite.ts to read the per-dimension trailing windows the substrate already computes and apply a weight multiplier when rapid change is detected. Roll forward to the Atlas reference agent first; measure impact on Atlas's existing seeded drift. Estimated effort: one engineer-week.

2. Phase 2: anomaly synthetic dimension. Train a small classifier (logistic regression with ten cross-dimension features) on historical drift patterns from the platform org's agents. Add the dimension with weight 3%; rebalance other weights to preserve the 100-point ceiling. Estimated effort: two engineer-weeks plus modest training-data assembly.

3. Phase 3: adversarial-mode dimension. Surface results from packages/adversarial-agent into the composite. Add the dimension with weight 4%; rebalance. Estimated effort: one engineer-week (the adversarial evaluator already exists).

Each phase is independently valuable. Phase 1 is the highest-impact-per-effort and should ship first.
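
The rebalancing step in Phases 2 and 3 can be sketched as below. The text says only that the other weights are rebalanced to preserve the 100-point ceiling; proportional scaling is one simple way to do that and is an assumption here, as are the names `BASE_WEIGHTS` and `addDimension`.

```typescript
// Canonical weights from Section 1, in percent.
const BASE_WEIGHTS: Record<string, number> = {
  accuracy: 14, self_audit: 9, reliability: 13, safety: 11,
  security: 8, bond: 8, latency: 8, scope_honesty: 7,
  cost_efficiency: 7, model_compliance: 5, runtime_compliance: 5, harness_stability: 5,
};

// Add a dimension at a fixed weight and scale the existing weights proportionally
// so that the total stays at 100. Assumption: proportional scaling is acceptable.
function addDimension(
  weights: Record<string, number>,
  newDimension: string,
  newWeightPct: number,
): Record<string, number> {
  if (newWeightPct <= 0 || newWeightPct >= 100) {
    throw new Error("new weight must be strictly between 0 and 100");
  }
  if (newDimension in weights) {
    throw new Error(`dimension already present: ${newDimension}`);
  }
  const currentTotal = Object.values(weights).reduce((a, b) => a + b, 0);
  const scale = (100 - newWeightPct) / currentTotal;
  const rebalanced: Record<string, number> = {};
  for (const [dimension, weightPct] of Object.entries(weights)) {
    rebalanced[dimension] = weightPct * scale;
  }
  rebalanced[newDimension] = newWeightPct;
  return rebalanced;
}

// Phase 2 then Phase 3: anomaly at 3%, adversarial probe at 4%.
const afterPhase2 = addDimension(BASE_WEIGHTS, "anomaly", 3);
const afterPhase3 = addDimension(afterPhase2, "adversarial_probe", 4);
console.log(afterPhase3);
```

Proportional scaling preserves the relative ordering of the original twelve weights. Note that applying Phase 3 after Phase 2 also scales the anomaly weight slightly below 3%, so a deployment that wants exact 3% and 4% weights would need to add both dimensions in one step.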

9. Limitations

Single-vendor substrate. The measurement is against Armalo's substrate. Other substrates would have different composite weights, different measurement cadences, and different latency profiles. The structural claim — Class III dimensions are slow — generalizes; the specific latency numbers do not.

Synthetic perturbations. The protocol applies controlled perturbations; real adversaries apply correlated, stealthy perturbations that may evade the substrate longer. The numbers here are best-case for the substrate.

Composite weight sensitivity. The current weights are calibrated to current operational priorities. As the agent economy matures, weights will shift. Detection latency results are weight-dependent; future weight changes may improve or worsen the substrate's resilience.

Adversary cost. The protocol does not model the adversary's cost to sustain a drift attack. In reality, sustaining a drift requires continuous prompt injection or model manipulation, which has cost. A bounded adversary may not be able to sustain the 24-hour Class III window, limiting the realistic worst case.

10. Conclusion

The composite trust score detects adversarial drift on a per-dimension cadence that varies from seconds (deterministic checks) to as long as 24 hours (statistical windows). The substrate is correctly responsive to which dimension drifts and recovers in proportion to its measurement window. The adversary's worst-case window is bounded by the slowest dimension, currently 24 hours on Class III dimensions, which is operationally meaningful and architecturally addressable.

Three resilience improvements collectively bound the adversary's window to under 6 hours: fast-decay weights amplify the composite's response to rapid swings, an anomaly synthetic dimension surfaces cross-dimension evidence, and an adversarial-mode probe dimension measures resilience under deliberate provocation. The improvements are independently valuable and incrementally deployable.

The L4 substrate's value as a transaction gate scales directly with the composite's responsiveness. Closing the 24-hour Class III window — by any of the three proposed mechanisms — closes the largest currently-known adversary advantage and tightens the substrate's gating properties.

11. Replication

The protocol is reproducible against any L4-compliant substrate that exposes per-dimension scores. For Armalo specifically:

1. Use a synthetic agent or instrument a test agent with the @armalo/telemetry SDK.
2. Apply the perturbation protocol from Section 3.2 to each dimension.
3. Sample the composite and the per-dimension scores via GET /api/v1/trust/{agentId} at the prescribed cadence.
4. Record the first sub-baseline sample per dimension.

The substrate's own scoring runs nightly via packages/scoring; the recomputation cadence is observable via the computedAt timestamp on the score row. Researchers replicating the study should align their measurement cadence with the substrate's recompute schedule for the slowest dimensions.
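
A sampling client for the steps above might look like the following sketch. Only the endpoint path and the `computedAt` field come from the text; the base URL is left as a parameter, the `TrustSnapshot` response shape is a guess, and authentication is not shown. The sketch relies on the global `fetch` available in recent Node.js releases and in browsers.

```typescript
// Sampling offsets from Section 3.2, in minutes after onset or cessation.
const OFFSETS_MIN = [1, 5, 15, 60, 240, 1440, 10080];

// Hypothetical response shape; only `computedAt` is named in the text.
interface TrustSnapshot {
  composite: number;
  dimensions: Record<string, number>;
  computedAt: string;
}

// Build the endpoint URL; the base URL is supplied by the caller (assumption).
function trustUrl(baseUrl: string, agentId: string): string {
  return `${baseUrl.replace(/\/+$/, "")}/api/v1/trust/${encodeURIComponent(agentId)}`;
}

// Wall-clock times at which to sample, given the onset (or cessation) time.
function sampleTimes(onset: Date): Date[] {
  return OFFSETS_MIN.map((minutes) => new Date(onset.getTime() + minutes * 60000));
}

type GetJson = (url: string) => Promise<unknown>;

// Default transport: plain GET with the global fetch.
const getJsonViaFetch: GetJson = async (url) => {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`GET ${url} failed with status ${response.status}`);
  }
  return response.json();
};

// Fetch one snapshot. The transport is injectable so the function can be tested offline.
async function sampleOnce(
  baseUrl: string,
  agentId: string,
  getJson: GetJson = getJsonViaFetch,
): Promise<TrustSnapshot> {
  const body = await getJson(trustUrl(baseUrl, agentId));
  return body as TrustSnapshot; // unchecked cast: the real schema is not documented here
}
```

Comparing `computedAt` across consecutive snapshots shows whether the score was actually recomputed between samples, which matters for the slowest dimensions: two samples with the same `computedAt` are one observation, not two.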

The Atlas reference agent's seeded drift (parameter binding violation in session three) is a single-dimension perturbation that researchers can use as a reference point. Atlas's composite reflects this drift in its pact compliance rate, which is observable at GET /api/v1/trust/76cf31d6-ffe3-4a5c-8748-021114aa8066.

References

Armalo Labs Research Team. *The L4 Layer: Cross-Org Behavioral Trust for AI Agents.* 2026-05-12.
Armalo Labs Research Team. *The TOCTOU Theorem for Agent Trust.* 2026-05-13.
Armalo Labs Research Team. *Parameter-Binding Grammar Coverage.* 2026-05-13.
Armalo Labs Research Team. *The Trust Oracle as a Cross-Org Consensus Primitive.* 2026-05-13.
Brier, G. W. *Verification of forecasts expressed in terms of probability.* Monthly Weather Review, 1950.
Hendrycks, D. et al. *Aligning AI With Shared Human Values.* ICLR 2021.
Russell, S. and Norvig, P. *Artificial Intelligence: A Modern Approach.* 4th edition, 2021.