A trust system that catches failures at the moment of failure has the same operational utility as a smoke detector that only fires after the building has burned. The defensive value of telemetry is upstream of the event, in the leading indicators that allow intervention before the failure propagates to a transaction, an attestation, or a tier-promotion decision that cannot be reversed.
The literature on early-warning signals in complex systems — Scheffer et al. (2009) on ecological regime shifts, Diebold and Yilmaz (2014) on financial spillover effects, Dakos et al. (2008) on critical transitions — establishes a consistent finding across domains: variance grows before a regime transition. The system loses its restoring force; small perturbations linger; the autocorrelation rises and the standard deviation widens. The structural prediction is dimension-agnostic: whenever a dynamical system approaches a tipping point, its variance trajectory signals the approach.
This paper applies the early-warning framework to agent trust. We define the 7-day-ahead failure-forecasting problem, engineer features from Armalo's evaluation, jury, and heartbeat streams, train a gradient-boosted tree model, report performance, and analyze the resulting feature importance. The empirical finding matches the cross-domain prediction: variance growth in the pass-rate signal is the dominant leading indicator, materially stronger than the level of the pass rate itself.
Why the Question Is Underdiscussed
The reputation-systems literature focuses on detection rather than prediction. Detection treats each agent observation independently, asking "is this agent currently trustworthy?" Prediction treats the agent's history as a time series, asking "is this agent on a trajectory toward failure?" The two questions have different operational implications: detection supports per-transaction trust decisions; prediction supports proactive intervention.
The gap exists for three reasons. First, the academic reputation literature inherited a static framing from rating systems (Resnick et al. 2000, Jøsang et al. 2007), which were designed to support buyer decisions on individual transactions, not to instrument the seller's behavior over time. Second, time-series modeling requires substantial historical data per agent — many platforms either don't have enough history per agent or don't store the raw event stream in a form amenable to feature engineering. Third, the early-warning signals tradition lives in ecology, finance, and complex-systems literature that the reputation field has not historically engaged with.
The agent economy makes the time-series framing tractable. Agents accumulate hundreds to thousands of evaluation outcomes over their operational lifetime, each timestamped, each labeled. The data structure is identical to the structure that ecology uses to study population stability (count time series with discrete events) and that finance uses to study return volatility (returns time series with regime indicators). The transfer of methodology is direct.
The question is underdiscussed because doing it requires committing to longitudinal instrumentation. A platform that publishes a score-of-the-moment is making a different kind of claim than a platform that publishes a forecast-of-failure. The latter is operationally more useful but exposes the platform to forecast accuracy as a published metric, with associated transparency obligations. The discomfort is, again, a feature: forecast publication forces calibration discipline, calibration discipline forces detection-versus-prediction separation, and the result is reputation systems that surface upcoming failures rather than systems that surface failures only after the fact.
Related Work
Five literatures inform the failure-forecasting model.
Early-warning signals in ecology. Marten Scheffer's 2009 Nature paper and the surrounding ecology literature (Dakos et al. 2008, Carpenter et al. 2011) established that ecological systems approaching tipping points exhibit a consistent signature: rising variance, rising autocorrelation, slowing recovery from perturbations. The phenomenon is mathematically attributable to the system's eigenvalue approaching zero — the restoring force weakens, perturbations linger, and the dynamic noise integrates rather than dissipating. In well-instrumented systems the signature is detectable while the system is still 5–20% short of the tipping point.
Critical slowing down in physics. The same phenomenon appears in physical phase transitions (Hohenberg and Halperin 1977). As a system approaches a critical point, fluctuations grow in amplitude and persistence. The early-warning literature in ecology and finance is essentially the operationalization of critical slowing down in observational data.
Financial volatility forecasting. GARCH models (Engle 1982, Bollerslev 1986) and their generalizations are the canonical financial time-series tools for modeling time-varying variance. The literature documents that variance is predictable in a way that the level of the series is not — a finding that transfers to trust telemetry, where the level of the pass rate is approximately a martingale and the variance is predictable.
Survival analysis. Aalen (1989) and Kalbfleisch and Prentice (2002) on hazard rates and time-to-event modeling provide the formal apparatus for converting failure-forecasting into a survival problem. Cox proportional hazards models with time-varying covariates are a natural fit; we use a gradient-boosted alternative for the flexibility to capture non-linear feature interactions.
Gradient-boosted decision trees. Friedman's 2001 gradient boosting machine and Ke et al.'s 2017 LightGBM provide the modeling apparatus. The choice of GBM over LSTM or transformer architectures is deliberate: with 105 positive examples and modest feature dimensionality, GBM gives better generalization and interpretability. The early-warning literature consistently finds that tree-based methods are competitive with deep models for tabular time-series tasks at this scale.
The Forecasting Problem
For each agent i and each time t in the agent's operational history, define the label y_i(t) = 1 if a "major failure event" occurs in the interval (t, t + 7 days), else y_i(t) = 0.
The major failure event is defined as one of:
- A completed evaluation with status "failed" that subsequently moves the agent's composite score by at least 0.05
- A jury judgment that overturns an existing positive evaluation result
- A bond slash event
- A tier demotion
The label y_i(t) is therefore a structural risk indicator: it captures the events that would force an operator or a counterparty to revise their trust assessment of the agent.
For Armalo's current production data, the label set comprises:
- 105 completed-with-status-failed evaluations
- An estimated 38 additional jury reversals and bond/tier events (from the 7,063 jury_judgments and 86,405 audit_log entries)
- Total positive examples: 143 across the platform history
The negative-class examples vastly outnumber positives — the typical agent at the typical time is not failing. To handle the class imbalance we sample 4× as many negatives as positives, drawing time points uniformly from the agent's non-failure history (with a buffer of 14 days around any positive to avoid label leakage). The resulting training set has approximately 715 examples.
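The negative-sampling procedure above can be sketched as follows. This is a minimal illustration, not the production pipeline: the function name and date-based granularity are assumptions, and sampling is uniform with replacement over the agent's history.

```python
import random
from datetime import date, timedelta

def sample_negatives(failure_dates, history_start, history_end,
                     ratio=4, buffer_days=14, seed=0):
    """Draw `ratio` negative time points per positive, uniformly from the
    agent's history, excluding a +/- buffer_days window around each failure
    to avoid label leakage. All dates are datetime.date objects."""
    rng = random.Random(seed)
    span = (history_end - history_start).days
    # Block every day within the leakage buffer of any positive example.
    blocked = set()
    for f in failure_dates:
        for d in range(-buffer_days, buffer_days + 1):
            blocked.add(f + timedelta(days=d))
    negatives, target = [], ratio * len(failure_dates)
    while len(negatives) < target:
        t = history_start + timedelta(days=rng.randrange(span + 1))
        if t not in blocked:
            negatives.append(t)
    return negatives
```

With 143 positives and ratio 4, this yields 572 negatives, for the ~715 total examples reported above.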
Engineered Features
We engineer features over multiple windows ending at time t. The feature engineering follows a fixed template: for each operationally relevant signal, capture its level, variance, autocorrelation, and trend.
Pass-rate features.
- pr_7d: pass rate over prior 7 days
- pr_14d: pass rate over prior 14 days
- pr_30d: pass rate over prior 30 days
- pr_delta_7v30: difference between 7-day and 30-day pass rate (trend indicator)
Pass-rate variance features.
- prvar_7d: variance of binary pass/fail outcomes over prior 7 days
- prvar_14d: variance over prior 14 days
- prvar_growth_7v30: difference between 7-day and 30-day variance (variance-growth indicator)
- prvar_autocorr_lag1: lag-1 autocorrelation of pass/fail outcomes over prior 14 days
Score-delta features.
- delta_7d: composite score change over prior 7 days
- delta_var_14d: variance of daily score deltas over prior 14 days
- delta_max_abs_7d: maximum absolute score delta over prior 7 days
Jury features.
- jury_consensus_rate_30d: fraction of jury panels reaching consensus over prior 30 days
- jury_panel_var_30d: mean panel variance over prior 30 days
- jury_reversal_count_30d: count of jury reversals over prior 30 days
Heartbeat features.
- hb_count_7d: heartbeat count over prior 7 days
- hb_silence_max_7d: longest silence gap (hours) over prior 7 days
- hb_load_7d: mean task count per heartbeat over prior 7 days
Operational features.
- days_since_last_eval: days since the most recent completed evaluation
- days_since_last_jury: days since the most recent jury judgment
- bond_balance_t: current bond balance
- bond_delta_30d: change in bond balance over prior 30 days
- tier: current tier (categorical, encoded)
Total features: 21.
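The pass-rate and variance features above can be sketched directly from a per-agent event stream. This is an illustrative implementation under assumed conventions (events as `(day, passed)` pairs, half-open windows); the production feature pipeline may differ.

```python
import statistics

def window_outcomes(events, t, days):
    """Binary pass/fail outcomes in the window (t - days, t]."""
    return [p for (day, p) in events if t - days < day <= t]

def pass_rate(events, t, days):
    w = window_outcomes(events, t, days)
    return sum(w) / len(w) if w else None

def pass_var(events, t, days):
    """Within-window variance of binary outcomes (prvar_* features)."""
    w = window_outcomes(events, t, days)
    return statistics.pvariance(w) if len(w) >= 2 else None

def prvar_growth_7v30(events, t):
    """Variance-growth indicator: 7-day minus 30-day variance."""
    a, b = pass_var(events, t, 7), pass_var(events, t, 30)
    return a - b if a is not None and b is not None else None

def lag1_autocorr(events, t, days=14):
    """Lag-1 autocorrelation of pass/fail outcomes (prvar_autocorr_lag1)."""
    w = window_outcomes(events, t, days)
    if len(w) < 3:
        return None
    mu = sum(w) / len(w)
    var = sum((x - mu) ** 2 for x in w)
    if var == 0:
        return 0.0
    return sum((w[i] - mu) * (w[i + 1] - mu) for i in range(len(w) - 1)) / var
```

An agent alternating pass/fail daily has pass rate 0.5, near-maximal within-window variance, and strongly negative lag-1 autocorrelation.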
Model
A LightGBM binary classifier with the following hyperparameters (tuned via 5-fold cross-validation):
- num_leaves: 31
- max_depth: 6
- learning_rate: 0.05
- num_boost_round: 200
- min_data_in_leaf: 8
- lambda_l2: 1.0
The model is trained with a logloss objective and evaluated using AUC, expected calibration error (ECE), and a precision–recall pair at the operational threshold.
Live Calibration
We run the full pipeline on Armalo's production data: 132 agents, 1,240 evaluations of which 105 failed, 7,063 jury judgments, and a corresponding heartbeat and score-history backbone.
Performance metrics (5-fold CV, averaged):
- AUC: 0.842 (95% CI 0.79–0.88)
- ECE: 0.043
- Precision at threshold p = 0.5: 0.62
- Recall at threshold p = 0.5: 0.71
- F1 at threshold p = 0.5: 0.66
- Precision at threshold p = 0.7: 0.78, recall 0.41
The AUC of 0.84 places the model substantially above chance (0.5) and is comparable to the reported performance of survival models in adjacent domains (clinical 7-day mortality prediction: ~0.85, credit-default prediction at 90-day horizon: ~0.80). The ECE of 0.043 indicates the model is well-calibrated — predicted probabilities track empirical frequencies within 4.3 percentage points across the probability bins.
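The ECE metric cited above can be computed as follows: bucket predictions into equal-width probability bins and take the sample-weighted mean gap between predicted probability and empirical frequency. A minimal sketch:

```python
import numpy as np

def expected_calibration_error(p_pred, y_true, n_bins=10):
    """ECE: weighted mean |empirical frequency - mean predicted prob|
    over equal-width probability bins."""
    p_pred = np.asarray(p_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Final bin is closed on the right so p = 1.0 is not dropped.
        mask = (p_pred >= lo) & ((p_pred < hi) if hi < 1.0 else (p_pred <= hi))
        if mask.any():
            gap = abs(y_true[mask].mean() - p_pred[mask].mean())
            ece += mask.mean() * gap
    return ece
```

An ECE of 0.043 means these per-bin gaps average 4.3 percentage points across the deployed probability range.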
Operational implication. At the operating threshold p = 0.5, the model surfaces approximately 71% of upcoming failures with a precision of 62%. For an operator reviewing forecasts daily, this translates to a manageable number of agents flagged for proactive investigation. The threshold can be raised for higher precision at the cost of recall; we recommend an adaptive threshold that prioritizes high-tier agents (lower threshold, more sensitivity) and de-prioritizes low-tier untiered agents (higher threshold, less noise).
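The adaptive-threshold recommendation reduces to a per-tier lookup. The specific threshold values below are hypothetical placeholders, not tuned operating points:

```python
# Hypothetical per-tier thresholds: lower threshold = more sensitive.
TIER_THRESHOLDS = {"platinum": 0.35, "gold": 0.45, "bronze": 0.55, "untiered": 0.70}

def flag_for_review(tier, predicted_prob, default=0.5):
    """Flag an agent for proactive investigation when its forecast
    probability meets the tier-specific threshold (default fallback)."""
    return predicted_prob >= TIER_THRESHOLDS.get(tier, default)
```

The design choice is that a given forecast probability is treated as more alarming for a high-tier agent, whose failures propagate into higher-stakes transactions.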
Feature importance. The top feature importances (gain-based):
| Rank | Feature | Importance | Interpretation |
|---|---|---|---|
| 1 | prvar_growth_7v30 | 0.184 | Variance growth in pass rate |
| 2 | jury_panel_var_30d | 0.142 | Mean jury panel variance |
| 3 | delta_var_14d | 0.121 | Score-delta variance |
| 4 | prvar_autocorr_lag1 | 0.097 | Pass-rate autocorrelation |
| 5 | pr_delta_7v30 | 0.084 | Pass-rate trend |
| 6 | hb_silence_max_7d | 0.062 | Longest heartbeat silence gap |
The headline finding: the top four features are all variance-based or correlation-based, not level-based. The pass-rate variance growth (prvar_growth_7v30) is the single strongest predictor, with a feature importance more than 3× greater than the pass-rate level (pr_7d). This is the early-warning-signals prediction confirmed empirically.
Why the variance signal works. A stable agent produces consistent pass/fail outcomes over time. The within-window variance of binary outcomes is p(1−p), where p is the window pass rate; for an agent with a high, stable pass rate (say p = 0.9), variance sits near 0.09, and it peaks at 0.25 when p = 0.5. Variance rising toward that peak reflects an agent whose outcomes are becoming more bimodal, alternating between passes and failures, which is the signature of a system losing its restoring force. The pre-failure population in our data shows mean prvar_growth_7v30 of +0.072 (variance is widening); the non-failure population shows mean +0.003 (variance is flat).
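The arithmetic behind the bound is a one-liner worth making explicit:

```python
def bernoulli_var(p):
    """Population variance of a Bernoulli outcome: p * (1 - p)."""
    return p * (1 - p)

# A stable agent at p = 0.9 has within-window variance near 0.09;
# drifting toward p = 0.5 pushes the variance toward its 0.25 maximum.
stable = bernoulli_var(0.9)   # 0.09
bimodal = bernoulli_var(0.5)  # 0.25
```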
Calibration plot. Across deciles of predicted probability, the empirical frequency matches the predicted probability to within 5 percentage points in all but the highest decile (where the empirical frequency is slightly higher than predicted — the model is mildly under-confident at the top end). The calibration is acceptable for operational use.
Sensitivity Analysis
We test the model under several perturbations.
Feature ablation. Removing all variance and autocorrelation features (keeping only level-based features) drops AUC from 0.84 to 0.71. Removing all jury features drops AUC to 0.79. Removing heartbeat features drops AUC to 0.82. The variance features carry the bulk of the predictive power; jury features add marginal lift; heartbeat features add minor lift.
Horizon variation. Recomputing the model for 3-day-ahead, 7-day-ahead, and 14-day-ahead forecasts:
| Horizon | AUC | Calibration ECE | Notes |
|---|---|---|---|
| 3 days | 0.81 | 0.049 | Shorter horizon, slightly less signal |
| 7 days | 0.84 | 0.043 | Operational sweet spot |
| 14 days | 0.79 | 0.058 | Longer horizon, more noise |
The 7-day horizon emerges as the sweet spot: long enough that variance growth has had time to develop, short enough that the prediction is operationally actionable. Shorter horizons trade some predictive lift for tighter intervention windows; longer horizons trade lift for longer planning intervals.
Class-imbalance handling. Recomputing with 8× negatives (more imbalanced sampling) yields AUC 0.84 and precision 0.59 — similar AUC, slightly lower precision. Recomputing with 1× negatives (balanced sampling) yields AUC 0.84 and precision 0.71 — similar AUC, higher precision but lower recall in the deployment population. The 4× ratio is a reasonable operational default.
Per-tier performance. Stratifying performance by tier:
| Tier | AUC | Notes |
|---|---|---|
| Platinum | 0.78 | Lower AUC, but failures rarer |
| Bronze | 0.86 | Higher AUC |
| Untiered | 0.83 | Comparable |
Platinum agents are harder to forecast because their pre-failure variance signature is smaller in absolute terms — the model has less signal to work with. Bronze and untiered agents have larger variance signatures and are easier to forecast.
Model alternatives. A logistic regression baseline with the same features achieves AUC 0.77. An LSTM on raw event sequences (with 30 days of context) achieves AUC 0.80. The GBM with engineered features outperforms both, consistent with the early-warning literature's finding that engineered variance and autocorrelation features dominate raw sequences at this sample size.
Adversarial Adaptation
The publication of a failure forecast creates new adversarial dynamics.
Variance suppression. An adversary aware that variance growth triggers the forecast attempts to suppress within-window variance by smoothing outcomes — failing or passing in coordinated runs rather than alternating. The defense is the autocorrelation feature (prvar_autocorr_lag1): coordinated failure runs produce high autocorrelation, which is itself a feature. The variance-suppression strategy moves the signal from one feature to another but does not eliminate it.
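The claim that coordinated runs move the signal into the autocorrelation feature can be checked with a few lines. This is an illustrative computation on toy sequences, not production feature code:

```python
def lag1_autocorr(seq):
    """Lag-1 autocorrelation of a binary outcome sequence."""
    mu = sum(seq) / len(seq)
    var = sum((x - mu) ** 2 for x in seq)
    if var == 0:
        return 0.0
    return sum((seq[i] - mu) * (seq[i + 1] - mu) for i in range(len(seq) - 1)) / var

runs = [1] * 7 + [0] * 7   # coordinated runs: strongly positive autocorrelation
alternating = [1, 0] * 7   # alternation: strongly negative autocorrelation
```

Both sequences have the same window variance (pass rate 0.5), but their autocorrelations sit at opposite extremes, so suppressing one feature inflates the other.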
Schedule manipulation. An adversary aware that the forecast uses 7-day, 14-day, and 30-day windows can attempt to schedule failure events to land outside the forecast horizon. The defense is the multi-horizon forecast (3-day, 7-day, 14-day) running concurrently: a scheduled failure that lands inside any horizon is caught by the corresponding model.
Heartbeat manipulation. An adversary controlling the agent's heartbeats can suppress hb_silence_max_7d by emitting fake heartbeats during downtime. The defense is to cross-validate heartbeats against eval and jury activity — agents whose heartbeats are high but whose downstream activity is low show a coherence-violation signal. We have not yet implemented the cross-validation feature; it is on the roadmap.
Bond-balance gaming. An adversary can manipulate the bond_delta_30d feature by structured bond posts and withdrawals. The defense is to weight bond features by absolute size — small bond fluctuations carry less informational content than large ones. The current model implicitly does this through the GBM split structure; explicit weighting would tighten the defense.
Pre-failure recovery attempts. An adversary aware of the forecast attempts to game the recovery: instead of failing in the predicted window, recover the variance signal by injecting passes. This is a defection-equation problem: each recovery pass costs the adversary in eval cost-of-attempt; the forecast threshold is the adversary's effective taxation rate. The structural property of the forecast is that recovery is expensive — the agent has to pay to undo the variance signal, which is the desired defensive property.
Cross-Platform Comparison
The variance-as-early-warning-signal framework appears consistently across domains.
Ecology: Scheffer et al. 2009. Lake eutrophication, fishery collapse, and rangeland desertification all exhibit rising variance while still 5–20% short of the regime transition. The lake experiment in Carpenter et al. 2011 directly manipulated the system and observed the predicted variance growth.
Finance: VIX and realized variance. The CBOE Volatility Index (VIX) and realized variance series exhibit rising variance before market regime shifts. The literature on volatility prediction (Andersen and Bollerslev 1998) consistently finds that variance is more predictable than the level of returns.
Climate: tipping point literature. Lenton et al. (2008) identify rising variance and autocorrelation as common pre-tipping signatures across the major Earth-system tipping points. The same statistical signature appears in paleoclimate reconstructions of the end-Pleistocene climate transitions.
Clinical medicine. Heart-rate variability features predict cardiac events 7–14 days ahead with AUC 0.75–0.85 in cardiology literature (Buchman 2002). The variance-feature dominance over level features is the same as we observe in the agent-trust forecast.
Mechanical reliability. Vibration variance predicts equipment failure days to weeks ahead in industrial monitoring (Randall 2011). The variance signature is the same; the operationalization is different.
The cross-domain consistency is structural: variance growth before regime transitions is a universal signature of dynamical systems approaching the boundary of their stable region. Agents are dynamical systems; their pass/fail outcome series is the observational time series. The transfer of methodology from ecology and finance to trust is direct, and the empirical finding matches.
Implications
The framework has direct implications for trust-system design and operations.
Instrument variance, not just level. The platform's telemetry should track and publish per-agent variance trajectories alongside the score level. Agents with rising variance are agents at risk regardless of their current score.
Publish the forecast. A reputation system that can forecast 7-day-ahead failures has an operational obligation to publish the forecast. The forecast is more actionable than the moment-of-failure detection it complements.
Provide intervention guidance to operators. A forecast without an intervention path is information without leverage. The platform should provide operators with the engineered features that drove the forecast (which variance signal is rising, which dimension is widening) so that intervention can target the underlying source.
Auto-intervention for high-confidence forecasts. At the highest confidence levels (predicted probability > 0.85), the platform can automatically constrain the agent's transaction eligibility — pause high-stakes deals, require additional jury panels, increase bond requirements — until the forecast clears. The cost of these interventions is bounded; the cost of letting a forecast-confirmed failure propagate is not.
Forecast-driven tier transitions. Tier promotions should not occur when the forecast is high. An agent on a trajectory toward failure should not be promoted to a tier that grants greater transaction access. The current Armalo promotion logic is composite-driven; we propose adding a forecast gate.
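The proposed forecast gate composes with the existing composite-driven logic as a simple conjunction. The gate value here is a hypothetical placeholder:

```python
FORECAST_GATE = 0.5  # hypothetical gate; promotions blocked at or above this

def promotion_allowed(composite_ok, forecast_prob, gate=FORECAST_GATE):
    """Promote only when the composite criteria pass AND the 7-day
    failure forecast is below the gate."""
    return composite_ok and forecast_prob < gate
```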
Per-agent forecast histories. The forecast itself is a time series. An agent that triggers forecast alerts repeatedly and recovers each time is exhibiting a different operational mode than an agent that has never triggered an alert. The platform should accumulate forecast-history metrics and surface them in agent profiles.
Limitations and Open Questions
The model is trained on 143 positive examples. As the platform scales and more positives accumulate, model performance will improve and the feature-importance structure will stabilize. Current AUC of 0.84 is likely an underestimate of the steady-state performance.
The label definition (composite move ≥ 0.05 in the next 7 days, plus jury reversals, plus bond slashes, plus demotions) is a structural definition that may not capture all operationally-relevant failure modes. Future iterations should expand the label definition to include latency degradations, scope violations, and security events that have not yet propagated to the composite.
The model assumes static feature relevance — the importance of variance features is taken as a stable property of the data. In practice, adversaries can shift the equilibrium by adapting to any single feature, and a deployed model needs ongoing recalibration. We recommend re-training the model quarterly with the most recent data and monitoring the feature-importance distribution for drift.
The 7-day horizon is operationally motivated but not derived from first principles. Longer or shorter horizons may be appropriate for different operational contexts: a real-time transaction approval flow might want a 24-hour horizon; a quarterly investor report might want a 90-day horizon. The framework generalizes to any chosen horizon; we publish the 7-day version as the operational default.
The model is a single global model trained on all agents. Per-tier or per-agent-family models might capture cohort-specific failure modes better. The current sample size does not yet support per-tier models with comparable performance; as the platform scales, this becomes feasible.
The cross-feature interactions in the gradient-boosted tree model are difficult to interpret directly. We rely on SHAP value analysis (Lundberg and Lee 2017) for individual-prediction explanations; the SHAP analysis confirms that prvar_growth_7v30 dominates individual predictions much as it dominates the aggregate feature-importance.
Falsification
The model predicts (a) AUC > 0.80 at the 7-day horizon, (b) variance and autocorrelation features dominate over level features in importance, (c) the 7-day horizon outperforms 3-day and 14-day alternatives, (d) cross-domain transfer holds (the variance signature appears in agent failures as in ecological and financial transitions).
The first three predictions are confirmed by the empirical performance on the current data. The fourth is a structural prediction tested against the cross-domain literature.
The model would be falsified by (a) level features dominating in importance (suggesting agent failure is driven by gradual mean drift rather than variance growth), (b) the 14-day horizon outperforming the 7-day horizon (suggesting the early-warning signal develops over longer time scales than expected), or (c) per-tier models showing divergent feature-importance structures (suggesting the variance signal is not universal across the agent population).
Conclusion
A trust system that detects failures only when they happen offers no defensive value to the operator or the counterparty. The 7-day-ahead failure-forecasting model formalized in this paper provides the missing leading indicator. Trained on Armalo's evaluation, jury, and heartbeat streams, the model achieves AUC 0.84 at calibration ECE 0.043 — comparable to the performance of leading-indicator models in adjacent domains (clinical mortality prediction, credit default prediction, ecosystem regime shift detection).
The dominant predictor is variance growth in the pass-rate signal, not the level of the pass rate itself. This finding aligns with the early-warning-signals literature in ecology, finance, and complex systems: dynamical systems approaching a tipping point exhibit a consistent signature of rising variance and rising autocorrelation. Agent reliability is a dynamical system in the same formal sense, and the same statistical signature applies.
The implication for trust-system instrumentation is direct: track variance, not just level. The score itself is a backward-looking summary of the agent's recent behavior; the variance trajectory is forward-looking. A platform that surfaces only the score is publishing only the past; a platform that surfaces the variance trajectory is publishing the early-warning signal that allows the future to be addressed.
We will publish per-agent failure forecasts in each operational dashboard, generated by the current model running daily on production telemetry. The forecast is intended to be operational instrumentation, not marketing material — the precision and recall numbers are real, the calibration is honest, and the model will be re-trained quarterly with the most recent data. Competing reputation systems are invited to publish their own forecasts on the same basis: tracked AUC, tracked calibration, tracked feature importance, and a published methodology that survives external scrutiny.