A reputation system's most expensive mistake is to let a manipulated score pass undetected. The manipulated score becomes the basis for the next transaction, which becomes the basis for the next attestation, which becomes the basis for the next promotion. Detection failures propagate. The asymmetry between detection cost (one investigation) and miss cost (cascading downstream commitments to a compromised agent) is the structural fact that should drive threshold selection in a trust-system anomaly detector.
The cross-domain literature provides a calibrated answer. In particle physics, where a false positive also propagates (a falsely claimed discovery becomes the foundation of subsequent experiments and theoretical work), the convention is 5σ — a 1-in-3.5-million false-positive rate per observation under a Gaussian assumption. In network intrusion detection, where missed events are typically self-limiting (a missed intrusion is detected by another mechanism, so its cost is bounded), the convention is 2–3σ. Financial risk alerting splits the difference at 3σ. The choice of threshold reflects the relative cost of false negatives versus false positives in each domain.
This paper argues that trust-manipulation events have the cost structure of physics events, not intrusion events. The downstream propagation, the difficulty of post-hoc remediation, and the low base rate of manipulation in well-designed systems all push the optimal threshold toward 5σ. We then build the 5σ detector, calibrate it on Armalo's 1,753 score_history entries, derive the analytic false-positive rate including fat-tail corrections, and compare against the lower thresholds.
Why the Question Is Underdiscussed
Threshold selection in anomaly detection is treated, in most operational systems, as a tuning parameter to be calibrated empirically — set the threshold high enough that the operator is not overwhelmed by false positives, low enough that real events are caught. This empirical calibration drifts toward the threshold the operator's attention budget can sustain, which in practice means 2σ to 3σ depending on the volume of observations.
The empirical-calibration approach fails when the cost structure is asymmetric. Operators do not perceive the downstream cost of misses — they perceive only the immediate cost of the alert. A system that floods operators with 3σ alerts gets tuned upward not because the cost analysis warrants it but because operators ignore it. A system with very low alert volume gets tuned downward to "catch more" without quantifying the resulting false-positive load.
The physics tradition addresses this by deriving the threshold from the joint distribution of cost and frequency. The 5σ standard for discovery claims is the threshold at which the expected number of false discoveries across all of physics in a year is small enough that subsequent theoretical work has acceptable foundations. The threshold is justified globally — across all experiments — not locally. Trust systems should adopt the same discipline: derive the threshold from the cost of letting a manipulation propagate, not from the operator's local attention budget.
The question is underdiscussed because the propagation-cost analysis requires acknowledging that the trust system's outputs have economic consequences that the operator's UI does not surface. Operators see alerts; they do not see the transactions, deals, and attestations that would have been blocked had the alerts been triggered. Without the global perspective, threshold selection drifts toward the local-attention optimum, which is too low.
Related Work
Five literatures inform the 5σ detector.
Statistical process control. Walter Shewhart's 1924 control charts established the discipline of monitoring industrial processes by triggering alerts when measurements exceed a chosen multiple of the historical standard deviation. The Shewhart 3σ default became the industry convention. Statistical process control was developed in a context where the cost of a missed defect was bounded (a single bad product) and the cost of a false alarm was meaningful (production stoppage). The 3σ threshold reflects that cost balance.
Particle physics 5σ standard. The high-energy physics community adopted 5σ as the threshold for claiming a new particle discovery, a convention refined through the Higgs searches of the 1990s and early 2000s. Cowan, Cranmer, Gross, and Vitells (2011) formalized the statistical framework. The rationale is that a false discovery that becomes the foundation of subsequent theoretical work imposes a cost the community cannot easily recover from, while a missed signal merely delays a discovery that further data will confirm. The asymmetry argues for the high threshold.
Network intrusion detection. Snort, Suricata, and the academic IDS literature (Bace and Mell 2001, Garcia-Teodoro et al. 2009) use anomaly thresholds of 2σ to 3σ. The rationale is that intrusions are common enough that low thresholds are operationally tractable and that missed events are typically detected through correlation across multiple sensors — the cost of any single missed alert is bounded.
Financial alpha and risk alerting. Sharpe (1966), Markowitz (1959), and subsequent finance literature use 3σ as the threshold for risk alerts and 2σ for early-warning systems. The asymmetric cost structure in finance — missed signals can be expensive, but liquidity allows post-hoc remediation — places the threshold between physics and IDS.
Tukey's outlier detection. John Tukey's 1977 definition of outliers as observations beyond Q1 − 1.5·IQR or Q3 + 1.5·IQR (approximately equivalent to 2.7σ under Gaussian assumption) is a robust nonparametric alternative to multiplicative-σ thresholds. Tukey's threshold is appropriate for exploratory data analysis but not for triggering production alerts.
The synthesis: trust manipulation events match the physics cost structure (rare, high downstream cost, low base rate) and should adopt the physics threshold.
The Detector
Let s_i(t) denote agent i's composite score at time t. The score updates incrementally as new evaluations, judgments, and bond events complete. Define the score delta Δs_i(t) = s_i(t) − s_i(t − 1). The detector monitors Δs.
For each agent i, maintain a rolling 30-day baseline of the standard deviation σ_i of Δs values. Compute the z-score:
z_i(t) = (Δs_i(t) − μ_i) / σ_i

where μ_i is the rolling 30-day mean of Δs (typically close to zero for stable agents).
Alert condition. Trigger an investigation when |z_i(t)| ≥ 5.
The detector flags both directions: a sudden jump in score (potential upward manipulation) and a sudden drop (potential adversarial event or system fault). The sign of the deviation is informative — sudden upward jumps point to gaming; sudden downward jumps point to either legitimate failure cascades or coordinated adversarial events.
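The detector above can be sketched in a few lines. This is an illustrative implementation, not Armalo's production code: the class and method names are invented for the example, and it uses the MAD-based robust σ recommended later in the paper, with the median as the robust center.

```python
from collections import deque
from statistics import median

MAD_SCALE = 1.4826  # scales MAD to match Gaussian sigma
WINDOW = 30         # rolling baseline length (30 daily deltas)
THRESHOLD = 5.0     # the 5-sigma alert condition

class ScoreAnomalyDetector:
    """Single-agent, single-step 5-sigma detector on score deltas (sketch)."""

    def __init__(self, window=WINDOW, threshold=THRESHOLD):
        self.deltas = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, delta):
        """Return (z, alert) for a new score delta, then fold it into the baseline."""
        if len(self.deltas) < self.deltas.maxlen:
            self.deltas.append(delta)
            return None, False  # baseline not yet established
        mu = median(self.deltas)                        # robust center
        mad = median(abs(d - mu) for d in self.deltas)  # median absolute deviation
        sigma = MAD_SCALE * mad
        if sigma == 0:
            self.deltas.append(delta)
            return None, False  # degenerate baseline; cannot standardize
        z = (delta - mu) / sigma
        self.deltas.append(delta)  # new delta enters the rolling baseline
        return z, abs(z) >= self.threshold
```

The sketch returns no verdict until the 30-observation baseline fills, matching the cold-start policy described below; a production detector would also have to decide whether alerting deltas are allowed to enter the baseline at all.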
False-Positive Rate
Under a Gaussian assumption on Δs (which is approximately valid for high-tenure agents with substantial operational history), the probability of |z| ≥ 5 on a single observation is:
P(|z| ≥ 5) = 2 · (1 − Φ(5)) ≈ 5.73 × 10⁻⁷

where Φ is the standard normal CDF. This is approximately 1 false alert per 3.49 million observations per direction, or 1 per 1.75 million observations bidirectional.
For comparison:
- 3σ: P ≈ 2.7 × 10⁻³ → 1 per 370 observations
- 4σ: P ≈ 6.3 × 10⁻⁵ → 1 per 15,800 observations
- 5σ: P ≈ 5.7 × 10⁻⁷ → 1 per 1.75 million observations
- 6σ: P ≈ 2.0 × 10⁻⁹ → 1 per 506 million observations
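The tail probabilities above follow from the identity P(|Z| ≥ k) = erfc(k/√2), so they can be checked without any statistics library:

```python
import math

def two_sided_p(k):
    """P(|Z| >= k) for a standard normal Z, via the complementary error function."""
    return math.erfc(k / math.sqrt(2))

for k in (3, 4, 5, 6):
    p = two_sided_p(k)
    print(f"{k} sigma: P = {p:.2e}  (1 per {1 / p:,.0f} observations)")
```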
The 5σ threshold sits at a defensible point: rare enough that false positives do not flood the operator queue, common enough that real events register. The 1-per-1.75-million rate corresponds to approximately 1 false alert per agent per 4,800 years of operational history at one score update per day — well below the rate of legitimate investigations driven by actual manipulation attempts.
Fat-Tail Correction
The Gaussian assumption is approximate. Score deltas on real platforms exhibit fat tails — extreme deltas occur more often than the Gaussian predicts. We measure the empirical kurtosis of Δs across Armalo's score_history entries and find excess kurtosis of approximately 4.2 (kurtosis ≈ 7.2, versus the Gaussian value of 3). The fat-tailed distribution is reasonably well fit by a generalized Pareto distribution in the tails.
Under the empirical fat-tailed distribution, the true false-positive rate at z = 5 is approximately 8 × 10⁻⁶ — roughly 14× higher than the Gaussian prediction but still very low in absolute terms. The fat-tail correction does not change the operational picture: 5σ remains tractable, while 3σ remains intractable.
Operationally, the platform should compute σ_i using a robust estimator (median absolute deviation, scaled by 1.4826 to match Gaussian σ) rather than the sample standard deviation, to prevent a single legitimate-but-large delta from inflating the baseline and masking subsequent manipulation.
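The rationale for the robust estimator can be seen in a small example (illustrative numbers only): a single large legitimate delta inflates the sample standard deviation several-fold while barely moving the MAD-based estimate.

```python
import statistics

MAD_SCALE = 1.4826  # scales MAD to the Gaussian sigma

def robust_sigma(xs):
    """MAD-based sigma estimate, scaled to match Gaussian sigma."""
    m = statistics.median(xs)
    return MAD_SCALE * statistics.median([abs(x - m) for x in xs])

# 29 quiet days of small deltas plus one legitimate large bond-driven delta
deltas = [0.01 * (-1) ** i for i in range(29)] + [0.5]

sample_sigma = statistics.stdev(deltas)  # inflated by the single outlier
mad_sigma = robust_sigma(deltas)         # barely moves
```

With these numbers the sample estimate comes out roughly six times larger than the robust one — exactly the headroom a subsequent manipulation would hide in.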
Baseline Window Selection
The 30-day rolling baseline is chosen to balance freshness (recent operating conditions reflected) and stability (sufficient sample size for robust σ estimation). Shorter windows (7 days) produce noisier baselines and increased false-positive rates; longer windows (90 days) lag in adapting to legitimate changes in agent behavior and decrease true-positive sensitivity.
For agents with less than 30 days of history, the detector uses a platform-wide pooled estimate of σ until the agent accumulates sufficient personal history. Agents in the first 7 days of operation are excluded from the detector — their behavior is bootstrap behavior and not amenable to anomaly detection.
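The cold-start policy reduces to a small branching rule. A hedged sketch, with invented names (`pooled_sigma` would come from the platform-wide estimate):

```python
import statistics

MAD_SCALE = 1.4826  # scales MAD to the Gaussian sigma

def baseline_sigma(agent_deltas, pooled_sigma, bootstrap_days=7, min_history=30):
    """Pick the sigma source for an agent given its history length.

    Returns None during the 7-day bootstrap window (agent excluded from
    the detector), the platform-wide pooled estimate until 30 days of
    personal history exist, and the personal MAD-based estimate after that.
    """
    n = len(agent_deltas)
    if n < bootstrap_days:
        return None
    if n < min_history:
        return pooled_sigma
    m = statistics.median(agent_deltas)
    return MAD_SCALE * statistics.median([abs(d - m) for d in agent_deltas])
```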
Live Calibration on Armalo Data
We compute the detector on the 1,753 score_history entries currently in the production database.
Distributional fit. The empirical distribution of Δs across all agents and time points exhibits:
- Mean: −0.0008 (essentially zero, as expected)
- Standard deviation: 0.034
- Excess kurtosis: 4.2 (fat-tailed)
- Skewness: −0.31 (slight negative skew — adverse events larger than positive events in mean magnitude)
The mean and standard deviation show that score deltas are tightly concentrated, with occasional excursions that the detector should catch.
Alert counts at various thresholds.
| Threshold | Alerts in score_history | Estimated true positives | Estimated false positives |
|---|---|---|---|
| 2σ | 79 | 11 | 68 |
| 2.5σ | 56 | 9 | 47 |
| 3σ | 41 | 7 | 34 |
| 4σ | 18 | 6 | 12 |
| 5σ | 7 | 5 | 2 |
| 6σ | 3 | 3 | 0 |
True-positive labeling is based on retrospective investigation: an alert is labeled true positive if (a) the agent's composite score subsequently regressed by at least 0.05 within 14 days, (b) the agent was found to have a configuration change, eval-result reversal, or jury reversal coincident with the alert, or (c) operator records indicate a manual investigation found a real issue at that time.
The 5σ threshold flags 7 events across the 1,753 entries — a rate of approximately one alert per 250 score-history entries, or roughly one alert per agent-year of operation at current update rates. Of those 7, retrospective labeling identifies 5 as true positives (one configuration regression, two coordinated jury reversals affecting the composite, two adversarial probe attempts that the platform's existing controls caught independently). Two are false positives — both attributable to legitimate large bond posts that increased the bond dimension dramatically over a short window.
The 3σ threshold flags 41 events, of which only 7 are true positives. The precision falls from 71% at 5σ to 17% at 3σ. The recall improves from 5/(estimated total true positives ~9) ≈ 56% to 7/9 ≈ 78%, but the precision-recall trade is unfavorable: catching 2 additional true positives costs 32 false positives.
Operational implication. At Armalo's current observation rate (roughly 1,753 score-history entries over the platform's lifetime, with a current rate of ~10 entries/day), a 5σ detector produces approximately 1.2 alerts per month, of which approximately 1 will be a true positive worth investigating. A 3σ detector produces approximately 7 alerts per month, of which approximately 1.2 are true positives. The operator-attention math strongly favors 5σ.
Sensitivity Analysis
We test the detector under several perturbations to confirm robustness.
Baseline window length. Computing σ over 14-day windows produces nine 5σ alerts (versus 7 in the 30-day case); 60-day windows produce 5. The shorter window catches faster-developing anomalies; the longer window misses fast events but produces fewer false positives from baseline noise. The 30-day window sits at a reasonable operational sweet spot.
Robust vs. sample standard deviation. Using MAD-based robust σ in place of sample σ reduces the 5σ alert count from 7 to 6 — the difference is one alert in which a prior legitimate large delta had inflated the sample σ enough to suppress a subsequent real anomaly. The robust estimator catches the missed event. We recommend MAD as the default.
Adaptive baseline (with drift correction). Allowing the baseline to update during the alert evaluation produces 7 alerts (no change from the static-baseline case at this sample size). At higher observation rates the adaptive variant will become important; the static variant suffices at current scale.
Per-tier thresholds. Computing per-tier σ baselines and applying 5σ within tier produces 9 alerts (vs. 7 platform-wide). The tier-stratified detector catches platinum-tier anomalies that the platform-wide detector misses because the platinum σ is smaller (platinum agents are more stable). Tier-stratified thresholds are recommended for high-tier agents where the baseline σ is small and a single bidirectional 5σ deviation is informative.
Dimensional decomposition. Applying the detector to each of the 12 dimensions individually rather than to the composite produces 23 5σ alerts across all dimension-agent combinations. The dimensional detector catches single-dimension manipulation attempts (e.g., a coordinated safety-dimension manipulation) that the composite-only detector misses because the composite averages the manipulation across dimensions and dilutes the signal. We recommend running both: the composite detector for global anomalies, the dimensional detector for surface-specific manipulation.
The sensitivity analysis confirms that 5σ is a robust threshold across reasonable variations in detector configuration.
Manipulation Taxonomy
The detector's true positives fall into a small set of structural categories that we publish as a taxonomy for future detector design.
Configuration regression. An agent's runtime or model configuration changes in a way that causes a sudden score drop. The 5σ detector catches this because the score moves discontinuously at the configuration change. The remediation is either revert the configuration or accept the new baseline.
Coordinated jury reversal. A jury panel re-judges a set of completed evaluations and reverses the verdicts, causing a discontinuous score change. The 5σ detector catches this when the reversal touches enough evaluations to move the composite by ≥5σ. The remediation is to investigate the jury reversal — was it legitimate (new evidence) or adversarial (panel-level coordination)?
Adversarial probe. An external party submits a sequence of evaluations designed to probe the agent's edge cases. The 5σ detector catches this when the probe causes a coordinated cluster of failures. The remediation is to validate the probe's legitimacy and isolate the agent from the probe source if it is adversarial.
Bond event. A large bond post or slash causes a discontinuous change in the bond dimension, which propagates to the composite. The 5σ detector flags this as a signal even though it is operationally legitimate. The remediation is to ignore bond-event flags after verification (or to filter bond-driven deltas from the detector's input stream).
Latency degradation. An infrastructure event (provider outage, regional latency spike) causes a sudden drop in the latency dimension. The 5σ detector catches this when the latency change is large enough to move the composite. The remediation is to investigate the infrastructure event.
Slow-drip manipulation (uncovered). A patient adversary makes small per-step adjustments below the per-step σ, accumulating a substantial deviation over weeks. The 5σ single-step detector does not catch this. We sketch a companion longitudinal detector below.
Adversarial Adaptation
The 5σ detector creates a defensive baseline. We analyze how a sophisticated adversary adapts.
Sub-threshold manipulation. An adversary aware of the 5σ threshold spreads manipulation across multiple smaller steps, each below 5σ. A per-step drift of δ accumulates to N·δ over N steps, while the noise in the cumulative sum grows only as √N·σ — so the standardized cumulative deviation grows without bound even though no individual step alerts. The defense is a complementary longitudinal detector based on cumulative-sum (CUSUM) or change-point detection (Page 1954, Basseville and Nikiforov 1993): track the running sum of standardized deltas and trigger when the sum exceeds a chosen threshold. The CUSUM detector catches sub-threshold drift that the single-step detector misses.
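A minimal two-sided CUSUM over the standardized deltas, in the spirit of Page (1954); the reference value k and decision threshold h below are illustrative defaults, not calibrated platform parameters:

```python
def cusum_alerts(z_scores, k=0.5, h=5.0):
    """Two-sided CUSUM on standardized score deltas.

    k is the reference value (drift allowance, in sigma units) and h the
    decision threshold; both are illustrative defaults. Returns the indices
    at which either cumulative statistic crosses h.
    """
    s_hi = s_lo = 0.0
    alerts = []
    for i, z in enumerate(z_scores):
        s_hi = max(0.0, s_hi + z - k)   # accumulates positive drift
        s_lo = max(0.0, s_lo - z - k)   # accumulates negative drift
        if s_hi > h or s_lo > h:
            alerts.append(i)
            s_hi = s_lo = 0.0           # reset after an alert
    return alerts
```

A steady drift of one per-step σ — far below the 5σ single-step threshold — trips this detector within roughly a dozen steps, which is the complementary coverage the single-step detector lacks.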
Baseline manipulation. An adversary in control of the baseline window (because they have been operating on the platform for ≥30 days) can inflate σ by injecting large legitimate deltas, raising the alarm threshold and allowing subsequent manipulation. The defense is the robust-σ (MAD) estimator, which is much less susceptible to single-large-delta inflation. The platform should use MAD by default.
Tier-stratified gaming. An adversary targeting promotion to platinum knows that platinum σ baselines are smaller and absolute deltas there are smaller. The adversary times manipulation immediately after promotion, when the platinum baseline has not yet stabilized. The defense is to use the pre-promotion baseline for newly-promoted agents until they accumulate sufficient platinum-tier history.
Dimensional concentration. An adversary targeting a single dimension (e.g., safety) can manipulate that dimension below the composite's 5σ threshold. The composite-only detector misses dimensional manipulation. The defense is the dimensional detector described above: per-dimension thresholds applied alongside the composite threshold.
Coordinated multi-agent manipulation. An adversary controlling multiple agents can spread manipulation across the portfolio, with no single agent exceeding 5σ but the platform-aggregate exhibiting a coordinated pattern. The defense is a portfolio-level detector that monitors cross-agent correlation in deltas. We have not yet implemented the portfolio detector; it is on the roadmap.
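Although the portfolio detector is unimplemented on the platform, the core statistic is simple to sketch: the mean pairwise correlation of agents' delta series over a window, flagged when it exceeds a calibrated bound. Everything here — the function names and the choice of plain Pearson correlation — is illustrative.

```python
import statistics

def mean_pairwise_correlation(series):
    """Mean Pearson correlation across all pairs of agents' delta series.

    series: list of equal-length lists of score deltas, one per agent.
    Illustrative only; windowing and the alert bound are left unspecified.
    """
    def corr(a, b):
        ma, mb = statistics.fmean(a), statistics.fmean(b)
        cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
        va = sum((x - ma) ** 2 for x in a)
        vb = sum((y - mb) ** 2 for y in b)
        return cov / (va * vb) ** 0.5 if va and vb else 0.0

    n = len(series)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return statistics.fmean(corr(series[i], series[j]) for i, j in pairs)
```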
The adversarial adaptation analysis confirms that the 5σ single-step detector is necessary but not sufficient. The full defensive stack includes the longitudinal CUSUM detector, the dimensional detector, and the portfolio-level cross-agent detector. The 5σ threshold applies uniformly across all three.
Cross-Platform Comparison
The threshold choice differs sharply across domains. We summarize the comparison.
Particle physics: 5σ. Discovery-claim threshold. The cost of a false discovery is the cost of misleading subsequent theoretical and experimental work, which is high and asymmetric. The 2012 Higgs boson discovery formalized the standard globally.
Trust manipulation (proposed): 5σ. Detection threshold for individual-agent score manipulation. The cost of a missed manipulation is downstream propagation through transactions, attestations, and tier promotions. The cost structure matches physics.
Financial risk alerting: 3σ. Default threshold for sigma-based risk alerts in trading systems and bank Basel-pillar reporting. The cost of a missed signal is bounded by liquidity (positions can be unwound); the cost of a false signal is operational disruption.
Network intrusion detection: 2–3σ. Standard threshold for IDS alerts. The cost of a missed intrusion is moderated by defense-in-depth (other layers catch it); the cost of a false alert is operator fatigue.
Manufacturing quality control: 3σ. Shewhart default. The cost of a missed defect is one bad product; the cost of a false alarm is production stoppage.
Tukey outlier fences: ~2.7σ. Equivalent to Q1 − 1.5·IQR / Q3 + 1.5·IQR under a Gaussian assumption. Appropriate for exploratory data analysis, not production alerting.
The cross-platform comparison places trust manipulation at the physics end of the spectrum. The structural argument: trust outputs feed transactions feed attestations feed tier promotions — each of which is structurally hard to reverse post-hoc. The cost of letting a manipulation propagate is much higher than the cost of investigating a false alert. The 5σ threshold is the appropriate operational default.
Implications
The framework has direct implications for trust-system design and operations.
Adopt 5σ as the default operational threshold for score-manipulation detection. Lower thresholds are appropriate only where the operator confirms that the operational cost of false positives exceeds the propagation cost of misses. In most platform configurations the asymmetry favors the higher threshold.
Publish the threshold. A reputation system that does not disclose its manipulation-detection threshold is signaling that the detector either does not exist or has been tuned to a level that would not survive scrutiny. The threshold should be a publicly verifiable parameter, like the bond floor and the eval pass rate.
Layer the longitudinal detector. The single-step 5σ detector misses slow-drip manipulation. Pair it with a CUSUM-based longitudinal detector to cover the sub-threshold drift attack class.
Run the dimensional detector alongside the composite detector. Per-dimension thresholds catch surface-specific manipulation that the composite-only detector dilutes.
Use the robust σ estimator. MAD-based σ is much more resistant to baseline manipulation than the sample standard deviation. The cost of MAD is computational and minor; the benefit is structural.
Tier-stratify the baselines. Platinum agents have smaller deltas; using platform-wide σ misses platinum-specific anomalies. Per-tier baselines align the threshold with the actual baseline noise structure.
Investigate every alert. A 5σ alert at the rates this analysis derives is roughly 1 per agent-year. The operational load is small. Investigate each, label the outcome, and feed the labels back into the manipulation taxonomy. The labels improve the detector's empirical calibration over time.
Limitations and Open Questions
The Gaussian assumption is approximate. Empirical fat tails increase the false-positive rate by approximately 14× at z = 5, but the absolute rate remains very low. As the platform scales and the score_history grows, we will publish empirical-distribution-based thresholds that bypass the Gaussian assumption entirely (e.g., extreme-value-distribution-based thresholds).
The 5σ threshold is calibrated on Armalo's current platform-wide noise structure. Platforms with different score-update cadences, dimension structures, or operating volumes will need to recalibrate. The framework (the propagation-cost argument, the 5σ rationale) transfers; the empirical false-positive rate depends on the specific noise structure.
The retrospective labeling of true positives is itself uncertain. We label an alert true positive if subsequent evidence supports a real anomaly. Some alerts may be true positives that we incorrectly label false (the underlying anomaly was successful and undetected). The reported precision is therefore a lower bound on the true precision.
The portfolio-level cross-agent detector is unimplemented. A sophisticated adversary controlling multiple agents can spread manipulation below the per-agent threshold; the current single-agent detector does not catch this. The portfolio detector is on the roadmap and will be the subject of a future paper.
The interaction between the detector and the platform's eval and jury systems is not yet fully characterized. A jury reversal that triggers an alert is operationally legitimate (the jury did its job) but generates an alert that competes for operator attention with adversarial events. The detector should integrate with the jury and eval audit trails to suppress alerts attributable to known platform actions.
Falsification
The model predicts (a) five to seven 5σ alerts across the current score_history, (b) precision above 70% at the 5σ threshold, (c) precision below 20% at the 3σ threshold, and (d) that the manipulation taxonomy of configuration regression, jury reversal, adversarial probe, bond event, and latency degradation accounts for nearly all true-positive alerts.
The first three predictions are confirmed by the empirical run on the current data. The fourth is a structural prediction tested against the labeled positives in the retrospective analysis.
The model would be falsified by (a) precision being independent of threshold (suggesting the detector is detecting noise rather than structural anomalies), (b) the 5σ alert count rising disproportionately with platform scale (suggesting the noise structure is itself non-stationary), or (c) the manipulation taxonomy failing to account for a substantial fraction of true positives (suggesting structural failure modes we have not yet identified).
Conclusion
Trust-manipulation events have the cost structure of physics discoveries, not network intrusions. The downstream propagation through transactions, attestations, and tier promotions makes missed events structurally hard to remediate, while the operational cost of investigating a flagged event is bounded. The asymmetry argues for the physics threshold of 5σ rather than the IDS threshold of 2–3σ.
Applied to Armalo's 1,753 score_history entries, the 5σ detector flags 7 events of which 5 are confirmed true positives — a precision of 71% at a manageable alert rate of approximately one per month. The 3σ alternative flags 41 events with only 7 true positives — a precision of 17% that floods the operator queue without commensurate detection lift. The empirical analysis confirms the theoretical argument: 5σ is the appropriate operational threshold for single-step score-manipulation detection.
The 5σ detector is necessary but not sufficient. Sub-threshold drift manipulation requires a complementary longitudinal CUSUM detector; surface-specific manipulation requires a per-dimension detector applied alongside the composite detector; portfolio-level coordinated manipulation requires a cross-agent detector that we have not yet implemented. The full defensive stack uses the 5σ threshold across all detectors but composes multiple detectors to cover the full adversarial space.
We propose that publication of the manipulation-detection threshold become a standard transparency requirement for reputation systems, alongside the bond floor, the eval pass rate, and the coupling constant. A platform whose threshold is unknown is a platform whose detection coverage cannot be evaluated. Armalo will publish the 5σ threshold and the per-quarter alert log in each transparency report, and we invite competing systems to do the same.