What We Set Out To Prove
The thesis: in agentic AI, the highest-leverage point for recursive self-improvement is the evaluator (the system that judges agents), not the agent itself. Improving the evaluator should compound trust calibration faster than the actor's own capability gain, because the evaluator's gain is multiplied across every agent it grades.
If true, the implication is: invest in evaluator RSI, not actor RSI.
The claim is attractive. It inverts the dominant industry bet, it re-uses substrate that Armalo already has, and it has a clean falsification path. So we measured it on production data, with the cleanest possible cross-metric Brier. The result is not the 2–4× multiplier the thesis projected. It is a 0.34% Brier reduction in a narrow band of conditions, with multiple conditions under which the same machinery is net-negative.
This paper is the honest write-up. The conditions where it works are the contribution; the conditions where it fails are the warning.
The Substrate
dream_outcome_records is Armalo's production ledger of revenue-metric forecasts vs outcomes for the agent swarm. Every row is one (predicted_delta, actual_delta) pair for one (role, revenue_metric) tuple. 1,258 rows, 39 days (2026-05-07 to 2026-06-15), 7 revenue metrics, 5+ roles.
The "evaluator" is whatever generated predicted_delta — typically a per-metric forecasting model that runs before the agent swarm acts. The "outcome" is the actual delta observed at horizon_days (3, 5, or 7 days). The "residual" is actual_delta − predicted_delta.
The dataset is dominated by systematic over-prediction: 868 of 1,258 rows (69%) are classified missed, 342 (27%) are beat, 42 (3%) are regressed, and only 6 (0.5%) met the predicted delta exactly. Mean absolute residual is 5,000–16,000 depending on the week. This is exactly the regime where an evaluator self-improvement loop should extract the most calibration signal — the static rubric is consistently wrong in the same direction.
The Three Arms
We replay the dataset three ways, plus a family of ablations, on a rolling-window cross-validation (train on rounds 0..k, evaluate on round k+1, average across 7 folds). The rolling window is the right time-series protocol; a single train→held-out split biases the held-out toward whatever regime the most-recent week happened to be.
Arm A — static. Predict = predicted_delta. No correction. Baseline.
Arm B — fixed-prior (per-metric and per-segment). A one-shot bias is fit on the training fold: bias[m] = mean(residual) per metric (or per (role, metric) tuple). Frozen thereafter. This is the "Bayesian update once, then stop" control. It uses outcome data but never updates.
Arm C — jrip (per-metric and per-segment). A bias term is updated each round via a gradient step on observed residuals: bias[segment] += α · mean(residual). We sweep α ∈ {0.01, 0.05, 0.10, 0.20, 0.30}. Per-segment means per (role, revenue_metric); per-metric collapses to the metric level.
We also ran Platt-style (bias + scale) updates, where the prediction is s · predicted + b per segment. Without regularization, Platt explodes (see Ablations).
Metric
For cross-metric comparability, we z-normalize both signals on their own per-metric distribution (predicted on predicted's stats, actual on actual's stats) and compute Brier as mean((z_pred − z_actual)²). The first version of this script used actual's stats for both, which produced a spurious 26% Brier-reduction headline — the bias term, in raw units, happened to shift the prediction's z-score toward the actual's mean. Once we normalize predicted on its own distribution, the strong signal evaporates. This is the most important honesty note in the paper: the 26% number is a normalization artifact. The real signal is what we report below.
Result — The Three Conditions
| Arm | Mean rolling-window Brier(z) | Δ vs static |
|---|---|---|
| jrip per-segment, α=0.05 | 1.4602 | −0.34% |
| jrip per-segment, α=0.01 | 1.4632 | −0.14% |
| jrip per-segment, α=0.10 | 1.4650 | −0.01% |
| static (baseline) | 1.4652 | 0.00% |
| fixed-prior per-segment | 1.4652 | 0.00% |
| jrip per-metric, α=0.01 | 1.4697 | +0.31% |
| jrip per-metric, α=0.05 | 1.4902 |
Three conditions emerge from the ablation:
- 1.Per-segment granularity is required. Per-metric updates are net-positive only at the lowest learning rate and turn net-negative above α=0.01. The metric-level bias is over-applied because it cannot represent the role-conditioned structure of the data.
- 2.Small learning rate is required. α=0.05 wins; α=0.20 is net-negative on per-segment; α=0.30 is net-negative on both per-segment and per-metric. Online gradient descent on a non-stationary distribution diverges if the step is too large.
- 3.Platt-style (bias + scale) is unstable without regularization. The scale term grows multiplicatively. We did not run a regularized Platt sweep in this experiment; it is a follow-up.
Why The Signal Is Small
Three honest reasons the 0.34% reduction is small:
- 1.6 weeks of data, 1,258 rows. Per-segment cells have ~15–25 rows per round, which is at the edge of stable bias estimation. With 5× the data, the signal-to-noise ratio should improve.
- 2.A heavily over-predicted distribution. When the static rubric is consistently wrong in one direction, a simple bias correction captures most of the signal in a single step. The residual after the first update is dominated by noise, not structure.
- 3.The metric is Brier on z-normalized values, not raw Brier. A naive raw Brier shows 16–26% "improvement" from JRIP, but that number is a normalization artifact (see Metric section). The honest cross-metric Brier is the right yardstick, and on it the gain is 0.34%.
The Evaluator Multiplier Is The Wrong Shape
The strong thesis ("improving the evaluator is 3–7× more leveraged than improving the actor") does not survive clean measurement. What survives:
- Evaluator RSI is real but conditional. It works in a narrow band: per-(role, metric) granularity, small learning rate, no scale term. Outside that band, it is net-negative.
- The cost of getting it wrong is large. Per-metric updates, fixed-prior at the metric level, and unregularized Platt all hurt Brier by 1–9%. The naive implementation is worse than no implementation.
- The path forward is gated. A promotion gate (only apply a new bias if a held-out split improves) is the right defense. The "Evaluator Multiplier" headline is replaced with "Evaluator RSI, gated, per-segment, with a small step, helps 0.34% on production data; without the gates, it hurts."
The Public Artifact: JRIP v0.1
We publish the Jury Recursive Improvement Protocol (JRIP) v0.1 as a portable spec (see docs/labs/2026-06-15-jrip-v0.1.md). JRIP v0.1 specifies:
- Inputs: a stream of (segment_key, predicted, actual) tuples.
- Update rule: per-segment bias,
α · mean(residual)gradient step. - Promotion gate: a held-out split (default 20%) must show Brier improvement before the new bias is applied.
- Rollback contract: a per-segment bias version, one-command rollback to the previous version, audit trail of every promotion.
- Default α: 0.05.
- Default granularity: per (role, revenue_metric) — the smallest unit at which the data is dense enough to be informative.
Other labs can adopt JRIP without Armalo's data. The protocol is substrate-agnostic; the experiment is substrate-specific.
What This Paper Does Not Claim
- That the Evaluator Multiplier (3–7× compound gain) is real. It is not. It is a projection that the data does not support.
- That recursive self-improvement of the evaluator is always net-positive. It is not. Three conditions must hold.
- That a single one-shot fixed-prior is sufficient. The per-metric fixed-prior HURTS Brier by 9%. The per-segment fixed-prior is neutral. Only the per-segment gradient update is net-positive.
- That Platt-style (bias + scale) is the right parametric form. It explodes without regularization; a regularized version is a follow-up.
What We Are Doing Next
- Re-run with 3–6× more data. The 6-week window is short. With 6 months of data, per-segment cells will be dense enough that the per-segment gradient update should extract more signal.
- Regularized Platt (decay toward (1, 0)). The unregularized version explodes; a decayed version should be a stronger baseline.
- Long-horizon outcome signals.
dream_outcome_recordsuses 3–7 day horizons. Longer-horizon outcomes (30 days) are more stable and should produce a stronger calibration signal. - Live A/B on opted-in agents. The replay is offline; a live A/B (per the original plan) is the only way to confirm the 0.34% reduction translates to a real-world outcome signal.
Replication
Source: scripts/research-experiments/jrip-brier-replay.mjs (headline 80/20 split), scripts/research-experiments/jrip-ablations-rolling.mjs (rolling-window ablations).
Raw data: apps/web/content/research/data/2026-06-15-the-judge-that-learned/
headline.json— the 80/20 train→held-out result (retained for transparency; the rolling-window is the primary result).held-out-*.json— per-arm held-out Brier, MAE, ECE.rounds-*.json— per-round Brier, MAE, by-metric.ablations-rolling-window.json— the full rolling-window sweep across all ablations and α values.
To reproduce:
node scripts/research-experiments/jrip-brier-replay.mjs
node scripts/research-experiments/jrip-ablations-rolling.mjsBoth scripts pull from dream_outcome_records via Neon HTTP. They are idempotent; the JSON output overwrites itself on each run.
Citation
Armalo Labs Research Team. (2026, June 15). *Evaluator Recursive Self-Improvement in Production: 0.34% Brier Reduction, and the Three Conditions Required to Get It.* Armalo Labs.