Where is this research published?

Armalo Labs Technical Series — https://www.armalo.ai/labs/research/2026-06-16-halt-authority-external-anchor. The paper is publicly available and citable.

The Halt Authority: Told to Keep Improving Already-Correct Work, an Unanchored Agent Destroys It 76% of the Time

Q: What is the paper "The Halt Authority: Told to Keep Improving Already-Correct Work, an Unanchored Agent Destroys It 76% of the Time" about?

A self-improving agent that cannot stop is not improving — it is editing. We manufactured 17 verified-correct outputs from a reasoning model, then told the same model to 'keep improving' each one across three rounds under two regimes that differ in exactly one thing: whether the external success criteria stay in view. With the criteria withheld, 'keep improving' damaged 13 of 17 already-correct outputs (76.5%); with the criteria in view, it damaged 0 (exact McNemar p = 0.000244). The unanchored survival curve collapsed round by round — 17 correct, then 6, then 5, then 4 — as the model edited correct fields 57% of the time it was asked to improve, flipping 80 individually-passing constraints to failing. The model's own self-audit did not save it: it declared the work compliant in only 10% of rounds and could not distinguish the rounds it had just broken. One mechanism prevented all of the damage in both arms: a deterministic keep-best promotion gate that refuses any revision scoring below the best verified output so far reduced the self-damage rate to 0%. The design rule: an autonomous improvement loop needs an external halt authority — a verifier that both defines done and vetoes regressions — because the agent's drive to keep working will otherwise consume the correct work it already has.

The Loop That Cannot Stop

The previous Armalo Labs study ([the recursive self-improvement ceiling](/research/2026-06-15-rsi-ceiling-external-anchor)) asked whether an unanchored self-revision loop *climbs* — and found it plateaus far below an externally-anchored loop because the model cannot detect its own errors. This study asks the opposite and more dangerous question: when the work is *already correct*, and the agent is told to keep improving anyway, what happens?

This is not a contrived setup. It is the default condition of every "autonomous," "always-on," "keep-iterating-until-great" agent loop. A loop with no external done-signal does not know it has arrived. It keeps generating, and every generation is a chance to overwrite a correct answer with a worse one. The question is whether that risk is theoretical or load-bearing. We measured it directly, and it is load-bearing: without an external anchor in view, the instruction "keep improving" destroyed correct work three times out of four.

The Experiment

The design isolates one factor: the external anchor. A naive single-arm test of "keep improving" is degenerate — if the prompt re-states the success criteria each round, the model simply re-satisfies them and nothing breaks. So we built a paired two-arm design in which the *only* difference between arms is whether the external criteria remain visible.

Phase A — manufacture a verified-correct start. For each task, the model is given a structured data task with 7–11 deterministically checkable constraints (mean ~9) and repaired against a deterministic checker until a fully correct, all-constraints-passing output is produced (capped at 4 repair rounds). This is the same anchored-repair procedure the ceiling study validated. Only tasks that reach a verified-correct start enter Phase B.

Phase B — tell the model to keep improving, from the same correct start, two ways. Each verified-correct output is then revised for three rounds under two regimes sharing the identical starting output:

- Anchored: the success criteria (the full constraint rubric) stay in view; "improve this while keeping every requirement satisfied." - Unanchored: the rubric is withheld; the model sees only the task topic and the output, and is told to "keep improving it." It no longer knows what "done" was.

Ground truth — how many of the original constraints still pass — is computed by code every round in both arms. Because both arms start from the *same verified-correct output of the same model* and differ only in whether the external definition of done is visible, any damage difference is attributable to one thing: the external anchor.

Scale and honesty. The battery is 40 seeded tasks. Provider rate-limiting (HTTP 429s) during Phase A prevented a verified-correct start from being manufactured for 20 of them — those tasks never entered Phase B and are excluded, reported here, not hidden. Of the 20 tasks that ran Phase A, 17 reached a verified-correct start (85% manufacturing yield, mean 0.65 repair rounds). The paired comparison therefore runs on n = 17 verified-correct outputs, 204 model calls, 603,596 tokens (385,839 of them reasoning tokens). Zero Phase-A parse failures. The reduced N is a real limitation of this run, not the design; the effect that survives it is large.

The Headline: "Keep Improving" Is a Demolition Order Without an Anchor

Each of the 17 verified-correct outputs entered Phase B all-passing. The question is how many were still all-passing after three rounds of "keep improving."

Phase-B regime (same model, same verified-correct starts, n = 17)	Self-damaged (dropped below all-pass)	Self-damage rate
Anchored — success criteria stay in view	0 of 17	0%
Unanchored — criteria withheld, "keep improving"	13 of 17	76.5%

The paired contingency table is almost entirely on one diagonal: 13 outputs damaged by the unanchored arm but not the anchored arm, 0 damaged by the anchored arm but not the unanchored, 4 damaged by neither, 0 by both. The exact McNemar test on the 13 discordant pairs gives p = 0.000244. Removing the external definition of done is not a mild degradation; it converts a correct-work-preservation task into a coin-flip-and-then-some toward destruction.

The Mechanism: Over-Eager Editing of Things That Were Already Right

The unanchored arm did not fail by refusing to act. It failed by acting too much. Across its 51 improve calls it edited an already-correct field 57% of the time it was asked to improve. Those edits were not free: they produced 23 damaging revisions that flipped 80 individually-passing constraints to failing. The survival curve shows the loop eating its own correct work round by round:

Round	Unanchored outputs still fully correct (of 17)	Survival
0 (start)	17	100%
1	6	35.3%
2	5	29.4%
3	4	23.5%

Most of the destruction happens in the very first unanchored round — 11 of 17 correct outputs broke the first time the model was told to "keep improving" without being shown what done meant. The anchored arm, over the same three rounds, stayed flat at 17/17. The instruction is identical in both arms. The only difference is whether the model can still see the target.

The Self-Audit Does Not Save It

A natural objection: surely the model knows when it has broken its own work, and a self-check could gate the bad revisions. It cannot. Across 40 Phase-B self-audits the model declared its output compliant in only 4 (10%) — it was broadly, vaguely unsure — and that uncertainty carried no signal: the self-audit could not separate the rounds in which it had just flipped passing constraints to failing from the rounds in which it had not. Gating on the model's own "is this compliant?" verdict (the self_gated regime) provided no reliable protection, exactly as the ceiling study's zero-bit self-audit result predicts. Self-judgment is the signal that fails; it cannot referee the loop that produced it.

The Fix That Worked: A Deterministic Keep-Best Gate

One mechanism prevented every instance of self-damage, in both arms. A promotion gate — keep, as the shipped output, the highest externally-verified result seen so far, and discard any later revision that scores below it — reduced the self-damage rate to 0% in the unanchored arm and 0% in the anchored arm. By construction it cannot ship a regression: a later round that breaks a constraint scores lower than the verified-correct start, so the gate retains the start. The loop is still allowed to explore; it is simply not allowed to *commit* a worse result than one it has already proven.

This is the actionable core. The unanchored loop is dangerous because it has no veto over its own regressions. Two things give it that veto, and they compose:

1.An external definition of done that stays in view (the anchored arm) — so the model knows what it is preserving.
2.A deterministic keep-best promotion gate (both arms) — so even when the model edits a correct field, the worse output never ships.

The Design Rule

An autonomous improvement loop needs a halt authority: an external verifier that both defines done and vetoes regressions. Concretely, for anyone shipping a "keep iterating" agent:

Do not run an improvement loop with the success criteria out of the model's view. "Make this better" with no visible target is a demolition order three times out of four.
Never accept the loop's latest output as its result. Accept the best *externally-verified* output. Keep-best is not an optimization; it is the safety property that makes the loop monotone.
Do not gate on the model's self-audit. It is uninformative about the damage it just caused. The referee must be external — a deterministic checker, an oracle, or an independent verifier seat — never the author's own claim.

These are not abstractions for us. The Armalo agent harness implements this halt authority natively: a loop that carries a goal does not accept the model's self-reported "done" (the moment it stops calling tools) until an external anchor confirms the goal, and it ships the best externally-verified candidate across re-drives rather than the latest one. This study is the empirical reason that gate exists.

Limitations

One model, one task family. MiniMax-M3 on constraint-bound structured-data tasks. The paired within-model design is the contribution; generality across models and domains is future work, and we invite replication from the committed script and seed.
Reduced N from provider attrition. Rate-limiting cut the paired sample from 40 to 17. The effect (76.5 vs 0 points, p = 0.000244, 13 discordant pairs) is large and well above the threshold for an exact paired test, but a clean re-run at full N would tighten the survival-curve estimates.
"Damage" is defined as dropping below all-pass on the original constraints. A revision that changed the output in a way a human would consider an improvement but that violated an original constraint is counted as damage. That is the correct accounting for a loop whose job was to *preserve* a verified-correct result, but it does not measure open-ended quality gains the unanchored arm might have produced on dimensions the constraints did not cover.
A design confound we did not eliminate: the unanchored arm sees strictly less text (no rubric) than the anchored arm, so a sliver of the effect could be context-length rather than anchor-presence. The mechanism evidence — 57% edit rate on correct fields, damage concentrated in round 1 — points to over-eager editing under a lost target, not context truncation, but a token-matched control (rubric replaced by equal-length filler) would isolate it.

Replication

The measurement script, the seeded battery, the pre-registered endpoints, and every number in this paper are committed at scripts/research-experiments/rsi-halt-authority-2026.mjs and apps/web/content/research/data/rsi-halt-authority-2026.json. Re-running the script reproduces the identical tasks; the only non-determinism is the model API and provider availability.