The Loop That Is Supposed To Close
The promise of a self-improving agent is a closed loop with no human in it: the agent produces work, inspects its own output, finds the defects, and revises, and the revised version is *better*, not merely different. Run the loop enough times and quality climbs on its own. This is the mechanism underneath every "autonomous," "self-healing," and "recursively self-improving" agent claim shipping today.
The loop closes only if one assumption holds: that the agent can detect its own errors well enough to correct them. This experiment measured that assumption directly by running the identical revision loop two ways, once where the agent must find its own errors and once where a deterministic checker hands it the list of what failed, and comparing how far each climbs. The assumption does not hold the way practitioners hope. Unanchored self-revision captured less than half the repair the anchored loop captured, the two trajectories *diverged* with each round instead of converging, and the reason was not that the model corrected poorly. It was that the model, looking at its own wrong answer, usually saw nothing wrong.
The Experiment
Design. One model (MiniMax-M3, a commercially available reasoning model, temperature 0.3) completes a battery of structured data tasks, each carrying 7 to 11 deterministically checkable constraints (mean 9.3). The constraints mix surface formatting (key sets, array lengths, ordering, word limits) with computation (string reversal, character and vowel counts, acronym construction, arithmetic sums), because computation is where token-based models fail at useful rates. Every task produces a shared round-0 output from a single author call. That output is then revised for three rounds under two regimes that share the same starting point and differ in exactly one factor:
- Self (unanchored): "Review your prior JSON against the constraints and improve it; return it unchanged if you believe it is already correct." The model is never told *which* constraints fail. It must detect its own errors.
- Checker (anchored): a deterministic JavaScript checker computes which constraints fail and names them; the model repairs against that feedback.
Ground truth for every constraint, in both arms, every round, is computed by code, never by model judgment. In the Self arm the checker result is recorded but withheld from the model. Because both arms revise the identical round-0 output of the identical model, any difference in where they end up is attributable to one thing: the presence of an external ground-truth signal.
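A minimal sketch of what such a deterministic checker could look like, written in JavaScript because the study's checker is JavaScript. The constraint names, the task and output shapes, and `checkOutput` are hypothetical illustrations, not the study's actual code. The point is only that every verdict is computed by code and each failure is reported by name.

```javascript
// Hypothetical sketch: constraint names and the task/output shapes are
// assumptions for illustration, not the study's actual battery format.

// Reverse by Unicode code point so surrogate pairs are not split.
function reverseString(s) {
  return Array.from(s).reverse().join("");
}

function countVowels(s) {
  return (s.match(/[aeiou]/gi) || []).length;
}

// Each constraint is a named predicate over (task, output).
const constraints = [
  {
    name: "has_required_keys",
    check: (task, out) =>
      task.requiredKeys.every((k) => Object.prototype.hasOwnProperty.call(out, k)),
  },
  {
    name: "reversed_matches",
    check: (task, out) => out.reversed === reverseString(task.source),
  },
  {
    name: "vowel_count_matches",
    check: (task, out) => out.vowelCount === countVowels(task.source),
  },
];

// Returns the names of the failed constraints; an empty list means all-pass.
function checkOutput(task, out) {
  return constraints.filter((c) => !c.check(task, out)).map((c) => c.name);
}

const task = { source: "detection ceiling", requiredKeys: ["reversed", "vowelCount"] };
// Two characters are transposed in the reversal; everything else is right.
const output = { reversed: "gnilice noitceted", vowelCount: 7 };
console.log(checkOutput(task, output)); // [ 'reversed_matches' ]
```

Under this sketch, the Checker arm would pass the returned names to the model, while the Self arm would record the same list and withhold it.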
Scale. 40 tasks, 209 model calls, 705,264 tokens consumed (431,664 of them reasoning tokens). The battery is generated from a fixed seed and committed with the script; the entire experiment reruns from one file. One checker-arm revision (of 209 calls) failed to parse and is counted; the Self arm had zero parse failures.
The Substrate: What The Model Got Wrong
At round 0 the model produced fully correct output on 18 of 40 tasks (45%). The 22 failures are the substrate for the comparison, and they are unusually clean. Every one of the 22 task-level failures involved the same constraint: character-by-character string reversal. Formatting constraints (slugs, array lengths, ordering, word counts) were satisfied at round 0 essentially without exception. Reversing a 40-to-60-character string correctly is genuinely hard for a model that operates on tokens, and it is the single operation that broke.
This narrows the experiment honestly: it is not a study of self-correction across all error types, but of self-correction on one hard, verifiable operation that the model fails at a measurable rate. That narrowness is a feature. It means the round-0 failures are real, deterministic, and identical in kind, so the only variable left is whether the model can *find* them.
The Headline: A Widening Ceiling
| Revision regime (same model, same round-0 outputs) | Repaired (of 22 round-0 failures) | Final all-pass (of 40 tasks) |
|---|---|---|
| Self (unanchored; must detect own errors) | 6 (27.3%) | 23 (57.5%) |
| Checker (anchored; told which constraints failed) | 14 (63.6%) | 32 (80.0%) |
| *Recursive self-improvement ceiling gap* | 36.4 points | 22.5 points |
The paired comparison on the 22 failures is one-directional. In 9 cases the checker arm repaired a failure the self arm could not; in only 1 case did the self arm repair one the checker arm could not. Exact two-sided McNemar p = 0.0215. Unanchored self-revision captured 42.9% of the repair the external anchor captured, less than half.
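The reported statistics can be reproduced from the counts above. The sketch below is an illustrative recomputation, not the study's script: `choose` and `mcnemarExact` are hypothetical helper names, and the exact test is taken to be the standard two-sided binomial test on the discordant pairs (9 versus 1).

```javascript
// Binomial coefficient C(n, k), computed multiplicatively.
function choose(n, k) {
  let result = 1;
  for (let i = 1; i <= k; i++) {
    result = (result * (n - k + i)) / i;
  }
  return result;
}

// Exact two-sided McNemar test on discordant counts b and c:
// twice the lower binomial tail at p = 0.5, capped at 1.
function mcnemarExact(b, c) {
  const n = b + c;
  if (n === 0) return 1;
  const m = Math.min(b, c);
  let tail = 0;
  for (let k = 0; k <= m; k++) tail += choose(n, k);
  return Math.min(1, (2 * tail) / 2 ** n);
}

const failures = 22;
const selfRepaired = 6;
const checkerRepaired = 14;
const pct = (x) => (100 * x).toFixed(1);

console.log(mcnemarExact(9, 1).toFixed(4));                    // 0.0215
console.log(pct(selfRepaired / failures));                     // 27.3
console.log(pct(checkerRepaired / failures));                  // 63.6
console.log(pct((checkerRepaired - selfRepaired) / failures)); // 36.4
console.log(pct(selfRepaired / checkerRepaired));              // 42.9
```

With 10 discordant pairs, the p-value is 2 × (1 + 10) / 1024 = 22 / 1024.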
The trajectory is the sharper result. All-pass counts over the four states (rounds 0 through 3):
- Self: 18 → 20 → 22 → 23
- Checker: 18 → 25 → 28 → 32
The two loops start at the same place and *diverge*. Each additional round of unanchored self-revision adds one or two all-pass tasks across the whole battery; each round of anchored revision adds three to seven. The gap at round 1 is 5 tasks; by round 3 it is 9. A self-improvement loop that runs longer does not catch up to the anchored loop. It falls further behind, because it has already corrected the few errors it can perceive and is now revising outputs it believes are already correct.
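The per-round gains and the widening gap follow directly from the two all-pass trajectories. A short illustrative computation (the variable and function names are hypothetical):

```javascript
// All-pass counts at rounds 0..3, taken from the trajectories above.
const selfArm = [18, 20, 22, 23];
const checkerArm = [18, 25, 28, 32];

// Gain in all-pass tasks contributed by each revision round.
const gains = (xs) => xs.slice(1).map((x, i) => x - xs[i]);

// Gap between the arms at each state.
const gap = checkerArm.map((c, i) => c - selfArm[i]);

console.log(gains(selfArm));    // [ 2, 2, 1 ]
console.log(gains(checkerArm)); // [ 7, 3, 4 ]
console.log(gap);               // [ 0, 5, 6, 9 ]
```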
Observation. Unanchored self-revision improves, but slowly, and toward a ceiling well below what the same model reaches with an external signal. Mechanism. See below: the ceiling is a detection limit, not a correction limit. Implication. "It self-improves" and "it self-improves toward correctness" are different claims. The first is cheap; the second requires an external referent the loop does not contain.
The Mechanism: A Detection Ceiling, Not A Correction Ceiling
The natural reading of "self-revision fixed only 27%" is that the model is a weak corrector. The data says otherwise. Of the 22 round-0 failures, the self-revising model never changed a single field on 16 of them: across all three rounds, the output it submitted was byte-identical in its verdict profile to the output it started with. It was not asked to stop; it was asked to find errors and improve. It looked at its own incorrect string reversal, decided the work was done, and returned it unchanged.
Where the model *did* act, it was effective: of the 6 failures it edited, it repaired all 6. The bottleneck is not the repair step. It is the perception step. The self arm produced only 2 verdict-changing revisions per round across all 22 failures, and not one of those revisions lowered the verified pass count. The model is a conservative, accurate editor that almost never engages, because it almost never sees a problem.
This is the same phenomenon Armalo Labs measured in [the Zero-Bit Self-Audit](/research/2026-06-11-zero-bit-self-audit): asked to judge whether its own output satisfied a constraint, the model's self-report was a near-constant "yes." Here that constant has a downstream cost made concrete: a self-report that never says "failed" produces a revision loop that never revises. The external checker's entire contribution is to convert that silence into a specific, actionable "constraint 7 failed," and that single bit of detection is worth 36 points of repair.
It also explains why even the anchored arm stopped at 64% rather than 100%: told exactly which constraint failed, the model still could not always reverse the string correctly. That residual is the genuine *capability* limit, shared by both arms. The decomposition is clean:
- Detection ceiling (positional, recoverable): the failures the self arm leaves untouched because it cannot perceive them. An external verifier recovers these.
- Correction ceiling (capability, shared): the failures both arms attempt but neither can fix. Only a stronger model, a tool, or a different representation recovers these.
A self-improvement loop with no external anchor is capped at the detection ceiling. In this study that cap cost more than half the available repair.
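One way to operationalize this decomposition is to label each round-0 failure from its per-round record. The sketch below assumes a hypothetical record shape (`selfEdited`, `selfRepaired`, `checkerRepaired`) and uses made-up example records, not the study's data. A failure is labeled by the first ceiling the self arm hits.

```javascript
// Hypothetical record shape, one record per round-0 failure:
//   selfEdited      - the Self arm changed the failing field in any round
//   selfRepaired    - the Self arm's final output passes
//   checkerRepaired - the Checker arm's final output passes
function classifyFailure(f) {
  // Never touched by the Self arm: it did not perceive the error.
  if (!f.selfEdited) return "detection-limited";
  // Attempted, but neither arm landed a fix: a shared capability limit.
  if (!f.selfRepaired && !f.checkerRepaired) return "correction-limited";
  // Attempted and fixed by at least one arm.
  return "repaired";
}

function tally(failures) {
  const counts = { "detection-limited": 0, "correction-limited": 0, repaired: 0 };
  for (const f of failures) counts[classifyFailure(f)] += 1;
  return counts;
}

// Illustrative records only; these are not rows from the study's data file.
const exampleFailures = [
  { id: "t01", selfEdited: false, selfRepaired: false, checkerRepaired: true },
  { id: "t02", selfEdited: false, selfRepaired: false, checkerRepaired: false },
  { id: "t03", selfEdited: true, selfRepaired: true, checkerRepaired: true },
  { id: "t04", selfEdited: true, selfRepaired: false, checkerRepaired: false },
];

console.log(tally(exampleFailures));
// { 'detection-limited': 2, 'correction-limited': 1, repaired: 1 }
```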
Reusable Framework: The RSI Anchor Decision
The result generalizes into a decision a builder can make before shipping any self-improving loop. The question is not "is my agent good at self-correction?" but "does the thing my loop must detect lie above or below the agent's detection ceiling?" The two failures decompose differently and demand different remedies.
| | Detection-limited failure | Correction-limited failure |
|---|---|---|
| Symptom | Loop converges fast to a wrong answer it is confident about; revisions stop early | Loop keeps editing, never lands; revisions thrash |
| What the agent does | Submits unchanged; "looks correct to me" | Repeatedly rewrites the same field |
| Recovered by | An external verifier (deterministic check, second-seat reviewer, tool output) | A stronger model, a tool that performs the operation, or a different representation |
| NOT recovered by | More self-revision rounds (the model already cannot see it) | More verifier feedback alone (the model sees it but cannot fix it) |
| Loop design rule | Bind the loop to a deterministic proof gate before promoting any revision | |
Anchor Decision rule. If any defect your loop must catch is *checkable by code but not reliably perceived by the model*, the loop must include that code as a gate. Unanchored self-revision will report success over exactly those defects, because the same blind spot that produced the defect suppresses its detection. The cost of omitting the anchor is not zero improvement (the loop will show motion and partial gains); it is a ceiling, invisible from inside the loop, that sits below correctness.
This is why Armalo's agent harness does not let a self-improvement loop promote a change on the agent's own say-so. A HarnessLoopPolicy must declare deterministic proof gates (a command and a success criterion) before any revision is promoted, and recursive self-improvement is bound to a contract with rollback and an external pass/fail signal rather than a self-report (packages/agent-runtime/src/mission-spine/loop-policy.ts). This experiment is the empirical justification for that constraint: with the proof gate, the loop reached 80%; on self-judgment alone, 57.5%, and it was slowing down.
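A generic sketch of a proof-gated revision loop is shown below. It is not the HarnessLoopPolicy API or the contents of loop-policy.ts: `runAnchoredLoop`, `revise`, and `check` are hypothetical names, with `revise` standing in for a model call and `check` for a deterministic checker that returns the names of failed constraints.

```javascript
// Hypothetical sketch of a proof-gated loop; all names are illustrative.
function runAnchoredLoop(initial, revise, check, maxRounds) {
  let current = initial;
  let failures = check(current);
  for (let round = 0; round < maxRounds && failures.length > 0; round++) {
    const candidate = revise(current, failures);
    const candidateFailures = check(candidate);
    // Proof gate: promote only when the checker verifies fewer failures.
    if (candidateFailures.length < failures.length) {
      current = candidate;
      failures = candidateFailures;
    }
    // Otherwise the previous output is kept, which acts as a rollback.
  }
  return { output: current, failures };
}

// Toy checker: the only constraint is that "abc" is reversed correctly.
const check = (out) => (out.reversed === "cba" ? [] : ["reversed_matches"]);

// Stub reviser: the first attempt is still wrong, the second is right.
const attempts = [{ reversed: "bca" }, { reversed: "cba" }];
let calls = 0;
const revise = () => attempts[calls++];

const result = runAnchoredLoop({ reversed: "acb" }, revise, check, 3);
console.log(result); // { output: { reversed: 'cba' }, failures: [] }
```

The gate never consults the model's opinion of its own output: the wrong first revision is discarded because the checker still reports a failure, and the loop stops as soon as the checker reports none.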
What This Does And Does Not Claim
This is a within-model case study on one task family, and its scope should be read precisely.
- One model, one operation. All round-0 failures were string-reversal errors from MiniMax-M3. The 36.4-point gap is specific to a hard, code-verifiable operation this model fails at and cannot self-detect. A different model, or a defect class the model *can* perceive, would show a smaller gap. The contribution is the paired decomposition and the detection-vs-correction distinction, not a universal constant.
- The anchor here is perfect. The checker is deterministic ground truth. Real verifiers are imperfect; a noisy verifier recovers less than this clean one. The result is an upper bound on what an external anchor buys, and a *lower* bound on how badly unanchored self-revision underperforms relative to an attainable target.
- Conservative editing is itself a finding, not a confound. The model's reluctance to change a "finished" output is precisely the detection ceiling in action; it is what the experiment set out to measure, not noise obscuring it.
Falsification. The claim (that unanchored self-revision is detection-limited and an external anchor raises the ceiling) is falsified if, on a defect class the model fails at but demonstrably *does* attempt to revise repeatedly, the self and checker arms converge to the same final accuracy. We did not observe that here; the self arm declined to engage. A model that aggressively self-revises and still cannot close the gap would relocate the ceiling from detection to correction: a different paper, equally publishable.
This result sits alongside the broader finding in the literature that intrinsic self-correction without external feedback can stall or degrade reasoning performance ([Huang et al., *Large Language Models Cannot Self-Correct Reasoning Yet*, ICLR 2024](https://arxiv.org/abs/2310.01798)), and the argument that self-improvement loops which rely on their own outputs as the truth signal risk drift absent an external grounding oracle ([*On the Limits of Self-Improving in LLMs*, 2026](https://arxiv.org/html/2601.05280v1)). Our contribution is a deterministic, paired, within-model measurement that separates the recoverable (detection) component from the irreducible (correction) one, and prices the anchor at 36 points.
Replication
Live controlled experiment. Battery generated from seed 2026061501; model MiniMax-M3 at temperature 0.3; 40 tasks, 3 revision rounds, two paired arms. Measurement script: scripts/research-experiments/rsi-ceiling-2026.mjs. Raw output, including every per-round verdict profile and the full task battery: apps/web/content/research/data/rsi-ceiling-2026.json. Re-run with:
    source .env
    node scripts/research-experiments/rsi-ceiling-2026.mjs
    pnpm research:audit
The primary endpoint (paired final-state McNemar on round-0 failures), the round trajectory, the category decomposition, and the per-round churn counts are all computed in the script and emitted to the data file; every number in this paper is read from that file.