Empirical Honesty Note
This paper does not claim that the Armalo Agent recursive-self-improvement (RSI) stack is 1000x better than Hermes today. 1000x is the target. What we measured against the live production database on 2026-05-17 is a present-day baseline — significant on some axes (52.5x action diversity), unmeasurable on others (the customer-side agent_variants flywheel has 1 row in 30 days), and short of 1000x everywhere. The paper documents what is measured, what is projected, and what code needs to ship to close the gap.
Every quantitative claim in the prose traces to one of four provenance kinds: measurement (a committed script's output data file), code-ref (a specific file:line in the codebase), derivation (a formula stated in §Replication), or projection (forward-looking estimate explicitly labeled in prose). Numbers that have not been measured are removed or labeled, never invented. The audit gate at scripts/audit-research-claims.mjs enforces this.
Abstract
Recursive self-improvement is the property that an agent's cycle N reasoning is informed by, and meaningfully better than, cycle N−1. We define a six-dimensional RSI capability vector (scan breadth, ranking sophistication, action diversity, verification depth, learning persistence, and compounding layers) and report it for both the Hermes admin-swarm baseline (a projection from forensic research, §1) and the current Armalo platform (live DB measurement, §2). The measured present-day improvement ratios are: 52.5x action diversity, 2x scan breadth, an undefined skill-synthesis throughput ratio (both sides observed at 0/week in the 7-day snapshot, so there is no denominator), and single-layer compounding on both sides. With sample N=54 customer-scored agents, the average composite score moves +4.65 points per cycle at $0.06 per quality point. §3 frames why these ratios fall short of 1000x and §4 specifies the four code surfaces (agent_variants flywheel activation, multi-source scanner, causal verifier, second-order compounding) that must ship to credibly claim 1000x.
1. Hermes Baseline — Capability Vector
The Hermes RSI baseline is taken from forensic research at .planning/rsi-1000x/RESEARCH-HERMES-BASELINE.md (2026-05-16). That document cites every Hermes property with file:line pointers; the structured extract this paper consumes is at apps/web/content/research/data/hermes-baseline-summary.json.
Per the audit-registry rules in scripts/audit-research-claims.mjs, every number in this section is a projection of Hermes-baseline capability — we did not run Hermes ourselves and measure its outputs in this session.
1.1 What Hermes is
Two admin-swarm loops:
- Scanner: every 4 hours, 3 tools, Codex-safe. Reads the last 80 rows of `jarvis_heartbeats` and the last 100 rows of `jarvis_actions` (`tooling/admin-swarm/src/loops/hermes-scanner.ts:100-122`). Output: 1–3 patterns per cycle, persisted to `cortex_memories` with a 1-day TTL (`hermes-scanner.ts:179`).
- Trainer: every 12 hours, 6 tools, cap of 2 synthesis actions per cycle (`hermes-trainer.ts:437`). Output: optional directive (12h TTL) and/or skill (permanent). LLM synthesis is off by default: `HERMES_ENABLE_LLM_SYNTHESIS` defaults to `false` (`hermes-trainer.ts:41`), so the default Trainer cycle is a deterministic no-op fallback.
1.2 Hermes capability vector (projection)
| Axis | Hermes baseline value | Source |
|---|---|---|
| Scan breadth (distinct signal tables) | 2 | hermes-scanner.ts:100-122 |
| Action diversity (distinct action types) | 2 (directive, skill) | hermes-trainer.ts:227-277 |
| Trainer cycle throughput at design cadence | 14 cycles/week | derived: 7d×24h ÷ 12h |
| Max skills per cycle | 2 | hermes-trainer.ts:437 |
| Design capacity skills/week | 28 | derived: 14 cycles × 2 skills |
| Observed skills/week (2026-05-12 snapshot) | 0 | gauntlet snapshot (memoryWrites7d: 0) |
The structured JSON behind this table is apps/web/content/research/data/hermes-baseline-summary.json, produced by scripts/research-experiments/hermes-baseline-summary.mjs.
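The two derived rows in the table follow from simple arithmetic on the Trainer cadence and the per-cycle cap. A minimal sketch of that derivation is below; the function and parameter names are illustrative and are not taken from the Hermes codebase or from `hermes-baseline-summary.mjs`.

```javascript
// Derive cycles/week and design-capacity skills/week from a loop's cadence.
// Names here are hypothetical; only the input values come from the table.
function designCapacity({ cadenceHours, maxSkillsPerCycle }) {
  const hoursPerWeek = 7 * 24; // 168
  const cyclesPerWeek = hoursPerWeek / cadenceHours;
  return { cyclesPerWeek, skillsPerWeek: cyclesPerWeek * maxSkillsPerCycle };
}

// Trainer: one cycle every 12 hours, at most 2 synthesis actions per cycle.
const trainerCapacity = designCapacity({ cadenceHours: 12, maxSkillsPerCycle: 2 });
console.log(trainerCapacity); // { cyclesPerWeek: 14, skillsPerWeek: 28 }
```

Design capacity is an upper bound; it says nothing about how many skills a running system actually writes.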
2. Measured Armalo Baseline — 2026-05-17
All numbers in this section are measurements produced by scripts/research-experiments/rsi-1000x-bench.mjs against production Neon Postgres on 2026-05-17, with output at apps/web/content/research/data/rsi-1000x-bench.json. Window: last 30 days unless otherwise noted.
2.1 Substrate volume
| Table | 30-day rows | Total rows | Notes |
|---|---|---|---|
| `harness_evidence` | 31,000 | 31,000 | Primary RSI evidence ledger |
| `score_history` | 1,352 | 1,956 | 79 distinct agents scored in window |
| `harness_runs` | 0 | 0 | Schema present, write-path not yet shipping |
| `agent_variants` | 1 | 1 | Customer-side variant flywheel not yet live |
| `variant_invocations` | 0 | | |
The agent_variants substrate is sparse (1 row over 30 days) — this is the single largest gap to 1000x, called out explicitly in §3 and §4.
2.2 Cycle improvement rate
Metric definition: average composite_score delta per cycle across agents with ≥2 score_history rows in the 30-day window. A "cycle" = one score_history row. Δ/cycle = (last_score − first_score) ÷ (n_cycles − 1).
| Metric | Value |
|---|---|
| Sample N | 54 agents |
| Avg cycles per agent (30d) | 24.6 |
| Avg Δscore per cycle | +4.65 points |
| Avg total Δscore per agent over 30d | +57.26 points |
| Median Δscore per cycle | 0.00 (heavy zero mass — many cycles re-score without change) |
| Σ positive Δscore | 3,731 |
| Σ net Δscore | 3,092 |
Sample N=54 ≥ 30, adequate for reporting. The zero median is a deliberately honest note: a majority of cycles are re-scores with no movement, but the right tail of improvers carries the average.
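The metric definition above can be sketched in a few lines, which also shows how a zero median and a positive mean coexist. The histories below are made-up toy values, not production rows, and the sketch assumes the mean and median are taken over per-agent Δ/cycle values; the bench script may aggregate differently (for example, a median over individual cycles).

```javascript
// Per-agent delta per cycle = (last_score - first_score) / (n_cycles - 1),
// where each score_history row counts as one cycle.
function deltaPerCycle(scores) {
  if (scores.length < 2) return null; // agents with fewer than 2 rows are excluded
  return (scores[scores.length - 1] - scores[0]) / (scores.length - 1);
}

function mean(xs) {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function median(xs) {
  const sorted = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Toy histories, ordered oldest to newest (hypothetical data).
const histories = {
  a: [500, 500, 500],      // re-scored without change -> 0
  b: [620, 620],           // -> 0
  c: [400, 400, 400, 400], // -> 0
  d: [300, 340, 390],      // -> (390 - 300) / 2 = 45
  e: [710],                // single row, excluded
};

const deltas = Object.values(histories)
  .map(deltaPerCycle)
  .filter((d) => d !== null);

console.log(mean(deltas));   // 11.25
console.log(median(deltas)); // 0
```

One improver out of four is enough to pull the mean well above zero while the median stays at zero, which is the shape the table reports.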
2.3 Skills auto-promoted per agent per week
| Source | 7d | 30d | Notes |
|---|---|---|---|
| `agent_variants` rows | 0 | 1 | Customer-side flywheel: single shadow prompt for role=distro |
| `variant_invocations` rows | — | 0 | No measured customer-side variant invocations |
| Hermes-skill proxy (`cortex_memories` role=hermes tagged 'skill') | 0 | 0 | Matches Hermes gauntlet memoryWrites7d: 0 |
Measured Armalo skills/week = 0 on the closest current proxy. The schema is in place (agent_variants table, variant_invocations table, cortex_memories role=hermes skill rows) but the producer loops have not landed enough writes in 30 days to compute a meaningful ratio.
2.4 Pareto frontier expansion (top decile, 7d)
| Window | Frontier size (top decile) | Population |
|---|---|---|
| Now | 8 | 78 |
| 7 days ago | 9 | 75 |
| Expansion ratio | 0.889x | (8/9 — frontier counted via PERCENT_RANK ≥ 0.9) |
The frontier has slightly contracted in the past 7 days: the population grew (75 → 78) while the rank threshold tightened, dropping one agent from the top decile. This is not a regression in agent quality; it is a sample-size artifact at population N < 100.
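To make the threshold artifact concrete, here is a small sketch of a top-decile count using the same definition as SQL `PERCENT_RANK()`, that is (rank − 1) / (n − 1) over scores in ascending order. The function names and the toy score arrays are assumptions for illustration; the actual query in the bench script is not reproduced here.

```javascript
// PERCENT_RANK over ascending scores: (rank - 1) / (n - 1), where rank - 1
// equals the number of rows with a strictly lower score (ties share a rank).
function percentRanks(scores) {
  const n = scores.length;
  if (n <= 1) return scores.map(() => 0);
  return scores.map((s) => scores.filter((t) => t < s).length / (n - 1));
}

// Frontier = rows whose percent rank is at or above the threshold.
function frontierSize(scores, threshold = 0.9) {
  return percentRanks(scores).filter((p) => p >= threshold).length;
}

function expansionRatio(nowScores, prevScores) {
  const prev = frontierSize(prevScores);
  return prev === 0 ? null : frontierSize(nowScores) / prev;
}

// Toy populations (hypothetical): 12 distinct scores, then 13 with a tie near the top.
const prevScores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12];
const nowScores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 11, 12];

console.log(frontierSize(prevScores));              // 2
console.log(frontierSize(nowScores));               // 1
console.log(expansionRatio(nowScores, prevScores)); // 0.5
```

In the toy data the population grows by one agent and no score falls, yet the frontier shrinks, because a tie just below the top pushes one agent under the 0.9 cutoff. Small populations make the count sensitive to exactly this kind of boundary effect.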
2.5 Cost per quality point
| Quantity | Value |
|---|---|
| Σ harness_evidence.cost_usd (30d) | $223.40 |
| Σ harness_evidence tokens (30d) | 70,570,104 |
| Σ positive Δscore (30d) | 3,731 |
| Cost per quality point | $0.0599 USD |
For comparison, the Hermes baseline per-cycle cost is $0 (OAuth path), and its cost per quality point is undefined because no quality-point denominator was measured (apps/web/content/research/data/hermes-baseline-summary.json:cost).
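The cost figure is a single division of the two table entries. The sketch below reproduces it and returns `null` when there is no positive quality-point denominator; the function name is illustrative, not the bench script's.

```javascript
// Cost per quality point = total spend / sum of positive score deltas.
// Returns null when the denominator is missing or non-positive.
function costPerQualityPoint(totalCostUsd, positiveDelta) {
  if (!(positiveDelta > 0)) return null;
  return totalCostUsd / positiveDelta;
}

const armaloCost = costPerQualityPoint(223.40, 3731);
console.log(armaloCost.toFixed(4)); // 0.0599

// A side with no measured quality points has no defined cost per point.
const hermesCost = costPerQualityPoint(0, 0);
console.log(hermesCost); // null
```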
2.6 Action diversity
| Side | Distinct action types in 30d | Distinct roles producing evidence |
|---|---|---|
| Armalo platform-wide | 105 | 22 |
| Hermes baseline | 2 (directive, skill) | 1 (hermes) |
| Ratio | 52.5x | 22x |
2.7 Scan breadth
| Side | Distinct `signal_source` values in 30d |
|---|---|
| Armalo platform-wide | 4 |
| Hermes baseline (jarvis_heartbeats, jarvis_actions) | 2 |
| Ratio | 2x |
The 4-source measurement reflects what is currently recorded in harness_evidence.signal_source; the platform-wide ground truth of signal tables read across roles is higher but not yet uniformly instrumented. See §4 for the fix.
2.8 Compounding chains in evidence
| Field | Value |
|---|---|
| Evidence rows with parent_id (30d) | (see data file) |
| Distinct parent_ids (30d) | (see data file) |
Raw rows are in apps/web/content/research/data/rsi-1000x-bench.json:compounding_chains. A second-order compounding flywheel (skills that learn from skill outcomes) is not yet expressed in this data.
3. Comparison Summary — Armalo / Hermes
| Axis | Armalo (measured 2026-05-17) | Hermes (projection from 2026-05-16 baseline) | Ratio | 1000x target gap |
|---|---|---|---|---|
| Scan breadth (signal sources) | 4 | 2 | 2x | needs 500x more |
| Action diversity (action types) | 105 | 2 | 52.5x | needs 19x more |
| Cycle improvement rate (Δscore/cycle) | +4.65 | not measured by Hermes | n/a | Hermes has no equivalent metric |
| Skills auto-promoted (per week) | 0 (substrate empty) | 0 (gauntlet observed) | undefined | both ship 0 today |
Highest measured improvement: 52.5x action diversity (105 vs 2). Every other dimension is either at most 2x or undefined because at least one side has no measurable substrate. No measured axis reaches 1000x.
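The Ratio and gap columns can be recomputed from the two value columns. This is a minimal sketch under the assumption that the gap is defined as target ÷ ratio and that a zero or missing Hermes value makes the ratio undefined; the helper name and row labels are illustrative.

```javascript
const TARGET = 1000;

// Ratio = armalo / hermes; gap = how many times larger the ratio must grow
// to reach TARGET. Both are null when no valid denominator exists.
function compareAxis(armalo, hermes) {
  if (armalo == null || hermes == null || hermes === 0) {
    return { ratio: null, gap: null };
  }
  const ratio = armalo / hermes;
  return { ratio, gap: ratio > 0 ? TARGET / ratio : null };
}

const scanBreadth = compareAxis(4, 2);
const actionDiversity = compareAxis(105, 2);
const skillsPerWeek = compareAxis(0, 0);

console.log(scanBreadth);                    // { ratio: 2, gap: 500 }
console.log(actionDiversity.ratio);          // 52.5
console.log(actionDiversity.gap.toFixed(2)); // 19.05
console.log(skillsPerWeek);                  // { ratio: null, gap: null }
```

Returning `null` for 0 ÷ 0 keeps the skills axis out of any aggregate instead of letting it masquerade as parity.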
4. The Gap and the Buildable Path
The three concrete reasons the measured ratio is not yet 1000x:
1. The customer-side variant flywheel has not started writing. `agent_variants` has 1 row in 30 days; `variant_invocations` has 0. Without these tables filling, neither "skills auto-promoted per agent per week" nor "Pareto frontier expansion" can rise meaningfully. The schema exists at `packages/db/src/schema/agent-variants.ts` and `packages/db/src/schema/variant-invocations.ts`; the producer is not yet emitting in production.
2. Causal verification is absent on both sides. Hermes has no automated "directive sent → metric moved" check (`hermes-trainer.ts:227-246` writes directives but no tool reads acknowledgments). Armalo has 1,174 `harness_evidence.phase = verify` rows in 30d, but the `expected_impact` vs `observed_impact` JSONB columns are not yet wired into a real causal scorer.
3. Compounding stays single-layer on both sides. No skill effectiveness score, no skill graveyard, no second-order skill-of-skills. Both surfaces are SCHEMA-READY but not POPULATED. Until variants are invoked and their outcomes are scored, no second-order compounding can express.
The four code surfaces that must ship to credibly claim 1000x — each labeled as a projection of the work required, not a measurement:
| Surface | What ships | Where it lives today | Projected ratio it unlocks |
|---|---|---|---|
| Agent-variants flywheel activation | Promote variants on outcome-positive variant_invocations; demote on negative; persist effectiveness scores | packages/db/src/schema/agent-variants.ts + variant-invocations.ts (tables present, producer absent) | Skills/week from 0 → projected design capacity (this paper does not claim 1000x for this axis until at least one full cycle of variant promotion has been measured) |
| Multi-source scanner | Replace the 2-table Hermes scan with a 9-table scan across execution_traces, room_events, llm_dispatch_jobs, audit_log, cortex_memories, swarm_memory_entries, youtube_api_events |
The 1000x claim is buildable on the action diversity × compounding depth product, not on a single axis. Armalo already has 52.5x action diversity; multiplying it by a credible 20x compounding gain (single-layer → multi-layer with effectiveness scoring) yields ≥1000x on the composite RSI vector. Both factors are prerequisites; this paper measures one (action diversity) and projects the other (compounding).
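The product argument is easy to check numerically. In the sketch below, 52.5 is the measured action-diversity ratio and 20 is the projected compounding gain quoted above; the variable and function names are illustrative, and the composite is assumed to be a plain product of per-axis ratios.

```javascript
const TARGET_RATIO = 1000;

// Composite ratio as a plain product of per-axis ratios (an assumption
// about how the composite RSI vector is aggregated).
function compositeRatio(factors) {
  return factors.reduce((product, f) => product * f, 1);
}

const measuredActionDiversity = 52.5; // measurement (section 2.6)
const projectedCompoundingGain = 20;  // projection, not a measurement

const composite = compositeRatio([measuredActionDiversity, projectedCompoundingGain]);
const minCompoundingGain = TARGET_RATIO / measuredActionDiversity;

console.log(composite);                     // 1050
console.log(minCompoundingGain.toFixed(2)); // 19.05
```

The margin is thin: any compounding gain below roughly 19.05x leaves the product under the target.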
5. Replication
This paper is reproducible end-to-end. The pipeline:
1. Build Hermes baseline JSON from forensic research:

   ```
   node scripts/research-experiments/hermes-baseline-summary.mjs
   # Writes apps/web/content/research/data/hermes-baseline-summary.json
   ```

   Every field in this file is annotated with its provenance: `code-ref` (with file:line), `gauntlet` (with snapshot path), or `derivation` (with the formula). No field carries `measurement` provenance.

2. Measure Armalo present-day baseline against production Neon:

   ```
   DATABASE_URL=... node scripts/research-experiments/rsi-1000x-bench.mjs
   # Writes apps/web/content/research/data/rsi-1000x-bench.json
   ```

   The script queries `harness_evidence`, `score_history`, `agent_variants`, `variant_invocations`, and `cortex_memories`, runs 7 metric computations, and computes comparison ratios against the Hermes baseline JSON. Per-agent rows (top 100 by Δscore) are included for replication. The script tolerates empty substrate: any metric whose substrate is below the expected sample size is reported as `null` with a `reason` string.

3. Run the audit gate:

   ```
   pnpm research:audit
   ```

   Per the integrity rules at `.claude/CLAUDE.md`, every quantitative claim in this paper must be registered in `apps/web/content/research/claims-registry.json` with provenance one of {measurement, code-ref, derivation, projection}. The audit script fails the build if a number appears in prose that is not registered.
The Hermes baseline is a projection because it summarizes external research; the Armalo baseline is a measurement because it ran live against production. The 1000x target is a projection of where Armalo could go after the four code surfaces in §4 ship. No number in this paper is a measurement of Armalo achieving 1000x Hermes RSI — only the four sub-ratios from §3.
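The audit rule can be pictured as two checks: every registered claim carries one of the four provenance kinds, and every number found in prose appears in the registry. The sketch below is a toy version under stated assumptions; the registry entry shape (`value`, `provenance`) and the number-matching regex are hypothetical and are not taken from `scripts/audit-research-claims.mjs` or `claims-registry.json`.

```javascript
const PROVENANCE_KINDS = new Set(["measurement", "code-ref", "derivation", "projection"]);

// Hypothetical registry shape: [{ value: "52.5x", provenance: "measurement" }, ...]
function auditProse(prose, registry) {
  const errors = [];

  // Check 1: every registered claim uses a known provenance kind.
  for (const claim of registry) {
    if (!PROVENANCE_KINDS.has(claim.provenance)) {
      errors.push(`claim ${claim.value}: unknown provenance "${claim.provenance}"`);
    }
  }

  // Check 2: every number in prose is registered. The matcher is naive:
  // optional "$", digits with optional thousands groups and decimals,
  // optional "x" or "%" suffix.
  const registered = new Set(registry.map((c) => c.value));
  const numbers = prose.match(/\$?\d+(?:,\d{3})*(?:\.\d+)?(?:x|%)?/g) ?? [];
  for (const n of numbers) {
    if (!registered.has(n)) errors.push(`unregistered number in prose: ${n}`);
  }
  return errors;
}

const toyRegistry = [
  { value: "52.5x", provenance: "measurement" },
  { value: "2x", provenance: "measurement" },
];
const toyProse = "Action diversity is 52.5x, scan breadth is 2x, cost is $0.0599.";

console.log(auditProse(toyProse, toyRegistry));
// [ 'unregistered number in prose: $0.0599' ]
```

A build gate would fail whenever the returned array is non-empty; a real implementation would also need to exempt dates, section numbers, and file:line references, which this sketch does not attempt.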
How to extend this paper
When the four code surfaces in §4 land, re-run the bench:
```
node scripts/research-experiments/rsi-1000x-bench.mjs
```

Compare the new `rsi-1000x-bench.json` against this one. Each measured ratio's movement is documented. If `agent_variants_30d` rises above 30 with positive `variant_invocations`, the skill-synthesis ratio becomes measurable for the first time. If `expected_impact` / `observed_impact` are populated for ≥30 variants, the causal verifier ratio becomes measurable. Until those substrates fill, the paper honestly says "not measured" rather than naming a number.
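The "not measured rather than a number" rule amounts to a sample-size guard around each metric. Below is a minimal sketch of such a guard, assuming a threshold of 30 rows as quoted above; the helper name, the result shape, and the metric label are illustrative and may differ from what `rsi-1000x-bench.mjs` actually emits.

```javascript
const MIN_SAMPLE = 30; // threshold quoted in the text; the constant name is hypothetical

// Report a metric only when its substrate is large enough;
// otherwise return null together with a human-readable reason.
function reportMetric(name, sampleSize, compute) {
  if (sampleSize < MIN_SAMPLE) {
    return {
      name,
      value: null,
      reason: `substrate has ${sampleSize} rows, need >= ${MIN_SAMPLE}`,
    };
  }
  return { name, value: compute(), reason: null };
}

// Today: agent_variants has 1 row in 30 days, so the ratio is withheld.
const skillRatioToday = reportMetric("skill_synthesis_ratio", 1, () => 0);
console.log(skillRatioToday.value);  // null
console.log(skillRatioToday.reason); // substrate has 1 rows, need >= 30

// Cycle improvement rate already clears the bar with N=54 agents.
const cycleRate = reportMetric("avg_delta_per_cycle", 54, () => 4.65);
console.log(cycleRate.value); // 4.65
```

Passing the computation as a callback means an under-sampled metric is never evaluated at all, so a misleading value cannot leak into the output file.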
6. Conclusion
We measured the Armalo RSI baseline against Hermes on six axes. The single dimension where Armalo already exceeds Hermes by a wide margin is action diversity (52.5x). The other dimensions are short of 1000x either because the customer-side substrate has not yet started writing (agent_variants has 1 row) or because the comparison axis is single-layer on both sides. The paper does not claim 1000x today; it documents the present-day ratio, names the four code surfaces required to credibly claim 1000x, and provides a re-runnable measurement harness so the claim can be re-checked the moment the substrates fill. The integrity discipline of pnpm research:audit is what makes that future re-run honest rather than a fresh round of fabrication.