Where is this research published?

Armalo Labs Technical Series — https://www.armalo.ai/labs/research/2026-armalo-agent-rsi-1000x. The paper is publicly available and citable.

Toward Armalo Agent RSI 1000x: Measured Baseline and a Buildable Path

Empirical Honesty Note

This paper does not claim that the Armalo Agent recursive-self-improvement (RSI) stack is 1000x better than Hermes today. 1000x is the target. What we measured against the live production database on 2026-05-17 is a present-day baseline — significant on some axes (52.5x action diversity), unmeasurable on others (the customer-side agent_variants flywheel has 1 row in 30 days), and short of 1000x everywhere. The paper documents what is measured, what is projected, and what code needs to ship to close the gap.

Every quantitative claim in the prose traces to one of four provenance kinds: measurement (a committed script's output data file), code-ref (a specific file:line in the codebase), derivation (a formula stated in §Replication), or projection (forward-looking estimate explicitly labeled in prose). Numbers that have not been measured are removed or labeled, never invented. The public claims-registry audit process enforces this.

Abstract

Recursive self-improvement is the property that an agent's cycle N reasoning is informed by, and meaningfully better than, cycle N−1. We define a six-dimensional RSI capability vector (scan breadth, ranking sophistication, action diversity, verification depth, learning persistence, and compounding layers) and measure both the Hermes admin-swarm baseline (from forensic research, §1) and the current Armalo platform (live DB measurement, §2). The measured present-day improvement ratios are: 52.5x action diversity, 2x scan breadth, an undefined skill synthesis throughput ratio (both sides observed at 0/week in the 7-day snapshot), and single-layer compounding on both sides. Across a sample of N=54 customer-scored agents, the average composite score moves +4.65 points per cycle at $0.06 per quality point. §3 frames why these ratios fall short of 1000x and §4 specifies the four code surfaces (agent_variants flywheel activation, multi-source scanner, causal verifier, second-order compounding) that must ship to credibly claim 1000x.

1. Hermes Baseline — Capability Vector

The Hermes RSI baseline is taken from forensic research at .planning/rsi-1000x/RESEARCH-HERMES-BASELINE.md (2026-05-16). That document cites every Hermes property with file:line pointers; the structured extract this paper consumes is at the published measurement artifact.

Per the public claims-registry audit rules, every number in this section is a projection of Hermes-baseline capability — we did not run Hermes ourselves and measure its outputs in this session.

1.1 What Hermes is

Two admin-swarm loops:

Scanner — every 4 hours, 3 tools, Codex-safe. Reads the last 80 rows of jarvis_heartbeats and last 100 rows of jarvis_actions (tooling/admin-swarm/src/loops/hermes-scanner.ts:100-122). Output: 1–3 patterns per cycle, persisted to cortex_memories with a 1-day TTL (hermes-scanner.ts:179).
Trainer — every 12 hours, 6 tools, cap of 2 synthesis actions per cycle (hermes-trainer.ts:437). Output: optional directive (12h TTL) and/or skill (permanent). LLM-synthesis is off by default — HERMES_ENABLE_LLM_SYNTHESIS defaults to false (hermes-trainer.ts:41), so the default Trainer cycle is a deterministic no-op fallback.

1.2 Hermes capability vector (projection)

Axis	Hermes baseline value	Source
Scan breadth (distinct signal tables)	2	`hermes-scanner.ts:100-122`
Action diversity (distinct action types)	2 (directive, skill)	`hermes-trainer.ts:227-277`
Trainer cycle throughput at design cadence	14 cycles/week	derived: 7d×24h ÷ 12h
Max skills per cycle	2	`hermes-trainer.ts:437`
Design capacity skills/week	28	derived: 14 cycles × 2 skills
Observed skills/week (2026-05-12 snapshot)	0	prior gauntlet snapshot (`memoryWrites7d: 0`)

The structured JSON behind this table is the published measurement artifact, produced by the committed measurement producer.
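The two derived rows in the table follow directly from the Trainer cadence and the per-cycle cap. A minimal sketch of that arithmetic is shown here; the constant names are illustrative and are not identifiers from the Hermes codebase.

```python
# Derived Hermes design-capacity figures from section 1.2.
# Constant names are illustrative; the values come from the table above.
HOURS_PER_WEEK = 7 * 24
TRAINER_INTERVAL_HOURS = 12   # Trainer runs every 12 hours
MAX_SKILLS_PER_CYCLE = 2      # cap of 2 synthesis actions per cycle

cycles_per_week = HOURS_PER_WEEK // TRAINER_INTERVAL_HOURS
design_capacity_skills_per_week = cycles_per_week * MAX_SKILLS_PER_CYCLE

print(cycles_per_week, design_capacity_skills_per_week)  # 14 28
```

This is design capacity only; the observed figure depends on whether LLM synthesis is enabled.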

2. Measured Armalo Baseline — 2026-05-17

All numbers in this section are measurements produced by the committed measurement producer against production Neon Postgres on 2026-05-17, with output at the published measurement artifact. Window: last 30 days unless otherwise noted.

2.1 Substrate volume

Table	30-day rows	Total rows	Notes
`harness_evidence`	31,000	31,000	Primary RSI evidence ledger
`score_history`	1,352	1,956	79 distinct agents scored in window
`harness_runs`	0	0	Schema present, write-path not yet shipping
`agent_variants`	1	1	Customer-side variant flywheel not yet live
`variant_invocations`

The agent_variants substrate is sparse (1 row over 30 days) — this is the single largest gap to 1000x, called out explicitly in §3 and §4.

2.2 Cycle improvement rate

Metric definition: average composite_score delta per cycle across agents with ≥2 score_history rows in the 30-day window. A "cycle" = one score_history row. Δ/cycle = (last_score − first_score) ÷ (n_cycles − 1).

Metric	Value
Sample N	54 agents
Avg cycles per agent (30d)	24.6
Avg Δscore per cycle	+4.65 points
Avg total Δscore per agent over 30d	+57.26 points
Median Δscore per cycle	0.00 (heavy zero mass — many cycles re-score without change)
Σ positive Δscore	3,731
Σ net Δscore	3,092

Sample N=54 ≥ 30, which is adequate for reporting. The zero median is a deliberate honest note: a majority of cycles are re-scores with no movement, but the right tail of improvers carries the average.
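The metric definition in this subsection can be sketched as follows. This is an illustration under stated assumptions, not the committed measurement producer: the function names are hypothetical, the example histories are made up, and the median here is taken over per-agent values, which may differ from how the producer aggregates.

```python
from statistics import mean, median

def delta_per_cycle(scores):
    """Per-agent delta/cycle = (last_score - first_score) / (n_cycles - 1).

    `scores` is one agent's composite scores in time order, one per
    score_history row. Agents with fewer than 2 rows are excluded (None).
    """
    if len(scores) < 2:
        return None
    return (scores[-1] - scores[0]) / (len(scores) - 1)

def summarize(histories):
    """Average and median per-agent delta/cycle across eligible agents."""
    deltas = [d for d in map(delta_per_cycle, histories.values()) if d is not None]
    return {"n": len(deltas), "avg": mean(deltas), "median": median(deltas)}

# Made-up illustration data, not rows from the production database.
example = {
    "agent-a": [50.0, 50.0, 50.0],  # re-scored without change
    "agent-b": [40.0, 46.0, 52.0],  # steady improver
    "agent-c": [61.0, 61.0],        # re-scored without change
    "agent-d": [70.0],              # only one row: excluded
}
print(summarize(example))  # {'n': 3, 'avg': 2.0, 'median': 0.0}
```

The toy data reproduces the shape reported above: a zero median alongside a positive average, because one improver carries the mean.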

2.3 Skills auto-promoted per agent per week

Source	7d	30d	Notes
`agent_variants` rows	0	1	Customer-side flywheel: single shadow prompt for role=distro
`variant_invocations` rows	—	0	No measured customer-side variant invocations
Hermes-skill proxy (`cortex_memories` role=hermes tagged 'skill')	0	0	Matches Hermes gauntlet `memoryWrites7d: 0`

Measured Armalo skills/week = 0 on the closest current proxy. The schema is in place (agent_variants table, variant_invocations table, cortex_memories role=hermes skill rows) but the producer loops have not landed enough writes in 30 days to compute a meaningful ratio.

2.4 Pareto frontier expansion (top decile, 7d)

Window	Frontier size (top decile)	Population
Now	8	78
7 days ago	9	75
Expansion ratio	0.889x	(8/9 — frontier counted via PERCENT_RANK ≥ 0.9)

The frontier has slightly contracted in the past 7 days — the population grew (75 → 78) faster than the top decile (the rank threshold tightened by one agent). This is not a regression in agent quality; it is a sample-size artifact at population N < 100.
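A minimal sketch of the frontier count and expansion ratio, assuming PERCENT_RANK follows the standard SQL window-function definition, (rank − 1) / (n − 1), with tied scores sharing the lowest rank, and assuming the expansion ratio is the current frontier size divided by the size 7 days earlier. The function names and the synthetic populations are illustrative and are not taken from the measurement producer.

```python
from bisect import bisect_left

def frontier_size(scores, threshold_tenths=9):
    """Count rows whose percent rank is >= threshold_tenths / 10.

    Percent rank is (rank - 1) / (n - 1), where rank - 1 equals the number
    of rows with a strictly lower score. Integer arithmetic is used so the
    comparison at the threshold has no floating-point edge cases.
    """
    n = len(scores)
    if n < 2:
        return 0  # percent rank of a single row is 0, below any positive threshold
    ordered = sorted(scores)
    count = 0
    for s in ordered:
        rows_below = bisect_left(ordered, s)
        if rows_below * 10 >= threshold_tenths * (n - 1):
            count += 1
    return count

def expansion_ratio(size_now, size_before):
    """Frontier size now divided by frontier size one window earlier."""
    return size_now / size_before

# Synthetic populations with distinct scores, not the measured data.
print(frontier_size(range(20)), frontier_size(range(21)))  # 2 3
print(round(expansion_ratio(8, 9), 3))                     # 0.889
```

At small populations the frontier size moves in whole-agent steps, so a change of one agent produces a large swing in the ratio.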

2.5 Cost per quality point

Quantity	Value
Σ harness_evidence.cost_usd (30d)	$223.40
Σ harness_evidence tokens (30d)	70,570,104
Σ positive Δscore (30d)	3,731
Cost per quality point	$0.0599 USD

For comparison, the Hermes baseline per-cycle cost is $0 (OAuth path), but its cost per quality point is effectively unbounded because no quality-point denominator was measured (source: the published measurement artifact).
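The cost-per-quality-point figure is the 30-day cost divided by the sum of positive score deltas. A short sketch of that division, with an explicit guard for the missing-denominator case; the function name is hypothetical, and passing 0 for Hermes is a stand-in for "not measured".

```python
def cost_per_quality_point(total_cost_usd, positive_delta_sum):
    """USD per positive score point, or None when no positive movement was measured."""
    if positive_delta_sum <= 0:
        return None
    return total_cost_usd / positive_delta_sum

armalo = cost_per_quality_point(223.40, 3731)
print(f"${armalo:.4f}")  # $0.0599

# Hermes: $0 cost and no measured denominator, so the metric is undefined.
hermes = cost_per_quality_point(0.0, 0)
print(hermes)  # None
```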

2.6 Action diversity

Side	Distinct action types in 30d	Distinct roles producing evidence
Armalo platform-wide	105	22
Hermes baseline	2 (directive, skill)	1 (`hermes`)
Ratio	52.5x	22x

2.7 Scan breadth

Side	Distinct `signal_source` values in 30d
Armalo platform-wide	4
Hermes baseline (jarvis_heartbeats, jarvis_actions)	2
Ratio	2x

The 4-source measurement reflects what is currently recorded in harness_evidence.signal_source; the platform-wide ground truth of signal tables read across roles is higher but not yet uniformly instrumented. See §4 for the fix.

2.8 Compounding chains in evidence

Field	Value
Evidence rows with parent_id (30d)	(see data file)
Distinct parent_ids (30d)	(see data file)

Raw rows are in the published measurement artifact. A second-order compounding flywheel (skills that learn from skill outcomes) is not yet expressed in this data.

3. Comparison Summary — Armalo / Hermes

Axis	Armalo (measured 2026-05-17)	Hermes (projection from 2026-05-16 baseline)	Ratio	1000x target gap
Scan breadth (signal sources)	4	2	2x	needs 500x more
Action diversity (action types)	105	2	52.5x	needs 19x more
Cycle improvement rate (Δscore/cycle)	+4.65	not measured by Hermes	n/a	Hermes has no equivalent metric
Skills auto-promoted (per week)	0 (substrate empty)	0 (gauntlet observed)	undefined	both ship 0 today

Highest measured improvement: 52.5x action diversity (105 vs 2). Every other dimension is either at most 2x or undefined because one side has no measurable substrate. No measured axis reaches 1000x.
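The ratio and gap columns in the table can be reproduced with a few lines. This sketch assumes the gap is the target divided by the measured ratio, and it reports None wherever a ratio is undefined; the function names are illustrative.

```python
TARGET = 1000.0

def ratio(armalo, hermes):
    """Armalo / Hermes, or None when either side is unmeasured or Hermes is zero."""
    if armalo is None or hermes is None or hermes == 0:
        return None
    return armalo / hermes

def remaining_gap(r, target=TARGET):
    """Multiplier still needed to reach the target; None if the ratio is undefined."""
    if r is None or r == 0:
        return None
    return target / r

axes = {
    "scan breadth": (4, 2),
    "action diversity": (105, 2),
    "cycle improvement rate": (4.65, None),   # Hermes has no equivalent metric
    "skills auto-promoted per week": (0, 0),  # both sides at 0
}
for name, (a, h) in axes.items():
    r = ratio(a, h)
    gap = remaining_gap(r)
    print(name, r, None if gap is None else round(gap, 2))
```

Returning None rather than a number for the 0/0 and unmeasured cases keeps empty substrates from being read as a ratio of 1x or 0x.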

4. The Gap and the Buildable Path

The three concrete reasons the measured ratio is not yet 1000x:

1. The customer-side variant flywheel has not started writing. agent_variants has 1 row in 30 days; variant_invocations has 0. Without these tables filling, neither "skills auto-promoted per agent per week" nor "Pareto frontier expansion" can rise meaningfully. The schema exists at packages/db/src/schema/agent-variants.ts and packages/db/src/schema/variant-invocations.ts; the producer is not yet emitting in production.
2. Causal verification is absent on both sides. Hermes has no automated "directive sent → metric moved" check (hermes-trainer.ts:227-246 writes directives but no tool reads acknowledgments). Armalo has 1,174 harness_evidence.phase = verify rows in 30d, but the expected_impact vs observed_impact JSONB columns are not yet wired into a real causal scorer.
3. Compounding stays single-layer on both sides. No skill effectiveness score, no skill graveyard, no second-order skill-of-skills. Both surfaces are SCHEMA-READY but not POPULATED. Until variants are invoked and their outcomes are scored, no second-order compounding can express.

The four code surfaces that must ship to credibly claim 1000x — each labeled as a projection of the work required, not a measurement:

Surface	What ships	Where it lives today	Projected ratio it unlocks
Agent-variants flywheel activation	Promote variants on outcome-positive `variant_invocations`; demote on negative; persist effectiveness scores	`packages/db/src/schema/agent-variants.ts` + `variant_invocations.ts` (tables present, producer absent)	Skills/week from 0 → projected design capacity (this paper does not claim 1000x for this axis until at least one full cycle of variant promotion has been measured)
Multi-source scanner	Replace the 2-table Hermes scan with a multi-table scan across `execution_traces`, `room_events`, `llm_dispatch_jobs`, `audit_log`, `cortex_memories`, `youtube_api_events`, `whop_subscriptions`

The 1000x claim is buildable on the product of action diversity and compounding depth, not on a single axis. Armalo already has 52.5x action diversity; multiplied by a credible 20x compounding gain (single-layer → multi-layer with effectiveness scoring), this yields 52.5 × 20 = 1050x, which is ≥1000x on the composite RSI vector. Both factors are prerequisites; this paper measures one (action diversity) and projects the other (compounding).
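Written out as a worked product, with the symbols $r_{\text{div}}$ (measured action-diversity ratio) and $r_{\text{comp}}$ (projected compounding gain) introduced here purely for illustration:

```latex
r_{\text{composite}} = r_{\text{div}} \times r_{\text{comp}} = 52.5 \times 20 = 1050 \geq 1000,
\qquad
r_{\text{comp}}^{\min} = \frac{1000}{52.5} \approx 19.05
```

The second expression is the smallest compounding gain that would reach the target at the measured diversity ratio, which is why the 20x figure carries little margin.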

5. Replication

This paper is reproducible end-to-end. The pipeline:

1. Build the baseline artifact from the public forensic summary. Every field in that artifact is annotated with its provenance class: code reference, prior gauntlet snapshot, or derivation.

2. Measure the Armalo present-day baseline against the production evidence snapshot. The measurement reads the relevant evidence, scoring, variant, and memory substrates; computes the seven metric families; and records empty-substrate cases explicitly rather than filling them with optimistic guesses.

3. Verify the claim registry. Every quantitative claim in this paper must be registered with exactly one provenance kind from {measurement, code reference, derivation, projection}. Public readers should be able to trace each number to the measurement artifact or to the stated derivation without seeing private credentials, private rows, or internal runner paths.
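The registry check in the last step can be pictured as a small validator. The sketch below is hypothetical: the `Claim` type, the `audit` function, and the example entries are illustrations of the rule that every claim carries one of the four provenance kinds and a traceable source, not the actual audit tooling.

```python
from dataclasses import dataclass

# The four provenance kinds named in this paper.
PROVENANCE_KINDS = {"measurement", "code-ref", "derivation", "projection"}

@dataclass(frozen=True)
class Claim:
    text: str
    provenance: str
    source: str  # data file, file:line, formula reference, or projection label

def audit(claims):
    """Return human-readable problems; an empty list means the registry passes."""
    problems = []
    for c in claims:
        if c.provenance not in PROVENANCE_KINDS:
            problems.append(f"{c.text!r}: unknown provenance {c.provenance!r}")
        if not c.source.strip():
            problems.append(f"{c.text!r}: missing source")
    return problems

registry = [
    Claim("52.5x action diversity", "measurement", "published measurement artifact"),
    Claim("14 Trainer cycles/week", "derivation", "7d x 24h / 12h"),
    Claim("20x compounding gain", "projection", "section 4"),
    Claim("placeholder claim", "estimate", ""),  # deliberately invalid entry
]
for problem in audit(registry):
    print(problem)
```

Only the deliberately invalid entry is flagged, once for its unknown provenance kind and once for its empty source.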

The Hermes baseline is a projection because it summarizes external research; the Armalo baseline is a measurement because it ran live against production. The 1000x target is a projection of where Armalo could go after the four code surfaces in §4 ship. No number in this paper is a measurement of Armalo achieving 1000x Hermes RSI — only the four sub-ratios from §3.

How to extend this paper

When the four code surfaces in §4 land, publish a fresh measurement artifact and compare it against this one. Each measured ratio's movement should be documented. If agent_variants_30d rises above 30 with positive variant_invocations, the skill-synthesis ratio becomes measurable for the first time. If expected_impact / observed_impact are populated for at least 30 variants, the causal verifier ratio becomes measurable. Until those substrates fill, the paper honestly says "not measured" rather than naming a number.
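The two thresholds above can be expressed as a simple gate that decides whether a ratio may be reported at all. This is a sketch under assumptions: the function and key names are invented for illustration, and the count of variants with populated impact columns is passed as 0 only as a placeholder, since this paper does not report that count.

```python
MIN_SAMPLE = 30

def measurable_axes(agent_variants_30d, positive_variant_invocations, variants_with_impact):
    """Report which ratios have enough substrate to be measured.

    - skill synthesis: agent_variants_30d above 30 with positive variant invocations
    - causal verifier: expected/observed impact populated for at least 30 variants
    """
    return {
        "skill_synthesis_ratio": (
            agent_variants_30d > MIN_SAMPLE and positive_variant_invocations > 0
        ),
        "causal_verifier_ratio": variants_with_impact >= MIN_SAMPLE,
    }

# Present-day substrate: 1 agent_variants row, 0 variant_invocations rows.
print(measurable_axes(1, 0, 0))
```

With the present-day substrate both gates are closed, so both ratios stay "not measured".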

6. Conclusion

We measured the Armalo RSI baseline against Hermes on six axes. The single dimension where Armalo already exceeds Hermes by a wide margin is action diversity (52.5x). The other dimensions are short of 1000x either because the customer-side substrate has not yet started writing (agent_variants has 1 row) or because the comparison axis is single-layer on both sides. The paper does not claim 1000x today; it documents the present-day ratio, names the four implementation surfaces required to credibly claim 1000x, and requires public provenance so the claim can be re-checked the moment the substrates fill.