A preview of Armalo Build's SWE-bench Verified methodology, including the governed and SWE-tuned configurations and the signed trust receipt artifact that will accompany each evaluated patch.
This page is a preview for the SWE-bench Verified writeup. Real numbers and the per-repo breakdown land here once the full 500-task run completes. Tracking dashboard: /dashboard/admin/swe-bench.
The pipeline that will produce the published number is live in [packages/swe-bench-eval](https://github.com/fongryan/armalo). Until the run finishes, this preview describes the methodology and the trust-receipt artifact that ships with every patch, pass or fail.
The agent gets problem_statement and hints_text from the original issue. It never sees FAIL_TO_PASS / PASS_TO_PASS test identifiers. The patch is evaluated inside the official swebench/sweb.eval.x86_64.* Docker image used by the public leaderboard.
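A minimal sketch of that input boundary, assuming each task arrives as a plain object using the SWE-bench dataset's field names. `buildAgentInput` and the allow-list are hypothetical names for illustration, not code from packages/swe-bench-eval, and keeping `instance_id` for bookkeeping is an assumption of this sketch.

```javascript
// Allow-list of fields the agent may read, per the methodology above.
// instance_id is kept only for bookkeeping (an assumption of this sketch).
const AGENT_VISIBLE_FIELDS = ["instance_id", "problem_statement", "hints_text"];

// Test identifiers that must never reach the agent.
const HIDDEN_FIELDS = ["FAIL_TO_PASS", "PASS_TO_PASS"];

function buildAgentInput(task) {
  const input = {};
  for (const field of AGENT_VISIBLE_FIELDS) {
    if (field in task) input[field] = task[field];
  }
  // Defensive guard: fail loudly if the allow-list is ever edited to leak a hidden field.
  for (const field of HIDDEN_FIELDS) {
    if (field in input) throw new Error(`hidden field leaked: ${field}`);
  }
  return input;
}

// Placeholder task, not a real SWE-bench instance.
const exampleTask = {
  instance_id: "example__repo-1",
  problem_statement: "Example issue text",
  hints_text: "",
  FAIL_TO_PASS: ["tests/test_x.py::test_bug"],
  PASS_TO_PASS: ["tests/test_x.py::test_ok"],
};
const agentInput = buildAgentInput(exampleTask);
console.log(Object.keys(agentInput));
```

An allow-list is used rather than a deny-list so that any field added to the task record later stays hidden by default.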
Cite this work
Armalo Labs Research Team (2026). Armalo Build on SWE-bench Verified: preview. Armalo Labs Technical Series, Armalo AI. https://www.armalo.ai/labs/research/2026-05-12-swe-bench-trust-receipts-preview
Armalo Labs Technical Series · ISSN pending
Two configurations, both published

| Configuration | What it is |
| --- | --- |
| Armalo Build (default) | Production pipeline: parallel coders × 3, pre-plan + post-plan + per-file + pre-deliver jury votes, policy gates, full trust receipt minted and signed. |
| Armalo Build (SWE-tuned) | Single-shot coder, only the pre-deliver jury vote, no policy gates. Apples-to-apples comparison with frontier-lab agentic harnesses. |

We publish both because the gap between them is the cost of governance: the price a CISO pays to get a signed audit trail for every line of AI code. Our hypothesis: the gap is small enough that any enterprise should pay it gladly.
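To make the comparison concrete, the two configurations and the gap between them could be expressed roughly as follows. The field names (`parallelCoders`, `juryVotes`, `policyGates`) and the `governanceGap` helper are hypothetical, not the repo's actual schema, and the counts in the example call are toy placeholders, not results.

```javascript
// Hypothetical config shape; keys mirror the prose above, not the real schema.
const CONFIGS = {
  default: {
    parallelCoders: 3,
    juryVotes: ["pre-plan", "post-plan", "per-file", "pre-deliver"],
    policyGates: true,
  },
  "swe-tuned": {
    parallelCoders: 1, // single-shot coder
    juryVotes: ["pre-deliver"],
    policyGates: false,
  },
};

// Gap in percentage points between the SWE-tuned and default resolve rates.
// A positive value means the governed pipeline resolved fewer tasks.
function governanceGap(tunedResolved, defaultResolved, totalTasks) {
  if (totalTasks <= 0) throw new RangeError("totalTasks must be positive");
  const rate = (resolved) => (100 * resolved) / totalTasks;
  return rate(tunedResolved) - rate(defaultResolved);
}

// Toy placeholder counts on a 10-task set, NOT measurements.
const toyGap = governanceGap(6, 5, 10);
console.log(`gap: ${toyGap} percentage points`);
```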
What's actually in the trust receipt
Every patch, pass or fail, produces a W3C Verifiable Credential containing:
- Model provenance for every file (which LLM wrote it, version, dispatcher logs)
- Each juror's verdict with reasoning chain (outliers trimmed before consensus)
- Every eval check that ran with its output log
- Policy gate decisions (secret-leak, scope-honesty, zero-trust, including rejections)
- SHA-256 of the unified diff (tamper-detectable)
- Ed25519 signature by a per-run DID rooted at did:web:armalo.ai
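The diff hash in the list above is what makes a receipt tamper-detectable: anyone holding the patch can rehash it and compare. The sketch below illustrates that check, assuming a simplified payload. `buildReceiptPayload`, `diffMatchesReceipt`, and every field name are hypothetical stand-ins, not the actual credential schema.

```javascript
const { createHash } = require("node:crypto");

// Hex-encoded SHA-256 of a UTF-8 string.
function sha256Hex(text) {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Hypothetical payload shape mirroring the list above; not the real schema.
// The diff itself is not embedded, only its hash.
function buildReceiptPayload({ diff, provenance, juryVerdicts, evalChecks, gateDecisions }) {
  return {
    provenance,
    juryVerdicts,
    evalChecks,
    gateDecisions,
    diffSha256: sha256Hex(diff),
  };
}

// True only if the supplied diff is byte-for-byte the one that was hashed.
function diffMatchesReceipt(diff, payload) {
  return sha256Hex(diff) === payload.diffSha256;
}

// Placeholder values for illustration only.
const exampleDiff = "--- a/f.py\n+++ b/f.py\n@@ -1 +1 @@\n-x = 1\n+x = 2\n";
const examplePayload = buildReceiptPayload({
  diff: exampleDiff,
  provenance: [{ file: "f.py", model: "example-model" }],
  juryVerdicts: [],
  evalChecks: [],
  gateDecisions: [],
});
console.log(diffMatchesReceipt(exampleDiff, examplePayload));       // true
console.log(diffMatchesReceipt(exampleDiff + " ", examplePayload)); // false
```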
You can verify any Armalo receipt without contacting us.
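A rough sketch of what offline verification could look like, using Node's built-in Ed25519 support. Several things here are assumptions: the receipt is modeled as `{ payload, signature }` with a base64 signature, the signed bytes are a sorted-key JSON serialization (a real Verifiable Credential proof defines its own canonicalization), and the verifier already holds the public key from the run's DID document. A throwaway key pair stands in for the per-run key.

```javascript
const { generateKeyPairSync, sign, verify } = require("node:crypto");

// Deterministic JSON with sorted keys so signer and verifier see identical bytes.
// Assumption of this sketch, not the credential's actual canonicalization.
function canonicalize(value) {
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const body = Object.keys(value)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${canonicalize(value[key])}`)
      .join(",");
    return `{${body}}`;
  }
  return JSON.stringify(value);
}

// Returns true only if the signature matches the payload under publicKey.
function verifyReceipt(receipt, publicKey) {
  const bytes = Buffer.from(canonicalize(receipt.payload), "utf8");
  const signature = Buffer.from(receipt.signature, "base64");
  return verify(null, bytes, publicKey, signature); // null: Ed25519 takes no digest name
}

// Demo: a throwaway key pair stands in for the per-run DID key.
const { publicKey, privateKey } = generateKeyPairSync("ed25519");
const payload = { diffSha256: "0".repeat(64), verdict: "fail" }; // placeholder values
const receipt = {
  payload,
  signature: sign(null, Buffer.from(canonicalize(payload), "utf8"), privateKey).toString("base64"),
};

const intact = verifyReceipt(receipt, publicKey);
const tampered = verifyReceipt(
  { ...receipt, payload: { ...payload, verdict: "pass" } },
  publicKey
);
console.log({ intact, tampered }); // { intact: true, tampered: false }
```

Flipping a single field in the payload invalidates the signature, which is the property that lets a third party check a receipt with nothing but the public key.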
Once the full 500-task run completes, this page will add:
- Per-repo breakdown (where are we strong? where weak?)
- Failure mode analysis (patch_apply_failed vs regression vs fail_to_pass_incomplete)
- Cost and wall-clock distributions
- Submission to the public swebench.com leaderboard
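The three failure modes named above could be assigned along these lines. This is a sketch under assumptions: `classifyOutcome` is a hypothetical helper, the `resolved` label and the precedence (regressions reported before incomplete fixes) are choices made here, and test results are modeled as a simple map from test ID to pass/fail.

```javascript
// testResults maps a test ID to true (passed) or false (failed);
// a test missing from the map counts as not passed.
function classifyOutcome({ patchApplied, failToPass, passToPass, testResults }) {
  if (!patchApplied) return "patch_apply_failed";
  const passed = (id) => testResults[id] === true;
  // A previously passing test now fails: the patch broke something.
  if (!passToPass.every(passed)) return "regression";
  // Nothing broke, but not every target test was fixed.
  if (!failToPass.every(passed)) return "fail_to_pass_incomplete";
  return "resolved";
}

// Placeholder test IDs for illustration.
const outcome = classifyOutcome({
  patchApplied: true,
  failToPass: ["test_bug_a", "test_bug_b"],
  passToPass: ["test_ok"],
  testResults: { test_bug_a: true, test_bug_b: false, test_ok: true },
});
console.log(outcome); // fail_to_pass_incomplete
```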
How to reproduce
git clone https://github.com/fongryan/armalo
cd armalo/packages/swe-bench-eval
pnpm install
pnpm full default # 500 tasks, default config
pnpm full swe-tuned # 500 tasks, SWE-tuned config
Live receipts will populate at [/trust/build-receipts](/trust/build-receipts) as the run progresses.
The score is the price of admission. The receipt is the product.
Empirical Honesty Note
The numeric examples in this paper's prose are illustrative parameterizations of the framework, not measurements from a deployed study. Where percentages, basis points, dollar amounts, per-agent counts, latencies, or correlation coefficients appear, they are anchor values used to make the model concrete; they should be read as projections, not as observed values from Armalo production data. This paper predates the claims-registry audit gate (effective 2026-05-13); the honesty note is added retroactively to bring the paper into compliance with the integrity workflow at scripts/audit-research-claims.mjs.
Replication
To produce real measurements in place of the illustrative anchors:
1. Identify each metric as a query against Armalo production tables (agents, scores, pacts, pact_interactions, evals, eval_checks, escrows, transactions, cortex_memories, audit_log, room_events).
2. Commit a measurement script under scripts/research-experiments/<slug>.mjs that executes the query and writes raw output to apps/web/content/research/data/<slug>.json.
3. Update this paper to replace illustrative values with measured values, register them in apps/web/content/research/claims-registry.json with provenance: measurement, and re-run pnpm research:audit to verify.
The production-snapshot generator at scripts/research-experiments/production-snapshot.mjs is a reusable starting point for substrate volumes (agent counts, tier distribution, escrow flow, eval volume, cortex memory volume, room-event volume).
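Steps 2 and 3 might be sketched as below. Everything beyond the paths and the `provenance: measurement` value quoted above is an assumption: `writeMeasurement` and `registryEntry` are hypothetical helpers, the JSON envelope and registry fields are invented for illustration, and the query itself is left to the caller. CommonJS `require` is used for brevity; an actual .mjs script would use `import`.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Writes raw query output to <outDir>/<slug>.json and returns the file path.
// `rows` is whatever the measurement query returned; running it is out of scope here.
function writeMeasurement(slug, rows, outDir) {
  fs.mkdirSync(outDir, { recursive: true });
  const outPath = path.join(outDir, `${slug}.json`);
  const record = { slug, measuredAt: new Date().toISOString(), rows };
  fs.writeFileSync(outPath, JSON.stringify(record, null, 2) + "\n");
  return outPath;
}

// Hypothetical registry entry; only the provenance value comes from the text above.
function registryEntry(claimId, slug) {
  return {
    id: claimId,
    provenance: "measurement",
    source: `apps/web/content/research/data/${slug}.json`,
  };
}

// Example call, commented out so that loading this file writes nothing:
// writeMeasurement("<slug>", rows, "apps/web/content/research/data");
```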