Armalo Build on SWE-bench Verified — preview

Q: Where is this research published?

Armalo Labs Technical Series — https://www.armalo.ai/labs/research/2026-05-12-swe-bench-trust-receipts-preview. The paper is publicly available and citable.

Armalo Labs Research Team

Abstract

A preview of Armalo Build's SWE-bench Verified methodology, including the governed and SWE-tuned configurations and the signed trust receipt artifact that will accompany each evaluated patch.

This page is a preview for the SWE-bench Verified writeup. Real numbers and the per-repo breakdown land here once the full 500-task run completes. Tracking dashboard: /dashboard/admin/swe-bench.

The pipeline that will produce the published number is live in [packages/swe-bench-eval](https://github.com/fongryan/armalo). Until the run finishes, this preview describes the methodology and the trust-receipt artifact that ships with every patch — pass or fail.

What we measure

princeton-nlp/SWE-bench_Verified: 500 human-validated GitHub issues from twelve Python repos (Django, sympy, Sphinx, scikit-learn, seaborn, requests, pylint, pytest, astropy, xarray, matplotlib, Flask).
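A per-repo breakdown can be derived from instance IDs alone. The sketch below assumes the SWE-bench naming convention `<owner>__<repo>-<number>` for instance IDs. The `resolved` mapping (instance ID to pass/fail) and both function names are hypothetical, since the pipeline's actual output format is not described in this preview.

```python
from collections import defaultdict


def repo_of(instance_id: str) -> str:
    """Map an instance ID such as 'owner__name-123' to 'owner/name'."""
    # Split on the LAST hyphen so repo names containing hyphens survive.
    slug, _, _number = instance_id.rpartition("-")
    return slug.replace("__", "/", 1)


def per_repo_breakdown(resolved: dict[str, bool]) -> dict[str, tuple[int, int]]:
    """Return {repo: (resolved_count, total_count)} for a finished run.

    `resolved` is a hypothetical input shape: instance ID -> whether the
    patch passed evaluation.
    """
    counts = defaultdict(lambda: [0, 0])
    for instance_id, ok in resolved.items():
        bucket = counts[repo_of(instance_id)]
        bucket[0] += int(ok)
        bucket[1] += 1
    return {repo: (hits, total) for repo, (hits, total) in sorted(counts.items())}
```

Reporting the raw `(resolved, total)` pair per repo, rather than only a percentage, keeps small repos from looking more decisive than their task count warrants.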

The agent gets problem_statement and hints_text from the original issue. It never sees FAIL_TO_PASS / PASS_TO_PASS test identifiers. The patch is evaluated inside the official swebench/sweb.eval.x86_64.* Docker image used by the public leaderboard.
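The field restriction above can be sketched as an allowlist over a dataset record. This is a minimal illustration, not Armalo Build's actual prompt construction, which this preview does not describe. The field names are the dataset's own; `AGENT_VISIBLE_FIELDS` and `agent_view` are hypothetical names, and whether the agent also receives other fields (for example the repository checkout details) is not stated here.

```python
# Only these fields are copied into the agent's input (per the text above).
AGENT_VISIBLE_FIELDS = ("problem_statement", "hints_text")


def agent_view(task: dict) -> dict:
    """Return a copy of `task` containing only allowlisted fields.

    Everything else, including FAIL_TO_PASS and PASS_TO_PASS, is dropped
    because it is never copied in the first place.
    """
    return {key: task.get(key, "") for key in AGENT_VISIBLE_FIELDS}
```

An allowlist is used here instead of deleting the hidden fields by name: if the dataset ever gains a new test-related column, it stays hidden by default.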

Two configurations, both published

Configuration	What it is

Cite this work

Armalo Labs Research Team (2026). Armalo Build on SWE-bench Verified — preview. Armalo Labs Technical Series, Armalo AI. https://www.armalo.ai/labs/research/2026-05-12-swe-bench-trust-receipts-preview

Armalo Labs Technical Series · ISSN pending

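Verifying a trust receipt through the external verification endpoint (replace the /* paste VC */ placeholder with the receipt JSON before sending, since JSON does not allow comments):

curl -X POST https://armalo.ai/api/v1/trust/build-receipts/verify-external \
  -H 'content-type: application/json' \
  -d '{ "receipt": { /* paste VC */ } }'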
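The same call can be sketched in Python with only the standard library. The endpoint URL, header, and `{"receipt": ...}` body shape come from the curl command above. The assumption that the response body is JSON, the timeout value, and the function names are illustrative and not confirmed by this preview.

```python
import json
import urllib.request

VERIFY_URL = "https://armalo.ai/api/v1/trust/build-receipts/verify-external"


def build_verify_request(receipt: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request for one receipt."""
    body = json.dumps({"receipt": receipt}).encode("utf-8")
    return urllib.request.Request(
        VERIFY_URL,
        data=body,
        headers={"content-type": "application/json"},
        method="POST",
    )


def verify_receipt(receipt: dict, timeout: float = 30.0) -> dict:
    """Send the request and return the decoded response unchanged.

    Assumes the endpoint answers with a JSON body; the response schema
    is not documented here, so no fields are interpreted.
    """
    with urllib.request.urlopen(build_verify_request(receipt), timeout=timeout) as resp:
        return json.load(resp)
```

Separating request construction from sending makes the payload easy to inspect or log before anything leaves the machine.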

Running the evaluation pipeline from the repository:

git clone https://github.com/fongryan/armalo
cd armalo/packages/swe-bench-eval
pnpm install
pnpm full default       # 500 tasks, default config
pnpm full swe-tuned     # 500 tasks, SWE-tuned config
