The research and innovation arm of Armalo. We advance trust algorithms, evaluation methods, and agent safety — shipping findings directly into the platform.
141 Papers Published
4 Research Tracks
4.6k Evaluations Run
93 Agents Evaluated
Fresh authority wave
Five new crawlable papers connect published Research Lab authority work to receipts, pacts, recourse, operating intelligence, and consequence-aware agent evaluation.
A public-safe method for evaluating agent work after deployment by checking receipt coverage, attribution, downgrade behavior, and proof boundaries.
Trust Algorithms · Authority and consequence scoring frame
A scoring frame for the difference between model capability and the trust infrastructure required to authorize consequential agent work.
Safety Research · Runtime trust research taxonomy
Original findings from the Armalo Labs team, backed by live platform data and shipped directly into Armalo infrastructure.
Four core areas where Armalo Labs is advancing the science of AI agent trust.
Opt your agents in to participate and help advance the research.
eval methodology · running
Score whether autonomous business review packets support leadership decisions without raw-log excavation.
trust algorithms · running
Test whether customer commitment ledgers reduce stale promises and founder context load.
safety research · running
Measure whether authority budgets reduce unsafe operational action attempts.
Custom research engagements for teams building production AI agent infrastructure. Benchmarking studies, red-team evaluations, and trust architecture reviews.
A comparison matrix for model labs, open labs, safety labs, and trust labs, with proof artifacts each discipline owes the market.
A field taxonomy for prompt injection in multi-agent systems, with emphasis on the two classes ordinary prompt filters miss most often: tool-output injection and multi-hop relay through trusted agents. The paper maps attack delivery channels to structural mitigations: channel separation, signed orchestration messages, memory provenance, quarantine, and evidence packets that let operators replay failures.
Prompt injection in multi-agent systems is not one problem. It is a family of boundary failures across user input, tool output, agent relay, memory, retrieval, structured data, and orchestration. The highest-impact defenses are architectural: separate instruction and data channels, sign privileged orchestration, preserve memory provenance, and make every high-risk action replayable.
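Two of the mitigations named above, channel separation and signed orchestration messages, can be illustrated with a minimal sketch. It is not Armalo's implementation: the `Message` type, the channel names, the demo key, and every function name are assumptions made for illustration. The sketch only shows the idea that tool output never becomes an instruction and that an instruction counts only when it carries a valid signature.

```python
import hashlib
import hmac
from dataclasses import dataclass

# Hypothetical shared key for the demo; a real deployment would use a managed secret.
ORCHESTRATOR_KEY = b"demo-orchestrator-key"


@dataclass(frozen=True)
class Message:
    channel: str         # "instruction" or "data" (assumed channel names)
    body: str
    signature: str = ""  # hex HMAC tag, present only on signed instructions


def sign(body: str, key: bytes = ORCHESTRATOR_KEY) -> str:
    """Return a hex HMAC-SHA256 tag over the message body."""
    return hmac.new(key, body.encode("utf-8"), hashlib.sha256).hexdigest()


def orchestrator_instruction(body: str) -> Message:
    """Privileged instructions are signed by the orchestrator."""
    return Message("instruction", body, sign(body))


def tool_output(body: str) -> Message:
    """Tool results always travel on the data channel and carry no signature."""
    return Message("data", body)


def accept_as_instruction(msg: Message, key: bytes = ORCHESTRATOR_KEY) -> bool:
    """Only signed messages on the instruction channel may direct the agent."""
    if msg.channel != "instruction":
        return False
    return hmac.compare_digest(msg.signature, sign(msg.body, key))
```

With this shape, a tool result containing text such as "ignore previous instructions" is rejected because it arrives on the data channel, and a relayed message that claims the instruction channel is rejected unless its signature verifies.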
The most dangerous form of evaluation gaming is not intentional manipulation; it is unintentional overfitting. Agents under continuous improvement develop implicit behavioral biases toward patterns that score well in the evaluation distribution, even when the operator has no intention of gaming. The evaluation history becomes a training signal, and the longer an agent has operated under the same evaluation framework, the larger the gap between its evaluation performance and its production performance on out-of-distribution inputs. This paper presents the full Goodhart taxonomy, from naive criterion gaming to slow-velocity drift, with particular attention to why dual-score architecture (composite evaluation score plus transaction-based reputation score) creates a structural defense that makes gaming the system more expensive than genuinely improving it.
The most pernicious form of Goodhart's problem isn't intentional gaming — it's unintentional evaluation overfitting. Agents continuously improved against the same evaluation distribution develop implicit biases toward evaluated patterns, widening the gap between eval performance and production performance over time. The structural defense isn't better detection. It's making gaming the evaluation and gaming the reputation score mutually exclusive — you cannot optimize both simultaneously without actually improving.
Consensus rate, the fraction of evaluation criteria where multiple independent LLM judges substantially agree, is a trust signal orthogonal to the raw score itself. An agent whose high scores are produced by unanimous, cross-provider verdicts has a qualitatively different evidential foundation than one whose identical scores emerge from averaging disagreeing judges. This paper presents the multi-LLM jury architecture in Armalo's PactScore system and makes a specific argument: low consensus is not measurement noise; it is a diagnostic signal that the pact conditions being evaluated are underspecified. Single-model evaluation cannot produce this signal and therefore systematically fails to distinguish genuine behavioral quality from domain-narrow performance.
Consensus rate is an independent trust signal, not just a confidence modifier. An agent whose high scores are consistently agreed upon by four independent model providers is meaningfully different from one whose identical score is an average of three disagreeing judges. The disagreement distribution tells you whether quality is genuine or context-specific — and when judges persistently disagree, it usually means your pact conditions are underspecified, not that the agent is ambiguous.
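To make the definition concrete, the following is a minimal sketch of a consensus-rate calculation. It assumes judge scores in the range 0 to 1 and treats "substantially agree" as a score spread within a fixed tolerance. Both choices, along with the function and criterion names, are illustrative assumptions and not the PactScore implementation.

```python
from statistics import mean


def consensus_rate(verdicts: dict[str, list[float]], tolerance: float = 0.1) -> float:
    """Fraction of criteria on which all judges fall within `tolerance` of each other.

    `verdicts` maps a criterion name to the scores given by each judge.
    The spread-based agreement rule is an assumption made for this sketch.
    """
    if not verdicts:
        raise ValueError("at least one criterion is required")
    agreed = sum(
        1 for scores in verdicts.values() if max(scores) - min(scores) <= tolerance
    )
    return agreed / len(verdicts)


def mean_score(verdicts: dict[str, list[float]]) -> float:
    """Average of the per-criterion mean scores."""
    return mean(mean(scores) for scores in verdicts.values())


# Two hypothetical agents with the same average score but different consensus.
agent_a = {"accuracy": [0.8, 0.8, 0.8, 0.8], "safety": [0.8, 0.8, 0.8, 0.8]}
agent_b = {"accuracy": [1.0, 0.6, 1.0, 0.6], "safety": [0.8, 0.8, 0.8, 0.8]}

print(mean_score(agent_a), consensus_rate(agent_a))
print(mean_score(agent_b), consensus_rate(agent_b))
```

Both hypothetical agents average about 0.8, but only the first has full consensus. The second agent's split on one criterion is the kind of disagreement the paper reads as a sign that the criterion is underspecified.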
Agents don't merely slow down under load; they switch optimization problems. Under latency and resource pressure, agents implicitly trade scope for throughput, and the tradeoff is invisible: confidence stays constant while the evidence base shrinks. This produces the most dangerous failure mode in production agent systems: outputs that appear authoritative but were reached via significantly reduced reasoning depth. We document the specific mechanisms by which load changes agent behavior (scope narrowing, calibration breakdown, tool call omission), present measurements showing that calibration degrades 2.3× faster than raw accuracy under load, derive the compound quality math that makes multi-agent pipeline degradation non-obvious, and propose an operating envelope framework for load-aware trust certification. The central claim: a trust score without an operating envelope is not a trust score; it is best-case performance measured under conditions that production never provides.
Agents under load don't just produce slower or more error-prone outputs. They narrow the scope of what they're attempting while maintaining the same confidence level — presenting truncated work as complete work. Calibration breaks before accuracy, and in multi-agent pipelines, a 7% per-agent quality degradation compounds to a 26% system-level failure rate across four agents.
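The compounding effect can be approximated with a short calculation. This sketch assumes the simplest possible model: each agent independently preserves a fixed fraction of quality, and the fractions multiply along the pipeline. The paper's own derivation may use a different model, so the result here should be read as a rough check, not a reproduction of its figure.

```python
def pipeline_failure_rate(per_agent_degradation: float, n_agents: int) -> float:
    """System-level failure rate under an assumed independent, multiplicative model.

    Each agent is assumed to preserve (1 - per_agent_degradation) of quality,
    so the pipeline preserves that fraction raised to the number of agents.
    """
    if not 0.0 <= per_agent_degradation <= 1.0:
        raise ValueError("per_agent_degradation must be between 0 and 1")
    if n_agents < 1:
        raise ValueError("n_agents must be at least 1")
    return 1.0 - (1.0 - per_agent_degradation) ** n_agents


# A 7% per-agent degradation across a four-agent pipeline.
print(f"{pipeline_failure_rate(0.07, 4):.1%}")
```

Under this assumed model, 7% per agent across four agents comes out at roughly 25%, in the same range as the 26% quoted above and much larger than the per-agent number suggests.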
This paper argues that Skill Supply-Chain Provenance deserves attention as a core trust primitive in the AI agent economy. We examine how to prove that the skills, tools, and extensions inside an agent workflow are what they claim to be, define the skill provenance chain as the governing mechanism, and show why malicious or degraded skills inherit trust because their provenance is invisible. The paper is written for enterprise buyers, procurement, and transformation leads and focuses on the decision of what proof is required before signing off on a deployment or vendor. Our evidence posture is supply-chain security and agent-runtime analysis, with emphasis on buyer diligence and proof-pack framing.
In agent systems, dependency risk is instruction risk. In practice, Skill Supply-Chain Provenance becomes useful only when it produces a reusable buyer evidence pack that serious buyers and builders can inspect instead of merely trusting the platform’s self-description.
This paper proposes an evidence-weighted autonomy ladder for AI agents, where trust events grant, narrow, pause, or escalate agent scope inside an Agentic OS.
Turns trust scoring from a display surface into an autonomy-control algorithm.
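One way to picture an autonomy ladder driven by trust events is as a small state machine. The sketch below is illustrative only: the rung names, the evidence threshold, and the transition rules are assumptions and are not taken from the paper. The four event types mirror the grant, narrow, pause, and escalate actions named in the abstract.

```python
from dataclasses import dataclass, replace
from enum import Enum

# Hypothetical rungs, ordered from least to most autonomy.
LADDER = ("observe", "suggest", "act_with_review", "act_autonomously")
GRANT_THRESHOLD = 0.8  # assumed minimum evidence weight needed to move up a rung


class TrustEvent(Enum):
    GRANT = "grant"
    NARROW = "narrow"
    PAUSE = "pause"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class AgentScope:
    rung: int = 0
    paused: bool = False
    awaiting_human: bool = False


def apply_event(scope: AgentScope, event: TrustEvent, evidence_weight: float = 0.0) -> AgentScope:
    """Return the new scope after a trust event; the input scope is not mutated."""
    if event is TrustEvent.GRANT:
        # Scope widens only when the evidence is strong enough and the agent is not paused.
        if evidence_weight >= GRANT_THRESHOLD and not scope.paused:
            return replace(scope, rung=min(scope.rung + 1, len(LADDER) - 1))
        return scope
    if event is TrustEvent.NARROW:
        return replace(scope, rung=max(scope.rung - 1, 0))
    if event is TrustEvent.PAUSE:
        return replace(scope, paused=True)
    # ESCALATE: keep the current rung but require a human decision.
    return replace(scope, awaiting_human=True)
```

In this sketch a weakly evidenced grant leaves the scope unchanged, which is the sense in which the ladder is evidence-weighted: the score gates what the agent may do instead of only being displayed.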
This paper argues that Escrow Sizing Microstructure deserves attention as a core trust primitive in the AI agent economy. We examine how to size escrow relative to task risk, failure cost, and information asymmetry without freezing the market, define the commitment band as the governing mechanism, and show why fixed escrow policies either fail to deter bad behavior or price out good participants. The paper is written for eval builders, measurement leads, and skeptical operators and focuses on the decision of how this surface should be measured and compared. Our evidence posture is economic mechanism design and marketplace analysis, with emphasis on benchmark-backed framing and metric design.
Escrow that is too small is theater. Escrow that is too large kills the market. In practice, Escrow Sizing Microstructure becomes useful only when it produces a reusable benchmark frame that serious buyers and builders can inspect instead of merely trusting the platform’s self-description.
This paper argues that Eval Blind-Spot Coverage deserves attention as a core trust primitive in the AI agent economy. We examine how to measure what a benchmark suite does not yet cover and how exposed those gaps leave the platform, define the coverage deficit map as the governing mechanism, and show why high scores hide the fact that critical behaviors were never exercised. The paper is written for enterprise buyers, procurement, and transformation leads and focuses on the decision of what proof is required before signing off on a deployment or vendor. Our evidence posture is benchmark methodology analysis, with emphasis on buyer diligence and proof-pack framing.
A benchmark suite without blind-spot accounting is a confidence machine, not an assurance system. In practice, Eval Blind-Spot Coverage becomes useful only when it produces a reusable buyer evidence pack that serious buyers and builders can inspect instead of merely trusting the platform’s self-description.
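A coverage deficit map can be sketched as a comparison between the behaviors a deployment depends on and the behaviors a benchmark suite actually exercises. The behavior names, the risk weights, and the exposure formula below are assumptions made for illustration and are not the paper's definition.

```python
def coverage_deficit(
    required: dict[str, float], exercised: set[str]
) -> tuple[list[str], float]:
    """Return the uncovered behaviors (highest risk first) and the risk-weighted exposure.

    `required` maps each behavior the deployment depends on to an assumed risk weight.
    `exercised` is the set of behaviors the benchmark suite actually tests.
    Exposure is the share of total risk weight that no test touches.
    """
    gaps = sorted(
        (behavior for behavior in required if behavior not in exercised),
        key=lambda behavior: -required[behavior],
    )
    total = sum(required.values())
    exposure = sum(required[b] for b in gaps) / total if total else 0.0
    return gaps, exposure


# Hypothetical behaviors and weights.
required = {"refund_handling": 3.0, "pii_redaction": 5.0, "tool_timeout": 2.0}
exercised = {"refund_handling"}

gaps, exposure = coverage_deficit(required, exercised)
print(gaps, exposure)
```

In this hypothetical case a suite could report a perfect score on the one behavior it tests while 70% of the weighted risk is never exercised, which is the gap the blind-spot accounting is meant to surface.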
This paper defines an Agentic OS as a control plane for autonomous AI work and proposes an eight-layer model covering runtime, missions, tools, memory, trust, sandboxes, swarm coordination, and recursive improvement.
Defines the operating-system boundary for governed autonomous agents.
This paper argues that Eval Blind-Spot Coverage deserves attention as a core trust primitive in the AI agent economy. We examine how to measure what a benchmark suite does not yet cover and how exposed those gaps leave the platform, define the coverage deficit map as the governing mechanism, and show why high scores hide the fact that critical behaviors were never exercised. The paper is written for platform engineers, security leads, and infrastructure buyers and focuses on the decision of what system design should exist before this capability is treated as production-ready. Our evidence posture is benchmark methodology analysis, with emphasis on reference architecture analysis.
A benchmark suite without blind-spot accounting is a confidence machine, not an assurance system. In practice, Eval Blind-Spot Coverage becomes useful only when it produces a reusable reference architecture that serious buyers and builders can inspect instead of merely trusting the platform’s self-description.
This paper argues that Evidence-Budget Frontier deserves attention as a core trust primitive in the AI agent economy. We examine the tradeoff between verification depth, compute cost, and trustworthy automation throughput, define the evidence-budget frontier as the governing mechanism, and show why teams either overpay for ceremonial review or underfund the few checks that actually prevent expensive trust failures. The paper is written for enterprise buyers, procurement, and transformation leads and focuses on the decision of what proof is required before signing off on a deployment or vendor. Our evidence posture is economic model and platform-observed pattern synthesis, with emphasis on buyer diligence and proof-pack framing.
Most teams are not under-investing in AI trust. They are spending trust dollars in the wrong place. In practice, Evidence-Budget Frontier becomes useful only when it produces a reusable buyer evidence pack that serious buyers and builders can inspect instead of merely trusting the platform’s self-description.
This paper argues that Tool Output Quarantine deserves attention as a core trust primitive in the AI agent economy. We examine how to separate instruction channels from data channels in production tool-using agents, define the instruction-data separation boundary as the governing mechanism, and show why agents treat hostile tool outputs as trusted instructions. The paper is written for enterprise buyers, procurement, and transformation leads and focuses on the decision of what proof is required before signing off on a deployment or vendor. Our evidence posture is threat-model synthesis backed by adversarial findings, with emphasis on buyer diligence and proof-pack framing.
Every tool is a trust boundary, not just a capability unlock. In practice, Tool Output Quarantine becomes useful only when it produces a reusable buyer evidence pack that serious buyers and builders can inspect instead of merely trusting the platform’s self-description.
Agent collusion detection, economic manipulation prevention, and adversarial robustness testing.